CN115577109A - Text classification method and device, electronic equipment and storage medium - Google Patents

Text classification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115577109A
CN115577109A
Authority
CN
China
Prior art keywords
comment
vector
word
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211321452.0A
Other languages
Chinese (zh)
Inventor
王兆麒
姜珊
孙忠刚
张晓谦
王兆麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Group Corp
Original Assignee
FAW Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FAW Group Corp filed Critical FAW Group Corp
Priority to CN202211321452.0A priority Critical patent/CN115577109A/en
Publication of CN115577109A publication Critical patent/CN115577109A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and device, an electronic device, and a storage medium. The method comprises the following steps: obtaining comment information to be classified, and determining the sentences to be used corresponding to the comment information to be classified; determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence; processing the vectors to be processed based on a predetermined registration corpus to obtain the text vector to be used corresponding to the comment information to be classified; and processing the text vector to be used based on a pre-trained target text classification model to obtain a target classification result. The method improves the accuracy of text classification in a set field, thereby improving the overall classification effect.

Description

Text classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer processing technologies, and in particular, to a text classification method and apparatus, an electronic device, and a storage medium.
Background
At present, automatic text classification is widely applied across the Internet, for example to web page classification, sentiment analysis, and comment mining. Although text classification has achieved high accuracy in general domains, classification in a set field, such as the automotive, medical, legal, or scientific field, still has certain drawbacks.
In the prior art, text classification mainly performs semantic recognition on the words in the text and classifies the text to be classified according to the determined semantics. However, classification requirements differ between fields, and semantic-recognition-based methods suffer from inaccurate classification and low classification efficiency; in particular, text classification applied to a set field often fails to achieve a good classification effect.
Disclosure of Invention
The invention provides a text classification method, a text classification device, electronic equipment and a storage medium, which are used for improving the text classification accuracy in the set field and further achieving the technical effect of improving the text classification effect.
According to an aspect of the present invention, there is provided a text classification method, including:
obtaining comment information to be classified, and determining sentences to be used corresponding to the comment information to be classified;
determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence;
processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified;
and processing the text vector to be used based on a target text classification model obtained by pre-training to obtain a target classification result.
According to another aspect of the present invention, there is provided a text classification apparatus including:
the sentence to be used determining module is used for acquiring comment information to be classified and determining sentences to be used corresponding to the comment information to be classified;
the to-be-processed vector determining module is used for determining to-be-processed vectors corresponding to all to-be-used words in the to-be-used sentences based on the comment attributes corresponding to the to-be-used sentences;
the to-be-used text vector determining module is used for processing the to-be-processed vector based on a predetermined registration corpus to obtain a to-be-used text vector corresponding to the to-be-classified comment information;
and the target classification result determining module is used for processing the text vector to be used based on a target text classification model obtained through pre-training to obtain a target classification result.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of text classification according to any of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the text classification method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, comment information to be classified is obtained, and the sentences to be used corresponding to the comment information to be classified are determined; a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence is determined based on the comment attribute corresponding to the to-be-used sentence; the vectors to be processed are processed based on a predetermined registration corpus to obtain the text vector to be used corresponding to the comment information to be classified; and the text vector to be used is processed based on a pre-trained target text classification model to obtain a target classification result. This solves the problems of low text classification accuracy and poor effect caused by purely semantic recognition in the prior art: by introducing the comment attribute of each comment sentence and the words of the registration corpus into the feature vector of the comment information to be classified, the accuracy of text classification in a set field is improved, achieving the technical effect of improving the overall text classification effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a text classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a text classification method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a text classification method according to a third embodiment of the present invention;
FIG. 4 is a diagram illustrating a text classification method according to a fourth embodiment of the present invention;
FIG. 5 is a diagram illustrating a method for characterizing text vectorization according to a fourth embodiment of the present invention;
FIG. 6 is a diagram illustrating a text classification method according to a fourth embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text classification apparatus according to a fifth embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing the text classification method according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a text classification method according to an embodiment of the present invention, where the embodiment is applicable to a text classification situation, and the method may be implemented by a text classification device, where the text classification device may be implemented in a form of hardware and/or software, and the text classification device may be configured in a computing device. As shown in fig. 1, the method includes:
s110, obtaining comment information to be classified, and determining sentences to be used corresponding to the comment information to be classified.
The comment information to be classified may be any text information whose comment category needs to be determined; for example, it may be evaluation information about product performance (such as availability, reliability, or security), or suggestion or demand information from a user about a product. The number of sentences to be used may be one, two, or more, depending on the text contained in the comment information to be classified.
In this embodiment, when it is detected that a user has triggered the control for posting comment information, the comment information to be classified is considered received; alternatively, when the user's comment information is retrieved from a preset location through an interface, the server is considered to have acquired the comment information to be classified. The comment information to be classified can then be divided into the plurality of sentences to be used that it contains, so that the text classification result of the comment information is determined on the basis of those sentences.
It should be noted that, in practical applications, most of the text in product reviews is unstructured data and may contain noisy or redundant information (such as misspelled words or stray space characters), which can harm the classification effect. To improve the text classification effect, when determining the sentences to be used corresponding to the comment information to be classified, the comment information may first be divided into sentences to obtain the sentences to be corrected; the sentences to be corrected are then corrected to obtain the sentences to be filtered; and word filtering is applied to the sentences to be filtered to obtain the sentences to be used.
In this embodiment, the text information in the comment information to be classified may be divided into sentences based on separators (such as punctuation marks and spaces), semantics, part of speech, and the like, yielding at least one sentence to be corrected. The sentence to be corrected can then be corrected with regular expressions, for example by fixing misspelled words, contractions, and repeated characters, producing the sentence to be filtered. Stop words can then be removed from the sentence to be filtered; stop words are preset words (such as "the", "I", and "you") whose contribution to classification is low. The sentence obtained after removing the stop words serves as the sentence to be used.
Illustratively, after the comment information of a user (i.e., the comment information to be classified) is obtained, it is divided into a plurality of different sentences (i.e., sentences to be corrected). Regular expressions are then used to correct common misspellings, contractions, repeated characters, and the like in the sentences to be corrected, for example: "U → you", "cuz → because", "& → and", "Plz → please", "sooo → so", and "thx → thank", where the word before the arrow is the uncorrected form and the word after the arrow is the corrected form; this yields the sentences to be filtered corresponding to the sentences to be corrected. Further, a stop-word table may be used to delete the words considered common, i.e., the stop words (e.g., "the", "I", "you"), from the sentences to be filtered. It should be noted that when deleting stop words, modal verbs expressing attitude, such as "can" and "should", are retained, because they can improve classification accuracy: in "You should add a button" and "This function can't work", "should" clearly expresses a need and "can't" clearly expresses an inability, and both are useful signals for text classification. In this technical scheme, the text reduction strategy reduces the number of features (words) that the subsequent classifier must process and deletes information likely to have a negative influence on the classifier's predictive ability, improving classification accuracy as well as classification efficiency.
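The preprocessing pipeline described above can be sketched as follows. The correction table and stop-word list here are illustrative stand-ins, since the patent does not publish its actual tables; the modal verbs "can", "should", and "can't" are deliberately retained, as the text prescribes.

```python
import re

# Hypothetical correction table mirroring the patent's examples
# ("U -> you", "cuz -> because", "& -> and", ...); the real table
# and stop-word list are not given in the document.
CORRECTIONS = {r"\bU\b": "you", r"\bcuz\b": "because", r"&": "and",
               r"\bPlz\b": "please", r"so+\b": "so", r"\bthx\b": "thank"}
STOP_WORDS = {"the", "i", "you", "a", "an", "is", "are"}
KEPT_MODALS = {"can", "can't", "should", "must"}  # retained per the patent

def preprocess(comment: str) -> list[list[str]]:
    """Split a comment into sentences, apply regex corrections,
    and drop stop words while keeping attitude-bearing modal verbs."""
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    result = []
    for sent in sentences:
        for pattern, repl in CORRECTIONS.items():
            sent = re.sub(pattern, repl, sent)
        words = [w for w in sent.split()
                 if w.lower() not in STOP_WORDS or w.lower() in KEPT_MODALS]
        result.append(words)
    return result
```

A comment such as "U should add a button cuz it helps." thus becomes a single sentence to be used in which "you" and "a" are filtered out but "should" survives.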
S120, determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence.
The comment attribute can be used to represent characteristic information of the comment, such as sentiment information (e.g., like or indifferent) or a comment rating.
It should be noted that, in practical applications, when a user issues a comment or writes an application comment, some words may be used to express his own feelings, for example, there may be a positive expression or a negative expression.
In this embodiment, the comment attribute of the sentence to be used may be determined by analyzing the semantics of the sentence, and a feature value corresponding to the comment attribute may then be determined; different comment attributes may map to different feature values, for example 1 for "like" and 2 for "indifferent". Accordingly, the feature value corresponding to the comment attribute may be appended to the feature-vector representation of each word to be used, giving the vector information of that word as its vector to be processed. Specifically, the comment attribute includes an emotion attribute, which may for example be positive, negative, or neutral. Based on the comment attribute corresponding to the sentence to be used, the vector to be processed for each word to be used in the sentence may be determined as follows: perform sentiment analysis on the sentence to be used and determine its emotion attribute; determine the reverse file attribute value corresponding to each word to be used in the sentence; and, for each word to be used, determine the vector to be processed for the current word based on that word's reverse file attribute value and the emotion attribute.
The reverse file attribute value may be used to characterize the importance of each word to be used within the whole text; for example, it may be the IDF (Inverse Document Frequency). The vector to be processed is determined in the same way for every word to be used, so any word to be used can serve as the current word for the purpose of explanation.
In this embodiment, an emotion analysis model may be used to perform sentiment analysis on the sentence to be used, obtaining the emotion attribute corresponding to the sentence. For example, the emotion analysis model may be VADER (Valence Aware Dictionary and sEntiment Reasoner), which uses grammatical and syntactic cues to identify emotional intensity in user comments, combined with intensity-modifying features (e.g., negations, contractions, conjunctions, intensifiers, degree adverbs, capitalization, and punctuation), to compute a sentiment score for the input text. Alternatively, the emotion analysis model may be naive Bayes, an LSTM (Long Short-Term Memory) model, or the like.
Text vectorization may also be performed on the sentence to be used to obtain the feature vectors corresponding to the sentence. Optionally, the feature vectors may be determined as follows: TF-IDF (term frequency-inverse document frequency) is used to evaluate the importance of a word in the sentence to be used with respect to the whole comment information to be classified or the whole sentence; for example, if a word occurs very rarely, it can be considered to have good discriminative power for classification, so its importance is higher. Accordingly, the IDF corresponding to each word to be used in the sentence, i.e., the reverse file attribute value, is obtained. The reverse file attribute value can serve as the feature vector of a word to be used; the final text vector of the word can then be determined from its feature vector and the corresponding emotion attribute, for example by splicing the feature vector with the feature value of the emotion attribute to obtain a spliced vector as the vector to be processed for that word.
Illustratively, the emotion of a user toward a certain product, event, or point of view can be mined through sentiment analysis, and each sentence in the user's comment information can be assigned to a corresponding emotion category, such as positive, negative, or neutral. For example, consider sentence 1 to be used: "The low value of the car makes me sad." and sentence 2 to be used: "I'm so excited about the power in this car, it's going to go to heaven!". Sentence 1 uses the emotion word "sad" to express a negative feeling about an aspect of the product design, while sentence 2 expresses a positive emotion through the adjective "excited". To identify emotions in comment text, VADER uses grammatical and syntactic cues to gauge emotional intensity in user comments, such as punctuation (e.g., the number of exclamation marks), capitalization (e.g., "I HATE THIS GAME" is stronger than "I hate this game"), degree modifiers (e.g., "The new sync feature is extremely good" is considered stronger than "The new sync feature is good"), the contrastive conjunction "but", which may change polarity, and negations (e.g., "The app isn't really all that great"), which are detected by examining the word combinations preceding a sentiment-laden word. VADER outputs emotion values (i.e., feature values) for the negative, positive, and neutral categories together with a compound score between -1 (very negative) and 1 (very positive); together, these scores reflect the overall emotional state. Accordingly, each sentence to be used can be determined to be positive, neutral, or negative, yielding its emotion classification. The TF-IDF value (i.e., the reverse file attribute value) of each word to be used in the sentence can also be calculated with TF-IDF as a text data vector.
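VADER itself is a published lexicon-and-rule tool; as a hedged illustration of the kind of scoring described above (lexicon valences adjusted by negators and degree boosters, clamped to the range [-1, 1]), a toy scorer might look like the following. The mini-lexicon and adjustment factors are invented for illustration and are not VADER's real values or its real algorithm.

```python
# Hypothetical mini-lexicon; VADER's actual lexicon has thousands of entries.
LEXICON = {"sad": -0.5, "excited": 0.6, "great": 0.6, "hate": -0.6}
NEGATORS = {"not", "isn't", "can't", "never"}
BOOSTERS = {"so": 0.3, "extremely": 0.3, "very": 0.2}

def sentiment_score(words):
    """Toy valence scorer in the spirit of VADER: sum lexicon valences,
    flip and dampen after a negator, boost intensity after a degree
    adverb, and clamp the total to [-1, 1]."""
    score = 0.0
    for i, w in enumerate(words):
        w = w.lower()
        if w in LEXICON:
            valence = LEXICON[w]
            prev = words[i - 1].lower() if i > 0 else ""
            if prev in NEGATORS:
                valence = -0.5 * valence   # negation flips and dampens
            elif prev in BOOSTERS:
                valence += BOOSTERS[prev] * (1 if valence > 0 else -1)
            score += valence
    return max(-1.0, min(1.0, score))
```

On this toy scale, "so excited" scores more positively than "excited" alone, and "not great" flips to a mild negative, mirroring the booster and negation behaviors described in the text.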
And splicing the text data vector of the word to be used and the emotion value of the corresponding emotion category to obtain the vector to be processed of the word to be used.
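The TF-IDF computation and the splicing of each word's value with the sentence-level sentiment score might be sketched as below, treating each sentence as one document. This is an illustrative reading of the patent's "reverse file attribute value", not its exact formula.

```python
import math

def tfidf_vectors(sentences):
    """Compute a TF-IDF value per word, per sentence (one reading of
    the 'reverse file attribute value'), treating each sentence as a
    document."""
    n = len(sentences)
    df = {}                              # document frequency per word
    for sent in sentences:
        for w in set(sent):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for sent in sentences:
        vec = {}
        for w in sent:
            tf = sent.count(w) / len(sent)
            idf = math.log(n / df[w])    # 0 when the word is in every sentence
            vec[w] = tf * idf
        vectors.append(vec)
    return vectors

def splice(word_vec, sentiment):
    # Concatenate each word's TF-IDF value with the sentence-level
    # sentiment score, giving that word's 'vector to be processed'.
    return {w: (v, sentiment) for w, v in word_vec.items()}
```

Note that a word appearing in every sentence receives a TF-IDF of zero, matching the intuition in the text that rare words carry more discriminative power.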
S130, processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified.
The registration corpus is constructed in advance and comprises the related words of a set field together with the vector representations corresponding to those words; the related words correspond to the comment categories, and the set field corresponds to the application field of the comment information to be classified, for example the automotive field or the catering field.
In this embodiment, the matching degree between the vector to be processed and each word stored in the registration corpus may be calculated, and the vector to be processed whose matching degree is higher than the preset threshold may be extracted to generate the text vector to be used. For example, the matching degree of the vector a to be processed and the stored word B is 94%, and is greater than 90%, and the vector a to be processed can be used as a word vector required for subsequent classification.
It should be noted that, in order to ensure the comprehensiveness of the text data and improve the matching effect between the vector to be processed and each word in the registered corpus. The similarity between different words can be determined by utilizing the semantic relation between the words, and the higher the similarity is, the more similar the two words are represented. Optionally, processing the to-be-processed vector based on a predetermined registration corpus to obtain a to-be-used text vector corresponding to the to-be-classified comment information, including: determining a to-be-registered vector corresponding to each to-be-registered word in a registration corpus, and respectively determining the similarity between each to-be-registered vector and a to-be-processed vector; if the similarity is greater than a preset threshold value, marking the corresponding vector to be processed as a domain-related vocabulary vector; and generating a text vector to be used based on the domain-related vocabulary vector and the vector to be processed.
Wherein the similarity is used to represent the degree of similarity between two words. It should be noted that a word-vector generation technique may be used to generate the feature vector corresponding to each word to be registered, i.e., the vector to be registered, which is stored in the registration corpus; for example, text data may be converted into word vectors using the word2vec CBOW model. The domain-related vocabulary vectors are the vocabulary vectors in the comment information to be classified that are related to the set field, determined by matching the registration corpus of the set field against the vectors to be used.
In this embodiment, after the vectors to be processed corresponding to the words to be used are determined, the similarity between each vector to be processed and each vector to be registered may be calculated. Vectors to be processed whose similarity exceeds a preset threshold are taken as domain-related vocabulary vectors, the remaining vectors to be processed in the comment information to be classified are marked as comment vocabulary vectors, and the text vector to be used is generated from the domain-related vocabulary vectors and the comment vocabulary vectors. It should be noted that the similarity between a vector to be processed and a vector to be registered can be determined in various ways. For example, a path-similarity calculation can yield the semantic similarity, represented by the shortest path between the two words; or a cosine-similarity calculation can yield the cosine distance between the vectors, where a larger cosine value between two word vectors indicates higher semantic similarity and a smaller cosine value indicates lower semantic similarity. For example, all words to be registered in the registration corpus may be traversed, the similarity between the word vector of each word to be registered (i.e., the vector to be registered) and each vector to be processed computed, and each vector to be processed whose similarity exceeds a threshold (e.g., 0.8) added to the text vector to be used as a domain-related vocabulary vector.
By matching each word in the comment information to be classified against the registration corpus corresponding to the set field, the domain-related words in the comment information are found, so that subsequent classification is performed on the basis of those domain-related words, improving classification accuracy.
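The cosine-similarity matching against the registration corpus described above can be illustrated as follows. The 0.8 threshold follows the example in the text, while the vectors themselves are placeholders.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mark_domain_vectors(pending, registered, threshold=0.8):
    """Mark each to-be-processed vector whose best cosine similarity
    against any registered-corpus vector exceeds the threshold as a
    domain-related vocabulary vector."""
    marked = []
    for vec in pending:
        best = max((cosine(vec, reg) for reg in registered), default=0.0)
        marked.append((vec, best > threshold))
    return marked
```

Vectors that clear the threshold become domain-related vocabulary vectors; the rest remain ordinary comment vocabulary vectors, and both kinds feed into the text vector to be used.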
And S140, processing the text vector to be used based on a target text classification model obtained through pre-training to obtain a target classification result.
The target text classification model is a model which is trained in advance and used for text classification. The target classification result may be used to characterize a review category, such as performance evaluation, error feedback evaluation, demand feedback evaluation, user experience evaluation, and the like.
In this embodiment, the text vector to be used may serve as the input of the target text classification model, and the comment category it belongs to is output as the target classification result. Specifically, processing the text vector to be used with the pre-trained target text classification model to obtain the target classification result includes: obtaining the target word vector corresponding to each vector to be applied, based on the vectors to be applied in the text vector to be used and their corresponding weight values; splicing the target word vectors to obtain the target text vector corresponding to the text vector to be used; and inputting the target text vector into the target text classification model to obtain the target classification result.
The vector to be applied may be a domain-related vocabulary vector, or a vector to be processed other than the domain-related vocabulary vector.
In practical applications, the weight value corresponding to the domain-related vocabulary vectors may be set as a first weight value, and the weight value of the to-be-processed vectors other than the domain-related vocabulary vectors may be set as a second weight value. In order to increase the attention paid to the domain-related words and improve the classification effect, the first weight value may be greater than the second weight value. Each domain-related vocabulary vector in the text vector to be used is amplified according to the first weight value, and the remaining vocabulary vectors are amplified according to the second weight value. Correspondingly, the weighted vectors are obtained and used as the target word vectors. Furthermore, the target word vectors may be spliced to obtain a final text vector as the target text vector. The target text vector may be input into the target text classification model to obtain the target classification result corresponding to the comment information to be classified.
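The weighting and splicing step can be sketched as follows. The concrete weight values (`k1` for domain-related vectors, `k2` for the rest) and the toy vectors are assumptions chosen only to illustrate the amplification-then-concatenation idea.

```python
def weighted_text_vector(vectors, is_domain_related, k1=1.5, k2=1.0):
    """Scale each word vector by the first weight value when it is
    domain-related, otherwise by the second, then splice the results
    into one target text vector."""
    target = []
    for vec, related in zip(vectors, is_domain_related):
        w = k1 if related else k2
        target.extend(w * x for x in vec)
    return target

# One hypothetical domain-related word vector followed by an ordinary one.
vectors = [[0.2, 0.4], [1.0, 0.0]]
flags = [True, False]
target = weighted_text_vector(vectors, flags)
```

Because `k1 > k2`, the domain-related components are amplified relative to the rest before the classifier sees them.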
According to the technical scheme, the comment information to be classified is obtained, and the sentence to be used corresponding to the comment information to be classified is determined; the to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence is determined based on the comment attribute corresponding to the to-be-used sentence; the to-be-processed vectors are processed based on a predetermined registration corpus to obtain the to-be-used text vector corresponding to the comment information to be classified; and the to-be-used text vector is processed based on the target text classification model obtained through pre-training to obtain the target classification result. This solves the problems of low text classification accuracy and poor effect caused by semantic recognition in the prior art: by introducing the comment attribute corresponding to the comment sentence and the words in the registration corpus into the feature vector of the comment information to be classified, the text classification accuracy in the set field is improved, and the technical effect of improving the text classification effect is achieved.
Example two
Fig. 2 is a flowchart of a text classification method according to a second embodiment of the present invention, and based on the foregoing embodiment, a registration corpus can be predetermined. The technical scheme of the embodiment can be referred to for the specific implementation mode. The technical terms that are the same as or corresponding to the above-mentioned embodiments are not described in detail herein.
As shown in fig. 2, the method specifically includes the following steps:
s210, obtaining comment information to be used of at least one comment dimension, and determining comment sentences to be used corresponding to the comment information to be used.
It should be noted that, in different fields, the comment expressions of the users are different, the information that the users focus on is also different, and the classification results of the corresponding comment information are greatly different. The comment dimension is related to the attention information of the user in a set field, for example, in the automobile field, the comment dimension may be a performance dimension, an error feedback dimension, a demand dimension, a user experience dimension, and the like. The comment information to be used can be comment information generated by the user in each comment dimension, such as performance evaluation issued by the user for product performance.
In this embodiment, user comment information in different comment dimensions can be collected as the comment information to be used. For example, comment information in the performance dimension can be user comments related to the performance of the automobile, including problems of power, driving stability, and accelerator or brake response during the running of the automobile; comment information in the error feedback dimension may include error reports and problems fed back by the user when the automobile is used, for example: button failure, operational failure, etc. Comment information in the demand dimension can be feature demands, including user suggestions, automobile functions the user wishes to be added, suggestions for improving the automobile application, and the like, such as: add functions, improve functions, requests, etc. Comment information in the user experience dimension includes user comments on the design of a vehicle model or comments expressing emotion about a certain element, such as: comfort, mood, convenience, quickness, etc. Furthermore, the comment information to be used can be divided into sentences to obtain the corresponding comment sentences to be used in each comment dimension.
S220, screening each comment sentence to be used based on the product attribute corresponding to the comment information to be used to obtain target comment words.
The product attribute can be used for representing attention information of product field evaluation, for example, assuming that the product attribute is an automobile, the attention information can be automobile functions, use experience, automobile design and the like; if the product attribute is food, the attention information may be food taste, food color and luster and the like, and the attention information of the evaluation corresponding to different product attributes may be different.
In this embodiment, words in each comment sentence to be used may be screened based on the product attribute corresponding to the comment information to be used, and related words that may represent the product attribute may be screened out as target comment words. For example, after obtaining the comment information to be used, the comment information may be divided by means of manual marking, and the comment category of each sentence (i.e., comment sentence to be used) may be determined. And then selecting the field related words related to the product attributes from each sentence as target comment words, and taking the target comment words as basic words for generating the registration corpus so as to automatically expand the registration corpus based on the target comment words.
Illustratively, for a comment sentence containing the two words "Jams" and "sound", the user indicates through these words dissatisfaction with the music playing and sound equipment of the automobile and with the music playing effect during the running of the automobile. While the two words may not have great significance in other fields, they can play a key role in expressing the user's needs in the automobile field. Under the influence of the automobile field, the two words can therefore be extracted as target comment words, preparing for the subsequent knowledge expansion of the field.
And S230, determining similar words corresponding to the target comment words based on the words to be used in the predetermined corpus to be used.
The corpus to be used may be preset and include a plurality of words to be used, and serves as the source from which the registration corpus is expanded. A word to be used is a candidate word that may be filled into the registration corpus.
In this embodiment, the similarity between each target comment word and each word to be used may be calculated, and the words to be used whose similarity is higher than a set threshold may be taken as similar words of the target comment word. Correspondingly, the similar words corresponding to each target comment word can be obtained. Specifically, based on each word to be used in the predetermined corpus to be used, the similar words corresponding to a target comment word may be determined as follows: the shortest path distance between the target comment word and each word to be used is calculated, the similarity is represented by this shortest distance with an output range of 0 to 1, and the words to be used whose similarity is larger than a threshold (e.g., 0.7) are added to the registration corpus as similar words. Alternatively or additionally, each word to be used and each target comment word may be converted into word vectors, and the similarity between the word vector of each word in the corpus to be used and the word vector of each target comment word may be calculated, for example, by judging the similarity according to the cosine distance between the word vectors: the larger the cosine value between two word vectors, the higher the semantic similarity of the two words; the smaller the cosine value, the lower the semantic similarity. The words to be used whose similarity is larger than a threshold (e.g., 0.7) are then added to the registration corpus as similar words.
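The shortest-path variant can be illustrated with a toy synonym graph standing in for a lexical resource such as WordNet. The graph, the words, and the lowered demo threshold of 0.4 (the tiny graph cannot reach 0.7 for non-identical words) are all illustrative assumptions.

```python
from collections import deque

# Toy synonym graph; the words and edges are illustrative assumptions.
GRAPH = {
    "jam": ["blockage", "stoppage"],
    "blockage": ["jam", "obstruction"],
    "stoppage": ["jam"],
    "obstruction": ["blockage"],
    "melody": ["tune"],
    "tune": ["melody"],
}

def path_similarity(a, b):
    """Shortest-path similarity 1 / (1 + hops), in (0, 1]; 0.0 when the
    words are unreachable from each other."""
    if a == b:
        return 1.0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in GRAPH.get(node, []):
            if nxt == b:
                return 1.0 / (2 + dist)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return 0.0

def expand(seed, candidates, threshold=0.7):
    # Keep candidate words whose similarity to the seed clears the threshold.
    return [w for w in candidates if path_similarity(seed, w) > threshold]

expanded = expand("jam", ["blockage", "melody"], threshold=0.4)
```

"blockage" is one hop from the seed "jam" (similarity 0.5) and is kept, while "melody" is unreachable and discarded.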
S240, determining the registration corpus based on the target comment words and the similar words.
Specifically, a domain dictionary, i.e., a registration corpus, corresponding to the product attribute may be generated based on each target comment word and each similar word, so as to process the to-be-processed vector based on the registration corpus, and obtain a to-be-used text vector corresponding to the to-be-classified comment information.
And S250, obtaining comment information to be classified, and determining a sentence to be used corresponding to the comment information to be classified.
S260, determining to-be-processed vectors corresponding to the to-be-used words in the to-be-used sentences based on the comment attributes corresponding to the to-be-used sentences.
And S270, processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified.
S280, processing the text vector to be used based on a target text classification model obtained through pre-training to obtain a target classification result.
According to the technical scheme, comment information to be used of at least one comment dimension is obtained, and comment sentences to be used corresponding to the comment information to be used are determined; screening each comment sentence to be used based on the product attribute corresponding to the comment information to be used to obtain a target comment word; determining similar words corresponding to the target comment words based on each word to be used in a predetermined corpus to be used; the method comprises the steps of determining a registration corpus based on target comment words and similar words, screening comment information of comment dimensions by utilizing product attributes to obtain the target comment words in accordance with a set field, automatically expanding the target comment words to generate the registration corpus, improving the richness of comment data corresponding to the set field, and further improving classification accuracy.
EXAMPLE III
Fig. 3 is a flowchart of a text classification method according to a third embodiment of the present invention, and based on the foregoing embodiment, a target text classification model may be obtained through pre-training. The specific implementation manner can be referred to the technical scheme of the embodiment. The technical terms that are the same as or corresponding to the above embodiments are not repeated herein.
As shown in fig. 3, the method specifically includes the following steps:
and S310, acquiring a training sample set.
The training sample set comprises a plurality of training samples, and the training samples are used to participate in model training. In order to give the obtained target text classification model higher accuracy, as many and as diverse training samples as possible should be obtained, so that a well-trained target text classification model results.
Optionally, the implementation manner of obtaining the training sample set may be: obtaining historical comment information, and determining historical comment sentences corresponding to the historical comment information; determining a historical comment vector corresponding to each historical comment word in the historical comment sentences based on the comment attributes corresponding to the historical comment sentences; determining historical vectors to be used corresponding to the historical comment information based on the similarity between each vector to be registered in the registered corpus and the historical comment vectors; determining a field-related word vector corresponding to the historical to-be-used vector based on the historical to-be-used vector and the to-be-registered vector corresponding to each word to be registered in the registration corpus, and performing weighting processing on the field-related word vector to obtain a target related word vector; and determining a training sample of the target text classification model obtained by training based on the historical to-be-used vector, the target related word vector and the corresponding theoretical comment label.
The historical comment information can include user comment information under different comment dimensions. The theoretical comment tags can be understood as tags of comment categories corresponding to text features in comment information, and can be manually labeled, for example, the theoretical comment tags can be category tags of performance, error feedback, requirements, user experience and the like, for example, the tags of the performance types can be A1; the error feedback type tag may be B1.
In this embodiment, in order to obtain the target text classification model for determining the comment category of the comment information to be classified, user comment information (i.e., historical comment information) under different comment categories in the set field corresponding to the comment information to be classified may be collected. The historical comment information includes evaluation information on product performance in the set field, demand information on product functions, evaluation information on product use experience, feedback on product error reports and problems, and the like. In order to improve the training precision of the model, the historical comment information can first be split into a plurality of sentences. Words in each sentence can then be corrected by using regular expressions, for example, wrongly written words, contractions, repeated letters, and the like in the sentences are corrected to obtain corrected sentences. Furthermore, word filtering can be performed on the corrected sentences to remove the stop words in them; the stop words can be preset words such as "the", "I", "you", and the like, and the sentences after stop-word removal serve as the historical comment sentences. Further, sentiment analysis may be performed on the historical comment sentences by using a sentiment analysis model (such as VADER), so as to obtain the comment attribute corresponding to each historical comment sentence.
TF-IDF is then used to evaluate the importance of each historical comment word to the whole historical comment sentence, obtaining the TF-IDF value corresponding to each historical comment word; this value is taken as the text data vector corresponding to the historical comment word, and the text data vector and the corresponding comment attribute are combined to obtain the historical comment vector corresponding to the historical comment word. Furthermore, the similarity between each historical to-be-used vector and each to-be-registered vector stored in the registration corpus can be calculated; the historical to-be-used vectors whose similarity is higher than a preset threshold are taken as domain-related word vectors, and each domain-related word vector can be amplified by a certain percentage, the amplified vector being taken as a target related word vector, so as to emphasize the domain-related words. The historical to-be-used vectors other than the domain-related word vectors may be marked as comment word vectors, and the comment category of each word vector may serve as a theoretical comment label. Correspondingly, a plurality of training samples can be obtained based on the comment word vectors, the target related word vectors, and the corresponding theoretical comment labels, so that the target text classification model can be obtained by training on these training samples.
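The TF-IDF text data vector and the splicing of the comment attribute can be sketched as follows. The toy sentences and the sentiment value are hypothetical; a real system would use a full corpus and a sentiment model such as VADER.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Per-sentence TF-IDF over a whitespace-tokenized toy corpus."""
    docs = [s.split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return out

def history_vector(tfidf_row, sentiment):
    # Splice the comment attribute (e.g. a sentiment score) onto the
    # text data vector, as described above.
    return list(tfidf_row.values()) + [sentiment]

rows = tfidf_vectors(["engine power weak", "engine stalls"])
vec = history_vector(rows[0], sentiment=-0.4)    # hypothetical sentiment
```

Words that appear in every document ("engine") get weight zero, while distinctive words ("power", "weak") carry the signal; the sentiment value rides along as the last component.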
And S320, for each training sample, inputting the historical to-be-used vector of the current training sample into the to-be-trained text classification model to obtain the actual comment category.
It should be noted that any training sample whose actual comment category is currently being determined may be described as the current training sample; that is, determining the actual comment category of any one training sample can be treated as determining the actual comment category of the current training sample. The text classification model to be trained may be an SVM (Support Vector Machine). The model parameters of the text classification model to be trained may initially be default values. The text classification model to be trained can be trained based on each training sample to obtain the trained target text classification model.
Specifically, each training sample may be input to the to-be-trained text classification model, that is, each training sample may be used as an input parameter of the to-be-trained text classification model, and the to-be-trained text classification model may process the current training sample, for example, comment classification processing may be performed, so that a classification label corresponding to the current training sample, that is, an actual comment category, may be obtained, and accordingly, an actual comment category corresponding to each training sample may be obtained.
S330, training the text classification model to be trained based on the actual comment category and the theoretical comment label corresponding to the historical vector to be used in the current training sample.
It should be noted that, in this embodiment, a current training sample may be processed by using a classification technology, and the to-be-trained text classification model may output an actual comment category corresponding to the current training sample. Because the model parameters in the text classification model to be trained are not corrected, the output classification comment category is correspondingly different from the theoretical comment label corresponding to the current training sample. Based on the actual comment category output by the model and the corresponding theoretical comment label, an error value can be determined, and further based on the error value, model parameters in the text classification model to be trained can be corrected.
In practical application, a current training sample can be input into a text classification model to be trained to obtain an actual comment category corresponding to the current training sample, for example, the actual comment category can be a performance evaluation identifier, the currently output performance evaluation identifier is compared with a theoretical comment label marked manually, a similarity error value, namely a loss result, is calculated, and then model parameters of the text classification model to be trained can be corrected based on the loss result.
S340, taking convergence of the loss function in the text classification model to be trained as the training target, obtaining the target text classification model.
The training target refers to model training aiming at achieving the convergence of a preset loss function.
Specifically, the training error of the loss function, that is, the loss parameter, may be used as a condition for detecting whether the loss function reaches convergence currently, for example, whether the training error is smaller than a preset error or whether an error change trend tends to be stable, or whether the current iteration number is equal to a preset number. If the detection reaches the convergence condition, for example, the training error of the loss function is smaller than the preset error or the error change tends to be stable, which indicates that the training of the text classification model to be trained is completed, the iterative training may be stopped at this time. If the convergence condition is not reached currently, training samples can be further obtained to train the text classification model to be trained continuously until the training error of the loss function is within a preset range. When the training error of the loss function is converged, the text classification model to be trained can be used as a target text classification model, so that when comment information to be classified is input into the trained target text classification model, the model can accurately identify and output a target classification result corresponding to the comment information to be classified.
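The train-until-convergence loop can be illustrated with a perceptron standing in for the SVM classifier — a deliberate simplification, named plainly: the perceptron's per-epoch error count reaching zero on separable data gives a crisp convergence condition, mirroring "training error within a preset range or iteration cap reached". The toy samples are assumptions.

```python
def train_perceptron(samples, labels, lr=1.0, max_epochs=100):
    """Compare the actual output with the theoretical label, correct the
    model parameters on every error, and stop once an epoch finishes
    without errors (convergence) or the preset epoch cap is reached."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:                 # actual category != theoretical label
                errors += 1
                sign = 1 if y == 1 else -1
                w = [wi + lr * sign * xi for wi, xi in zip(w, x)]
                b += lr * sign
        if errors == 0:                   # training error reached zero
            return w, b, epoch
    return w, b, max_epochs

# Hypothetical 1-D feature vectors with binary theoretical labels.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
w, b, epochs = train_perceptron(samples, labels)
```

After a few corrective epochs the loop exits on the convergence condition rather than the iteration cap, and the learned parameters classify every sample correctly.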
S350, obtaining comment information to be classified, and determining sentences to be used corresponding to the comment information to be classified.
S360, determining to-be-processed vectors corresponding to the to-be-used words in the to-be-used sentences based on the comment attributes corresponding to the to-be-used sentences.
And S370, processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified.
And S380, processing the text vector to be used based on a target text classification model obtained through pre-training to obtain a target classification result.
According to the technical scheme of the embodiment, a training sample set is obtained; for each training sample, inputting a historical to-be-used vector of a current training sample into a to-be-trained text classification model to obtain an actual comment category; training a text classification model to be trained based on the actual comment category and a theoretical comment label corresponding to a historical vector to be used in a current training sample; the method comprises the steps of taking loss function convergence in a text classification model to be trained as a training target to obtain a target text classification model, leading comment attributes corresponding to sentences and words in a registration corpus into feature vectors of training samples, and then training based on the training samples to obtain the target text classification model, so that the precision of the model is improved, and the technical effect of improving the accuracy of model classification is achieved.
Example four
As an alternative embodiment of the foregoing embodiment, fig. 4 is a schematic diagram of a text classification method according to a fourth embodiment of the present invention. Specifically, the following details can be referred to.
Referring to fig. 4, in this technical solution, the target text classification model may be obtained by training a classifier on obtained training samples. A specific implementation may be as follows. User comment information under a plurality of comment categories is collected, wherein the comment categories include: performance problems, crash problems (bug reports), feature requests, and user experience problems. Comment information under the performance problem category includes user comments related to the performance of the automobile (such as problems of power, driving stability, and accelerator or brake response during the running of the automobile); comment information under the crash problem category includes error reports and problems fed back by the user when the automobile is used, such as: button failure, operational failure, etc. Comment information under the feature request category includes user suggestions, automobile functions the user wishes to be added, suggestions for improving the automobile application, etc., such as: add functions, improve functions, requests, etc. Comment information under the user experience problem category includes user comments on the design of a vehicle model or comments expressing emotion about a certain element, such as: comfort, mood, convenience, quickness, etc. Further, text preprocessing may be performed on the collected user comment information, including: 1. Comment sentence segmentation: the user comment information (i.e., the historical comment information) is segmented into sentences to obtain the historical comment sentences corresponding to each comment type. 2. Correcting wrongly written words: regular expressions are used for correcting wrongly written words, contractions, repeated letters, and the like in the historical comment sentences, for example: "U → you", "cuz → because", "& → and", "Plz → please", "sooo → so", and "thx → thank", where the word before the arrow is the word before correction and the word after the arrow is the word after correction. 3. Removing stop words: the stop words (e.g., the, I, you) in the corrected sentences are deleted using a stop word table, resulting in the processed historical comment sentences. It should be noted that, in the process of deleting stop words, modal verbs such as "can" and "should" in the corrected sentences are retained, since sentences including such modal verbs can improve the classification accuracy. For example, in "You should add a button" and "This function can't work", the modal verb "should" clearly expresses "need", and "can't" clearly expresses "unable"; both can play a role in text classification. Meanwhile, this text reduction strategy reduces the quantity of features (words) that the classifier must process and deletes information that may negatively influence the prediction capability of the classifier, thereby improving the classification accuracy. Emotion analysis is then performed on the processed historical comment sentences by using VADER to obtain the emotion value corresponding to each historical comment sentence, the emotion value reflecting the emotional state of the user. TF-IDF may be used to evaluate the importance of a word in a historical comment sentence to the whole comment information or the whole sentence, and the TF-IDF value of each word in the historical comment sentence is calculated as the text data vector.
The text data vector and the emotion value of a historical comment word can be spliced to obtain the historical to-be-used vector (namely the historical comment vector) of the historical comment word. Furthermore, the similarity between each historical to-be-used vector and each to-be-registered vector stored in the registration corpus can be calculated; the historical to-be-used vectors whose similarity is higher than a preset threshold are taken as domain-related word vectors, and the domain-related word vectors can be amplified by a certain percentage, so as to emphasize the domain-related words. The historical to-be-used vectors other than the domain-related word vectors are marked as comment word vectors, and the comment category of each word vector can be used as a theoretical comment label. Correspondingly, a plurality of training samples can be obtained based on the comment word vectors, the amplified domain-related word vectors, and the corresponding theoretical comment labels; the training samples can be used as the input of a classifier, so that the classifier is trained, and the trained classifier can be used as the target text classification model. In this technical solution, the text classification effect in the set field is improved by introducing the comment attributes of the words and the domain-related words in the registration corpus into the text vectorization process. For example, referring to the schematic diagram of the text vectorization method shown in fig. 5, an automobile comment such as "The power of this car needs to be improved" is converted into a text vector, which can be written as [..., a, ..., b, ...]; by introducing the numerical value m of the comment attribute (such as the emotion attribute) and the domain-related words into the text vectorization, [..., ka, ..., kb, ..., m] is obtained.
Here, a and b represent the domain-related word vectors, k represents the weight, and ka represents a domain-related word vector after vector amplification.
On the basis of the above scheme, the registration corpus may be predetermined as follows. First, a small number of comments are randomly extracted from the corpus, the comments are divided by means of manual labeling, and the category of each sentence is determined. A domain-related word is picked out from each sentence as a seed word of the registration corpus, namely a target comment word. For example, for a comment sentence containing the words "Jams" and "sound", the user expresses through these two words dissatisfaction with the music playing and sound equipment of the automobile and with the music playing effect during the running of the automobile; the two words may not have great significance in other fields, but can play a role in expressing user requirements in the automobile field, and can be extracted as target comment words under the influence of the automobile field. The registration corpus can then be automatically expanded by using the seed words; for example, the similarity between words can be obtained by using a dictionary and by word embedding respectively, and the words with high similarity are filled into the registration corpus as similar words. Illustratively, WordNet expansion may be used. WordNet divides words into five classes: nouns, verbs, adjectives, adverbs, and function words; its semantic relationships include synonymy, antonymy, and the like. The similarity between a target comment word and each word to be used can be calculated through these semantic relationships, the similarity being represented by the shortest path distance, and the words to be used whose similarity is larger than a threshold (e.g., 0.7) can be added to the registration corpus as similar words.
Word2vec expansion can also be used. Word2vec is a word-embedding method for vectorizing text, and the similarity between words can be judged according to the cosine distance between different word vectors: the larger the cosine value between two word vectors, the higher the semantic similarity of the two words; the smaller the cosine value, the lower the semantic similarity. For example, a target comment word may be converted into a word vector with the word2vec CBOW model using the gensim tool. The words of the whole corpus to be used are traversed, the similarity between the word vector of each word to be used in the corpus and the target comment words in each comment category is calculated, and the words to be used whose similarity is larger than a threshold (e.g., 0.7) are added to the registration corpus.
In order to make the technical solution of the embodiment of the present invention clearer to those skilled in the art, a specific application scenario example is given. Referring to fig. 6, before training the classifier, data collection may be performed, that is, comment information is collected, for example, by collecting text data on a network or importing local files. Data preprocessing is then performed on the collected data, which may be a series of operations such as text mark processing, stop-word removal, word segmentation, and stemming. Text mark processing: special symbols that interfere with classifier performance are removed. Stop-word removal: words such as subjects, pronouns, and articles in the text are eliminated. Word segmentation: the text is segmented based on delimiter symbols to obtain individual words for subsequent processing. Stemming: for example, the words "present", "presented", "presenting", and "presents" share the same stem, and stemming can unify them into "present", thereby reducing the data processing amount. The advantages of preprocessing are: the subsequent data can be better represented, the storage space is reduced, the calculation cost is reduced, and the text classification accuracy is improved. Further, the preprocessed text information may be characterized, that is, the text is converted into computer-recognizable structured data, which is also called text vectorization. For example, the text may be converted into vectors using a bag-of-words model and a word embedding model. Further, the processed vectors may be used for model training. For example, nonlinear processing may be performed based on a convolutional deep learning model algorithm, the input of which is typically a feature vector representation, such as a word.
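The stemming step can be sketched with a toy suffix-stripping stemmer, a stand-in for a real stemmer such as Porter's; the suffix list and the minimum-stem-length guard are illustrative assumptions.

```python
def stem(word):
    """Minimal suffix-stripping stemmer: unify inflected forms such as
    'presented'/'presenting'/'presents' into one stem. The guard keeps
    at least three characters so short words are left untouched."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["present", "presented", "presenting", "presents"]
stems = sorted({stem(w) for w in words})
```

All four inflected forms collapse to a single feature, which is exactly the data-reduction effect described above.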
After model training, the model performance may be evaluated using the test data to perform text classification based on the evaluated model.
According to the technical scheme, comment information to be classified is obtained, and sentences to be used corresponding to the comment information to be classified are determined; determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence; processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified; the method and the device have the advantages that the text vector to be used is processed based on the target text classification model obtained through pre-training to obtain the target classification result, the problems of low text classification accuracy and poor effect caused by semantic recognition in the prior art are solved, the comment attribute corresponding to the comment sentence and words in the registered corpus are introduced into the feature vector of the comment information to be classified, the text classification accuracy in the set field is improved, and the technical effect of improving the text classification effect is achieved.
EXAMPLE five
Fig. 7 is a schematic structural diagram of a text classification apparatus according to a fifth embodiment of the present invention. As shown in fig. 7, the apparatus includes: a sentence to be used determination module 710, a vector to be processed determination module 720, a text vector to be used determination module 730, and a target classification result determination module 740.
The to-be-used statement determining module 710 is configured to obtain comment information to be classified, and determine a to-be-used statement corresponding to the comment information to be classified; a to-be-processed vector determining module 720, configured to determine, based on the comment attribute corresponding to the to-be-used sentence, a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence; a to-be-used text vector determining module 730, configured to process the to-be-processed vector based on a predetermined registration corpus, so as to obtain a to-be-used text vector corresponding to the to-be-classified comment information; and the target classification result determining module 740 is configured to process the to-be-used text vector based on a target text classification model obtained through pre-training to obtain a target classification result.
According to the technical scheme, comment information to be classified is obtained, and sentences to be used corresponding to the comment information to be classified are determined; determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence; processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified; the method has the advantages that the text vector to be used is processed based on the target text classification model obtained through pre-training to obtain the target classification result, the problems of low text classification accuracy and poor effect caused by semantic recognition in the prior art are solved, the comment attribute corresponding to the comment sentence and words in the registered corpus are introduced into the feature vector of the comment information to be classified, the text classification accuracy in the set field is improved, and the technical effect of improving the text classification effect is achieved.
On the basis of the foregoing apparatus, optionally, the to-be-used statement determining module 710 includes a to-be-corrected statement determining unit, a to-be-filtered statement determining unit, and a to-be-used statement determining unit.
A to-be-corrected sentence determining unit, configured to perform sentence splitting processing on the to-be-classified comment information to obtain a to-be-corrected sentence;
the sentence to be filtered determining unit is used for correcting the sentence to be corrected to obtain the sentence to be filtered;
and the sentence to be used determining unit is used for performing word filtering processing on the sentence to be filtered to obtain the sentence to be used.
On the basis of the above device, optionally, the comment attribute includes an emotion attribute, and the to-be-processed vector determination module 720 includes an emotion attribute determination unit, a reverse file attribute value determination unit, and a to-be-processed vector determination unit.
The emotion attribute determining unit is used for carrying out emotion analysis on the statement to be used and determining the emotion attribute corresponding to the statement to be used;
the reverse file attribute value determining unit is used for determining a reverse file attribute value corresponding to each term to be used in the sentence to be used;
and the to-be-processed vector determining unit is used for determining to-be-processed vectors corresponding to the current to-be-used words according to the reverse file attribute values and the emotion attributes of the current to-be-used words.
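The "reverse file attribute value" above appears to denote an inverse-document-frequency (IDF) weight. The embodiment does not fix a formula, but one plausible reading, combining the word's IDF with the sentence-level emotion attribute as an extra feature dimension, could look like this (the formula and the sentiment encoding are assumptions for illustration):

```python
import math

def inverse_document_frequency(word, documents):
    # document frequency: in how many documents the word appears
    df = sum(1 for doc in documents if word in doc)
    # smoothed IDF weight for the word
    return math.log(len(documents) / (1 + df))

def to_be_processed_vector(word, sentence_sentiment, documents):
    # one plausible reading: the word's IDF weight with the sentence-level
    # emotion attribute appended as an extra feature dimension
    return [inverse_document_frequency(word, documents), sentence_sentiment]

# toy document collection; -1.0 encodes a negative emotion attribute
docs = [{"noise", "engine"}, {"price", "high"}, {"noise", "loud"}]
vec = to_be_processed_vector("noise", -1.0, docs)
print(vec)  # → [0.0, -1.0]
```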
On the basis of the foregoing apparatus, optionally, the to-be-used text vector determining module 730 includes a similarity determining unit, a domain-related vocabulary vector determining unit, and a to-be-used text vector determining unit.
The similarity determining unit is used for determining the to-be-registered vectors corresponding to the to-be-registered words in the registered corpus and respectively determining the similarity between the to-be-registered vectors and the to-be-processed vectors;
the domain-related vocabulary vector determining unit is used for marking the corresponding vector to be processed as a domain-related vocabulary vector if the similarity is greater than a preset threshold;
and the to-be-used text vector determining unit is used for generating the to-be-used text vector based on the domain-related vocabulary vector and the to-be-processed vector.
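As an illustrative sketch of this marking step (the toy two-dimensional vectors and the 0.8 threshold are chosen for the example; the embodiment does not specify them here):

```python
def cosine(u, v):
    # cosine similarity between a to-be-processed and a to-be-registered vector
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

registered_vectors = [[1.0, 0.0], [0.0, 1.0]]   # vectors of words in the registered corpus
to_be_processed = [[0.9, 0.1], [0.5, 0.5]]      # word vectors of the comment
THRESHOLD = 0.8

domain_related = []
text_vector = []
for vec in to_be_processed:
    flag = any(cosine(vec, reg) > THRESHOLD for reg in registered_vectors)
    if flag:
        domain_related.append(vec)   # marked as a domain-related vocabulary vector
    text_vector.append((vec, flag))  # the to-be-used text vector keeps every word plus its mark
print(len(domain_related))           # → 1
```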
On the basis of the foregoing apparatus, optionally, the target classification result determining module 740 includes a target word vector determining unit, a target text vector determining unit, and a target classification result determining unit.
The target word vector determining unit is used for obtaining a target word vector corresponding to each to-be-applied vector based on each to-be-applied vector in the to-be-used text vector and the corresponding weight value;
the target text vector determining unit is used for splicing the target word vectors to obtain a target text vector corresponding to the text vector to be used;
and the target classification result determining unit is used for inputting the target text vector into the target text classification model to obtain the target classification result.
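A minimal sketch of the weighting-and-splicing step performed by these units (the weight values are illustrative assumptions; the embodiment derives them, e.g., from domain relevance):

```python
def weight_and_concatenate(word_vectors, weights):
    # scale each to-be-applied vector by its weight value, then splice the
    # resulting target word vectors into a single target text vector
    target_words = [[w * x for x in vec] for vec, w in zip(word_vectors, weights)]
    return [x for vec in target_words for x in vec]

text_vector = [[1.0, 2.0], [3.0, 4.0]]   # to-be-applied vectors in the to-be-used text vector
weights = [0.5, 2.0]                     # e.g. a higher weight for domain-related vocabulary
target = weight_and_concatenate(text_vector, weights)
print(target)  # → [0.5, 1.0, 6.0, 8.0]; fed to the target text classification model
```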
On the basis of the above device, optionally, the device further includes a registration corpus determining module, where the registration corpus determining module includes a to-be-used comment sentence determining unit, a target comment word determining unit, a similar word determining unit, and a registration corpus determining unit.
The comment to be used sentence determining unit is used for acquiring comment information to be used of at least one comment dimension and determining comment sentences to be used corresponding to the comment information to be used;
the target comment word determining unit is used for screening each comment sentence to be used based on the product attribute corresponding to the comment information to be used to obtain a target comment word;
the similar vocabulary determining unit is used for determining similar vocabularies corresponding to the target comment words based on each vocabulary to be used in a predetermined corpus to be used;
a registered corpus determining unit, configured to determine the registered corpus based on the target comment word and the similar vocabulary.
On the basis of the above apparatus, optionally, the apparatus further includes a training sample determination module, where the training sample determination module includes a history comment sentence determination unit, a history comment vector determination unit, a history to-be-used vector determination unit, a target related word vector determination unit, and a training sample determination unit.
The history comment sentence determining unit is used for acquiring history comment information and determining a history comment sentence corresponding to the history comment information;
a history comment vector determination unit, configured to determine, based on comment attributes corresponding to the history comment sentences, history comment vectors corresponding to history comment words in the history comment sentences;
a history to-be-used vector determining unit, configured to determine a history to-be-used vector corresponding to the history comment information based on a similarity between each to-be-registered vector in the registered corpus and the history comment vector;
a target related word vector determining unit, configured to determine, based on the historical to-be-used vector and a to-be-registered vector corresponding to each to-be-registered word in the registration corpus, a field related word vector corresponding to the historical to-be-used vector, and perform weighting processing on the field related word vector to obtain a target related word vector;
and the training sample determining unit is used for determining a training sample of the target text classification model obtained by training based on the historical to-be-used vector, the target related word vector and the corresponding theoretical comment label.
On the basis of the above device, optionally, the device further includes a model training module, where the model training module includes an actual comment category determination unit, a model training unit, and a model determination unit.
The actual comment category determining unit is used for inputting the historical to-be-used vector of the current training sample into the to-be-trained text classification model for each training sample to obtain an actual comment category;
the model training unit is used for training the text classification model to be trained based on the actual comment category and a theoretical comment label corresponding to a historical vector to be used in the current training sample;
and the model determining unit is used for converging the loss function in the text classification model to be trained to serve as a training target to obtain the target text classification model.
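The train-until-the-loss-function-converges loop implemented by these units can be sketched with a toy logistic classifier standing in for the text classification model to be trained (the model, learning rate, and convergence tolerance are illustrative assumptions, not the embodiment's network):

```python
import math

def train_until_convergence(samples, lr=0.5, eps=1e-6, max_epochs=1000):
    # samples: (feature vector, theoretical comment label) pairs
    w = [0.0] * len(samples[0][0])
    b = 0.0
    prev_loss = float("inf")
    for _ in range(max_epochs):
        loss = 0.0
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # actual comment category score
            # cross-entropy between actual category and theoretical label
            loss += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
            g = p - y                             # gradient of the loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
        if abs(prev_loss - loss) < eps:           # training target: loss convergence
            break
        prev_loss = loss
    return w, b

samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]      # toy training samples
w, b = train_until_convergence(samples)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
print(predict([1.0, 0.0]), predict([0.0, 1.0]))   # → 1 0
```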
The text classification device provided by the embodiment of the invention can execute the text classification method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE six
Fig. 8 is a schematic structural diagram of an electronic device implementing the text classification method according to the embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as a text classification method.
In some embodiments, the text classification method may be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the text classification method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text classification method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of text classification, comprising:
obtaining comment information to be classified, and determining sentences to be used corresponding to the comment information to be classified;
determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence;
processing the vectors to be processed based on a predetermined registration corpus to obtain text vectors to be used corresponding to the comment information to be classified;
and processing the text vector to be used based on a target text classification model obtained by pre-training to obtain a target classification result.
2. The method of claim 1, wherein the determining the sentence to be used corresponding to the comment information to be classified comprises:
performing sentence division processing on the comment information to be classified to obtain a sentence to be corrected;
correcting the statement to be corrected to obtain a statement to be filtered;
and performing word filtering processing on the statement to be filtered to obtain the statement to be used.
3. The method of claim 1, wherein the comment attribute comprises an emotion attribute, and the determining a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence based on the comment attribute corresponding to the to-be-used sentence comprises:
performing sentiment analysis on the sentence to be used, and determining sentiment attributes corresponding to the sentence to be used;
determining a reverse file attribute value corresponding to each word to be used in the sentence to be used;
and for each word to be used, determining a vector to be processed corresponding to the word to be used at present based on the reverse file attribute value and the emotional attribute of the word to be used at present.
4. The method of claim 1, wherein the processing the to-be-processed vector based on a predetermined registration corpus to obtain a to-be-used text vector corresponding to the to-be-classified comment information comprises:
determining a vector to be registered corresponding to each word to be registered in the registration corpus, and respectively determining the similarity between each vector to be registered and the vector to be processed;
if the similarity is larger than a preset threshold value, marking the corresponding vector to be processed as a field-related vocabulary vector;
and generating the text vector to be used based on the domain-related vocabulary vector and the vector to be processed.
5. The method according to claim 1, wherein the processing the to-be-used text vector based on a pre-trained target text classification model to obtain a target classification result comprises:
obtaining target word vectors corresponding to the vectors to be applied based on the vectors to be applied in the text vectors to be used and the corresponding weight values;
splicing the target word vectors to obtain target text vectors corresponding to the text vectors to be used;
and inputting the target text vector into the target text classification model to obtain the target classification result.
6. The method of claim 1, further comprising:
determining a registration corpus;
the determining a registration corpus comprises:
obtaining comment information to be used of at least one comment dimension, and determining comment sentences to be used corresponding to the comment information to be used;
screening each comment sentence to be used based on the product attribute corresponding to the comment information to be used to obtain a target comment word;
determining similar words corresponding to the target comment words based on each word to be used in a predetermined corpus to be used;
determining the registered corpus based on the target comment word and the similar vocabulary.
7. The method of claim 1, further comprising:
obtaining historical comment information, and determining historical comment sentences corresponding to the historical comment information;
determining a historical comment vector corresponding to each historical comment word in the historical comment sentences based on comment attributes corresponding to the historical comment sentences;
determining historical to-be-used vectors corresponding to the historical comment information based on the similarity between each to-be-registered vector in the registered corpus and the historical comment vector;
determining a field related word vector corresponding to the historical to-be-used vector based on the historical to-be-used vector and the to-be-registered vector corresponding to each to-be-registered word in the registration corpus, and performing weighting processing on the field related word vector to obtain a target related word vector;
and determining a training sample of the target text classification model obtained by training based on the historical to-be-used vector, the target related word vector and the corresponding theoretical comment label.
8. The method of claim 7, further comprising:
for each training sample, inputting a historical to-be-used vector of a current training sample into a to-be-trained text classification model to obtain an actual comment category;
training the text classification model to be trained based on the actual comment category and a theoretical comment label corresponding to a historical vector to be used in the current training sample;
and converging a loss function in the text classification model to be trained as a training target to obtain the target text classification model.
9. A text classification apparatus, comprising:
the sentence to be used determining module is used for acquiring comment information to be classified and determining sentences to be used corresponding to the comment information to be classified;
a to-be-processed vector determining module, configured to determine, based on the comment attribute corresponding to the to-be-used sentence, a to-be-processed vector corresponding to each to-be-used word in the to-be-used sentence;
the to-be-used text vector determining module is used for processing the to-be-processed vector based on a predetermined registration corpus to obtain a to-be-used text vector corresponding to the to-be-classified comment information;
and the target classification result determining module is used for processing the text vector to be used based on a target text classification model obtained through pre-training to obtain a target classification result.
10. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of text classification of any of claims 1-8.
CN202211321452.0A 2022-10-26 2022-10-26 Text classification method and device, electronic equipment and storage medium Pending CN115577109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211321452.0A CN115577109A (en) 2022-10-26 2022-10-26 Text classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115577109A true CN115577109A (en) 2023-01-06

Family

ID=84586381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211321452.0A Pending CN115577109A (en) 2022-10-26 2022-10-26 Text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115577109A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822533A (en) * 2023-07-25 2023-09-29 北京卓思天成数据咨询股份有限公司 Automobile design defect monitoring and identifying method
CN116822533B (en) * 2023-07-25 2024-02-02 北京卓思天成数据咨询股份有限公司 Automobile design defect monitoring and identifying method

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
Shoukry et al. A hybrid approach for sentiment classification of Egyptian dialect tweets
JP2019504413A (en) System and method for proposing emoji
KR20160121382A (en) Text mining system and tool
CN107273348B (en) Topic and emotion combined detection method and device for text
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN108009297B (en) Text emotion analysis method and system based on natural language processing
CN113780007A (en) Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
Sintaha et al. An empirical study and analysis of the machine learning algorithms used in detecting cyberbullying in social media
Haque et al. Opinion mining from bangla and phonetic bangla reviews using vectorization methods
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
US20190132274A1 (en) Techniques for ranking posts in community forums
CN112926308A (en) Method, apparatus, device, storage medium and program product for matching text
CN115757775B (en) Text inclusion-based trigger word-free text event detection method and system
Prakash et al. Lexicon Based Sentiment Analysis (LBSA) to Improve the Accuracy of Acronyms, Emoticons, and Contextual Words
CN115827867A (en) Text type detection method and device
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN115563242A (en) Automobile information screening method and device, electronic equipment and storage medium
Suhasini et al. A Hybrid TF-IDF and N-Grams Based Feature Extraction Approach for Accurate Detection of Fake News on Twitter Data
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination