CN111767403B

CN111767403B - Text classification method and device

Info

Publication number: CN111767403B
Application number: CN202010644879.9A
Authority: CN
Inventors: 刘志煌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2023-10-31
Anticipated expiration: 2040-07-07
Also published as: CN111767403A

Abstract

The embodiment of the application discloses a text classification method and a text classification device, and relates to the field of big data and the natural language processing field of artificial intelligence. The embodiment of the application acquires a text to be classified and a word stock for text classification; performs word segmentation on the text to be classified to obtain a plurality of text words and word order information of the text words; determines target category keywords existing in the text to be classified according to the text words and the category keywords; determines target text words associated with the target category keywords from the text to be classified based on the word order information of the text words and the target category keywords; classifies the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords; and integrates the classification result of each target category keyword to obtain the category of the text to be classified. The scheme can improve the accuracy of the classification result of text classification.

Description

Text classification method and device

Technical Field

The application relates to the field of data processing, in particular to a text classification method and device.

Background

With the development of technology, the information available on the internet is becoming broader and the data volume correspondingly larger. Massive data needs to be processed in order to obtain the actually needed target data more efficiently and rapidly; for example, text data can be classified. In the prior art, text data classification can be realized by performing keyword search on the text data, and improper parts of the text data can finally be removed. In the course of research and practice of the prior art, the inventors of the present application found that the classification results obtained by the prior art are less accurate.

Disclosure of Invention

The embodiment of the application provides a text classification method and a text classification device, which can improve the accuracy of a classification result of text classification.

The embodiment of the application provides a text classification method, which comprises the following steps:

acquiring a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories;

word segmentation is carried out on the text to be classified, so that a plurality of text words and word sequence information of the text words are obtained;

determining target category keywords existing in the text to be classified according to the text words and the category keywords;

determining target text words associated with target category keywords from the text to be classified based on word order information of the text words and the target category keywords;

classifying the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords;

and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

Accordingly, an embodiment of the present application provides a text classification device, including:

the acquisition module is used for acquiring the text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories;

the word segmentation module is used for segmenting the text to be classified to obtain a plurality of text words and word sequence information of the text words;

the first determining module is used for determining target category keywords existing in the text to be classified according to the text words and the category keywords;

the second determining module is used for determining target text words associated with the target category keywords from the text to be classified based on word order information of the text words and the target category keywords;

the classification module is used for classifying the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords;

and the integration module is used for integrating the classification result of each target category keyword to obtain the category of the text to be classified.

In some embodiments of the present application, the classification module includes a classification sub-module and an integration sub-module, wherein,

the classification sub-module is used for classifying each target text word corresponding to the target category keyword on the preset category based on the positive characteristic word and the negative characteristic word to obtain a classification result of each target text word;

and the integration sub-module is used for integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword.

In some embodiments of the application, the classification submodule comprises a statistical unit and a calculation unit, wherein,

the statistics unit is used for respectively counting the occurrence frequencies of the positive characteristic words and the negative characteristic words in all text words to obtain the positive word frequency and the negative word frequency of the target category keywords on the preset category;

and the computing unit is used for carrying out classification computation on each target text word corresponding to the target category keyword based on the positive word frequency and the negative word frequency to obtain a classification result of each target text word.

In some embodiments of the application, the integration submodule comprises a counting unit and a determining unit, wherein,

the counting unit is used for respectively counting target text words with classification results of positive and negative categories to obtain positive and negative numbers;

and the determining unit is used for determining the classification result of the target category keywords based on the positive quantity and the negative quantity.

In some embodiments of the application, the determining unit is specifically configured to:

when the positive quantity is larger than the negative quantity, determining that the classification result of the target category keyword is a positive category;

and when the positive quantity is smaller than the negative quantity, determining that the classification result of the target category keyword is a negative category.
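
The counting unit and the determining unit described above can be sketched as a simple majority vote. This is only a minimal illustration: the function name `keyword_category` and the string labels "positive" and "negative" are hypothetical, and the text does not say what happens when the two quantities are equal, so the sketch leaves that case undecided.

```python
from typing import Iterable, Optional


def keyword_category(word_labels: Iterable[str]) -> Optional[str]:
    """Combine per-word results into one result for a target category keyword.

    word_labels holds the classification result ("positive" or "negative")
    of each target text word associated with the keyword.
    """
    labels = list(word_labels)
    positive_quantity = labels.count("positive")
    negative_quantity = labels.count("negative")

    if positive_quantity > negative_quantity:
        return "positive"
    if positive_quantity < negative_quantity:
        return "negative"
    # Equal quantities: not specified in the text, left undecided here.
    return None


print(keyword_category(["positive", "negative", "positive"]))
```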

In some embodiments of the present application, the text classification apparatus further includes:

the data acquisition module is used for acquiring category reference words and sample data of preset categories;

the expansion module is used for performing near-sense expansion on the category reference words to obtain category keywords of the preset category;

the processing module is used for processing the sample data based on the category keywords, determining positive characteristic words and negative characteristic words of the preset category, and obtaining a word stock, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to the preset category.

In some embodiments of the application, the processing module is specifically configured to:

dividing the sample data based on the category keywords to obtain positive sample data and negative sample data;

and respectively carrying out feature word mining on the positive sample data and the negative sample data based on a preset threshold value and the category keywords, and determining positive feature words and negative feature words of the preset category.

In some embodiments of the present application, the word stock includes category keywords, positive feature words, negative feature words corresponding to a plurality of preset categories, and the classification module is specifically configured to:

classifying the target text words corresponding to the target category keywords on each preset category based on the positive characteristic words and the negative characteristic words of each preset category to obtain classification results of the target category keywords on each preset category;

at this time, the integration module includes an integration sub-module, wherein,

and the integration sub-module is used for integrating the classification result of each target category keyword on each preset category and determining the category of the text to be classified.

In some embodiments of the present application, each preset category includes a positive sub-category and a negative sub-category, and the integration sub-module is specifically configured to:

determining the sub-category of the text to be classified on each preset category according to the sub-category of each target category keyword on each preset category;

and integrating the sub-categories of the texts to be classified on all preset categories to obtain the categories of the texts to be classified.

Correspondingly, the embodiment of the application also provides a storage medium, and the storage medium stores a computer program which is suitable for being loaded by a processor to execute any text classification method provided by the embodiment of the application.

Correspondingly, the embodiment of the application also provides computer equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes any text classification method provided by the embodiment of the application when executing the computer program.

According to the scheme, the target category keywords in the text to be classified can be determined through the category keywords of the preset categories in the word stock, and the target text words associated with the target category keywords can be determined through the word order information of the text words obtained by segmenting the text to be classified. The target text words are then classified, the classification result of each target category keyword is determined from the classification results of its target text words, and finally the category of the text to be classified is determined based on the classification result of each target category keyword. Because the classification result of a target category keyword is determined through the target text words corresponding to it, rather than through the presence of the keyword alone, the accuracy of the classification result of text classification can be remarkably improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic view of a text classification device according to an embodiment of the present application;

fig. 2 is a schematic flow chart of a text classification method according to an embodiment of the present application;

fig. 3 is a schematic view of an application scenario effect before the text classification method is applied according to the embodiment of the present application;

fig. 4 is a schematic view of an application scenario effect after the text classification method is applied according to the embodiment of the present application;

FIG. 5 is another flow chart of a text classification method according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of a method for classifying junk text according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a text classification device according to an embodiment of the present application;

fig. 8 is another schematic structural diagram of a text classification device according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described in the present application are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

The text classification method provided by the embodiment of the application relates to the field of artificial intelligence, in particular to the field of machine learning in the field of artificial intelligence.

Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language that people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques and the like.

The application relates to technologies such as natural language processing in the field of artificial intelligence. Processes such as the near-sense expansion and the word segmentation can be completed through natural language processing techniques of artificial intelligence, and the specific contents are described through the following embodiments.

The embodiment of the application provides a text classification method and a text classification device. Specifically, the embodiment of the application can be integrated in a text classification device, and the text classification device can be integrated in a text classification computer device. The text classification computer device can be an electronic device such as a terminal, and the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch and the like, but is not limited thereto. The text classification computer device may also be an electronic device such as a server, where the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud computing services. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.

As shown in fig. 1, fig. 1 is a schematic view of a text classification device according to an embodiment of the present application. The terminal can acquire a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; word segmentation is carried out on the text to be classified, so that a plurality of text words and word sequence information of the text words are obtained; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining target text words associated with the target category keywords from the text to be classified based on word order information of the text words and the target category keywords; classifying the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

The embodiment of the application can also finish the text classification process through the cooperation of the terminal and the server, for example, the terminal can transmit the text to be classified input by the user to the server, the server can receive the text to be classified transmitted by the terminal and acquire a word stock for text classification, and then the server can finish the text classification process through the text classification method of the application to acquire the category of the text to be classified and trigger the further operation of the terminal and the server based on the category.

It should be noted that, the schematic view of the scenario of the text classification device shown in fig. 1 is only an example, and the text classification device and the scenario described in the embodiment of the present application are for more clearly describing the technical solution of the embodiment of the present application, and do not constitute a limitation on the technical solution provided by the embodiment of the present application, and those skilled in the art can know that, with the evolution of the text classification device and the appearance of a new service scenario, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.

The following will describe in detail.

In this embodiment, the description is given from the perspective of a text classification apparatus, which may specifically be integrated in a terminal, for example a terminal that has a storage unit and a microprocessor, such as a camera, a video camera, a smart phone, a tablet computer, a notebook computer, a personal computer, or a wearable intelligent device.

Fig. 2 is a schematic flow chart of a text classification method according to an embodiment of the application. The text classification method may include:

101. Obtaining the text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories.

The text to be classified can comprise text data to be classified. The text data can be edited by a user; for example, the user can manually input data information comprising the text data in an editing box of an application program client. The text data may also be automatically generated by a particular system based on a particular function, such as a text generation system (e.g., a marketing account generator, a nonsense text generator, etc.) that may automatically generate a preset number of words based on a particular topic. The text to be classified can be a sentence or a paragraph, and can comprise basic elements such as punctuation marks, operators, numbers, letters, and words (words in different languages such as Chinese and English).

A preset category may be summary information describing similar text data content. Common preset categories may include various evaluation categories (such as film and television evaluation, food evaluation, etc.), advertisement categories, and the like. For example, category keywords of advertisement text may include grand premium, skip price, holiday, etc.; category keywords of food evaluation text may include delicacy, fruit, braised, etc.

The category keywords may include words or combinations of words whose frequency is greater than a preset threshold when a plurality of people describe the content of the preset category. The category keywords are important for realizing efficient and accurate text classification, and can be determined in various ways. For example, a plurality of text data samples of the preset category may be segmented, word frequency statistics may be performed on all words obtained by the segmentation, and the words whose frequency is greater than a preset word frequency may then be determined as the category keywords of the preset category.

As another example, the category keywords can be manually determined by a developer or a user. The category keywords can be determined before text classification starts; alternatively, during text classification, related personnel such as the developer or the user can add or delete category keywords according to actual requirements. Adjustable category keywords make the scheme more flexible and allow it to be switched in time between different scenarios, which helps ensure high accuracy of the classification results.

The category keywords serve as high-frequency words describing a preset category, but the presence of a category keyword does not by itself mean that the text data containing it belongs to the preset category. To ensure high accuracy of text classification, words other than the category keywords that commonly appear in the text data need to be collected as feature words of the preset category. The feature words can comprise positive feature words and negative feature words, which respectively represent words common to text data that belongs to the preset category and to text data that does not belong to the preset category. Like the category keywords, the positive feature words and the negative feature words can be words or characters (including letters, characters, numbers, symbols and the like) representing specific meanings, such as alpha-goods, poplar forests, +V and the like.

The positive characteristic words and the negative characteristic words can be determined through text data samples of the preset category in various ways. For example, they can be manually determined by related personnel, which is suitable when the number of text data samples is small or when high-quality, accurate characteristic words cannot be obtained in other ways; they can be determined through a trained neural network model; or feature word mining can be performed on the text data samples by an algorithm, and so on. The manner of determining the feature words can be flexibly selected according to the characteristics of the text data samples and actual requirements, and is not described in detail here.

For example, the preset category may be the advertisement category, and the text to be classified may be "one-to-one repeated engraving high imitation brand multiple satellite knot same number 78666". The category keywords of the advertisement category in the word stock may be "high imitation, repeated engraving, checking goods, factory, tail note, weChat, satellite, V letter", the positive feature words of the advertisement category may be "factory, real shooting, consultation, detail, field, literacy, tiger puff", and the negative feature words of the advertisement category may be "resisting, trading, illegal, rampant, business, infringement".
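
One possible in-memory layout for such a word stock, and the lookup of target category keywords in a segmented text, is sketched below. The class name `CategoryLexicon`, the function `find_target_keywords` and the dictionary layout are hypothetical choices made for illustration; the word lists are a small subset of the translated example words above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set


@dataclass
class CategoryLexicon:
    """Entries of the word stock for one preset category (hypothetical layout)."""
    category_keywords: Set[str] = field(default_factory=set)
    positive_feature_words: Set[str] = field(default_factory=set)
    negative_feature_words: Set[str] = field(default_factory=set)


# Word stock keyed by preset category name.
word_stock: Dict[str, CategoryLexicon] = {
    "advertisement": CategoryLexicon(
        category_keywords={"high imitation", "satellite", "V letter"},
        positive_feature_words={"real shooting", "consultation", "detail"},
        negative_feature_words={"illegal", "rampant", "infringement"},
    )
}


def find_target_keywords(text_words: Sequence[str],
                         lexicon: CategoryLexicon) -> List[str]:
    """Return the category keywords present in the text, in order of first occurrence."""
    found: List[str] = []
    for word in text_words:
        if word in lexicon.category_keywords and word not in found:
            found.append(word)
    return found


text_words = ["high imitation", "brand", "satellite", "78666"]
print(find_target_keywords(text_words, word_stock["advertisement"]))
```

Keeping the three word lists as sets makes membership checks cheap, and keying the word stock by category leaves room for several preset categories.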

In some embodiments of the present application, the text classification method may further include the steps of:

Acquiring category reference words and sample data of a preset category; performing near-sense expansion on the category reference words to obtain category keywords of a preset category; based on category keywords, sample data are processed, positive characteristic words and negative characteristic words of preset categories are determined, and a word stock is obtained, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to the preset categories.

The category reference words may be basic words whose frequency is greater than a preset threshold when describing the content of the preset category; for example, "WeChat" is a category reference word. The category reference words may then be subjected to near-sense expansion; for example, "WeChat" may be expanded into "satellite, V-letter, contact way, +V" and the like. The near-sense expansion may be performed in various manners: through a dictionary (for example, a synonym word forest), through a neural network model (for example, a word vector model), or by related personnel (for example, developers or users). After the near-sense expansion is completed, the category reference words and the words obtained by expanding them can together be used as the category keywords of the preset category; for example, the category keywords of the advertisement category can comprise WeChat, satellite, V letter, contact way and +V.
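
A dictionary-based near-sense expansion can be sketched as follows. The table `SYNONYMS` is a hypothetical stand-in for a real synonym dictionary or word vector model, filled only with the example words from the text, and the function name `expand_reference_words` is likewise an assumption.

```python
from typing import Dict, Iterable, List

# Hypothetical synonym table standing in for a real synonym dictionary.
SYNONYMS: Dict[str, List[str]] = {
    "WeChat": ["satellite", "V letter", "contact way", "+V"],
}


def expand_reference_words(reference_words: Iterable[str],
                           synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the reference words plus their near-sense expansions, without duplicates."""
    keywords: List[str] = []
    for word in reference_words:
        # The reference word itself is kept as a category keyword.
        for candidate in [word, *synonyms.get(word, [])]:
            if candidate not in keywords:
                keywords.append(candidate)
    return keywords


print(expand_reference_words(["WeChat"], SYNONYMS))
```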

The sample data, i.e. text data samples, may be related to the preset category. Processing the sample data based on the category keywords to obtain the positive characteristic words and the negative characteristic words is a key step for realizing the text classification, because the positive characteristic words and the negative characteristic words are key reference information required by the text classification method. The processing can be flexibly selected according to actual conditions and is not repeated herein.

For example, the category reference word "high imitation" of the advertisement class is obtained and subjected to near-sense expansion through the synonym forest, so that "fine imitation, a goods, factory goods, original factory" can be obtained; "high imitation, fine imitation, a goods, factory goods, original factory" can then be used as the category keywords of the advertisement class. The obtained sample data can then be processed through the category keywords. The sample data can be, for example, "false-making and false-selling are rampant, the industry and commerce department makes a punch, the so-called original factory shoes are strongly hit" and "wet shoes are fully hit with the satellite numbers 1234 when the large brands are purchased", so that the obtained negative characteristic words of the advertisement class can be "rampant, hit", and the obtained positive characteristic words can be "folding purchase, brands".

In some embodiments of the present application, the step of "processing the sample data based on the category keyword to determine the positive feature word and the negative feature word of the preset category" may include:

dividing sample data based on category keywords to obtain positive sample data and negative sample data; and respectively carrying out feature word mining on the positive sample data and the negative sample data based on a preset threshold value and category keywords, and determining positive feature words and negative feature words of a preset category.

Specifically, the sample data related to the preset category may be divided, if the sample data belongs to the preset category, the sample data is positive sample data, and if the sample data does not belong to the preset category, the sample data is negative sample data. The division can be performed manually (manual labeling), or automatically by an algorithm or a trained neural network model.

After the sample data is divided, the positive feature words of the preset category may be determined based on the positive sample data, and the negative feature words of the preset category may be determined based on the negative sample data. Specifically, feature word mining may be performed on the positive/negative sample data; for example, the positive feature words and the negative feature words may be mined by the PrefixSpan (Prefix-Projected Pattern Growth) algorithm.

For example, the positive/negative sample data may first be preprocessed to filter out irrelevant information such as punctuation marks and numbers (for example, by regular-expression filtering). The category keywords existing in the positive/negative sample data are then filtered out to obtain filtered positive/negative sample data, and word segmentation is performed on the filtered positive/negative sample data to obtain a plurality of sample words. The word segmentation may be performed through a word segmentation tool (such as jieba segmentation).
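
The preprocessing order described above (filter punctuation and numbers, filter category keywords, then segment) can be sketched as follows. This is an illustration under stated assumptions: the function name `preprocess_sample` is hypothetical, and whitespace splitting stands in for a real word segmentation tool, which would be needed for Chinese text. The sample sentence is an invented English stand-in.

```python
import re
from typing import Iterable, List

# Punctuation (anything that is neither a word character nor whitespace) and digits.
_NOISE = re.compile(r"[^\w\s]|\d")


def preprocess_sample(sample: str, category_keywords: Iterable[str]) -> List[str]:
    """Filter noise and category keywords from one sample, then split it into words."""
    # Step 1: regular-expression filtering of punctuation marks and numbers.
    cleaned = _NOISE.sub(" ", sample)
    # Step 2: filter out category keywords, longest first so that a longer
    # keyword is removed before any shorter keyword contained in it.
    for keyword in sorted(category_keywords, key=len, reverse=True):
        cleaned = cleaned.replace(keyword, " ")
    # Step 3: segmentation. Whitespace splitting is an assumption used only
    # as a stand-in for a word segmentation tool.
    return cleaned.split()


print(preprocess_sample("Buy big brands at a discount, add satellite 1234!", {"satellite"}))
```

Note that keywords containing punctuation, such as "+V", no longer match after step 1; a real implementation would have to decide how to handle such keywords.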

Feature word mining may then be performed on the sample data, and the mining process may include:

1. Find all word sequence prefixes of length 1 and their corresponding projection data sets;

2. Count the occurrence frequency of each word sequence prefix, and add the prefixes whose occurrence frequency is higher than the minimum support threshold to the data set, obtaining the frequent word sequences of length i=1;

3. For every word sequence prefix of length i that meets the minimum support requirement, mine recursively:

1) Find the projection data set of the word sequence prefix, and return from the recursion if the projection data set is empty;

2) Count the support of each item in the corresponding projection data set, and merge each item that meets the minimum support with the current prefix to obtain a new prefix; return from the recursion if no item meets the minimum support requirement;

3) Let i=i+1, take each new prefix obtained by merging as the word sequence prefix, and recursively execute step 3 for each of them;

4. Return all frequent word sequences in the word sequence data set.

The frequent word sequences are the feature words; the positive feature words and the negative feature words can be obtained by performing feature word mining in this way on the positive sample data and the negative sample data respectively.

The minimum support may be determined from a minimum support rate, which can be flexibly adjusted in practice according to factors such as the preset category and the amount of sample data. The minimum support may be calculated as follows:

min_sup = a × n

wherein min_sup is the minimum support, a is the minimum support rate, and n is the number of sample data.
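The mining steps and the minimum support formula above can be sketched in Python as follows. This is a minimal PrefixSpan-style sketch for sequences of single words, not the implementation of the application: the function name `mine_frequent_sequences`, rounding min_sup up to an integer, and treating "meets the minimum support" as "count >= min_sup" are assumptions made for illustration.

```python
import math
from collections import defaultdict


def mine_frequent_sequences(sequences, min_support_rate):
    """Return {frequent word sequence (tuple): support count}."""
    # min_sup = a * n; rounding up to an integer count is an assumption.
    min_sup = math.ceil(min_support_rate * len(sequences))
    frequent = {}

    def mine(prefix, projected):
        # Support of an item = number of projected sequences containing it.
        counts = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in counts.items():
            if count < min_sup:  # assumption: "meets" means >=
                continue
            new_prefix = prefix + (item,)
            frequent[new_prefix] = count
            # Projection: keep what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            new_projected = [s for s in new_projected if s]
            if new_projected:  # an empty projection ends this branch
                mine(new_prefix, new_projected)

    # Starting from the empty prefix covers the length-1 prefixes of steps 1-2.
    mine((), [list(s) for s in sequences])
    return frequent


# Hypothetical segmented samples, for illustration only.
samples = [
    ["brand", "original", "consult"],
    ["brand", "consult"],
    ["original", "consult", "details"],
    ["brand", "details"],
]
frequent = mine_frequent_sequences(samples, 0.5)  # min_sup = 0.5 * 4 = 2
```

With these samples, the ordered pair ("brand", "consult") is frequent with support 2, while ("consult", "brand") never occurs in that order and is not returned.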

102. Segment the text to be classified to obtain a plurality of text words and word order information of the text words.

The text words may include the words that make up the text to be classified, and the word order information of a text word may be its position in the text to be classified.

A sentence is composed of a plurality of words, and different orderings of those words form different sentences that convey different meanings. Word segmentation is the process by which the computer device divides a sentence into its correct words. Because the order of the words in a sentence matters, the word order information of the text words is obtained together with the segmentation.

In practice, word segmentation may be performed with a word segmentation tool. Common word segmentation tools may be dictionary-based, machine-learning-based and so on, and may include Paoding Jieniu, a Abstract word segmentation device, the Stanford word segmenter and the like.

In order to obtain a word segmentation result with higher accuracy, the text to be classified may be preprocessed, such as screening out useless characters, before word segmentation.

The application can perform classification detection on some of the words in the text to be classified and thereby determine the category of the text. Segmenting the text to be classified into a plurality of text words is the basis for performing classification detection on those words.

For example, to segment the text "one-to-one replica high imitation brand complete satellite buckle same number 78666", preprocessing may first screen out the numbers, giving "one-to-one replica high imitation brand complete satellite buckle same number". Word segmentation then yields the text words and their word order information: one-to-one (1), replica (2), high imitation (3), brand (4), complete (5), satellite (6), buckle (7), same number (8).
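The preprocessing and word order bookkeeping in this example can be sketched as follows. The sketch does not call a real word segmentation tool: the `segment` argument and the `slash_segmenter` stand-in (which splits a text pre-marked with "/") are hypothetical, and the English token spellings are illustrative.

```python
import re


def segment_with_order(text, segment):
    """Screen out digits, segment the text, and attach 1-based word order."""
    cleaned = re.sub(r"\d+", "", text)           # preprocessing: drop numbers
    words = [w.strip() for w in segment(cleaned)]
    words = [w for w in words if w]              # drop empty pieces
    return [(word, order) for order, word in enumerate(words, start=1)]


def slash_segmenter(text):
    # Stand-in for a real segmentation tool (assumption for illustration).
    return text.split("/")


text = "one-to-one/replica/high imitation/brand/complete/satellite/buckle/same number/78666"
text_words = segment_with_order(text, slash_segmenter)
```

Here `text_words` holds the eight words of the example paired with their positions 1 to 8, and the trailing number is removed before segmentation.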

103. Determine the target category keywords present in the text to be classified according to the text words and the category keywords.

The target category keywords may be the category keywords that are present in the text to be classified. Specifically, the text words of the text to be classified may be searched for words identical to a category keyword; any such word is a target category keyword.

For example, the category keywords may include "high imitation, replica, inspection, original factory, tail bill, WeChat, satellite, V letter". Searching the text words of the text to be classified for each category keyword determines that the target category keywords of the text are "replica", "high imitation" and "satellite".
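The keyword lookup of this step can be sketched as a set membership test over the text words. The function name `find_target_keywords` and the (word, order) tuple layout are assumptions for illustration; the token spellings are illustrative.

```python
def find_target_keywords(text_words, category_keywords):
    """Return the (word, order) pairs whose word is a category keyword."""
    keywords = set(category_keywords)
    return [(word, order) for word, order in text_words if word in keywords]


text_words = [
    ("one-to-one", 1), ("replica", 2), ("high imitation", 3), ("brand", 4),
    ("complete", 5), ("satellite", 6), ("buckle", 7), ("same number", 8),
]
category_keywords = [
    "high imitation", "replica", "inspection", "original factory",
    "tail bill", "WeChat", "satellite", "V letter",
]
targets = find_target_keywords(text_words, category_keywords)
```

Keeping the word order alongside each matched keyword lets the next step locate the neighbouring words without searching the text again.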

104. Based on the word order information of the text words and the target category keywords, determine the target text words associated with the target category keywords from the text to be classified.

The target text words are text words in the text to be classified that are associated with a target category keyword; specifically, they may be the text words adjacent to the target category keyword in its context. The number of adjacent text words can be set flexibly according to actual requirements, for example 2 or 3. The computer device may determine the position of the target category keyword in the text to be classified from the word order information of the text words, determine the word order information of the set number of context words, and obtain the target text words corresponding to the target category keyword from that word order information.

For example, for the target category keyword "satellite" in "one-to-one replica high imitation brand complete satellite buckle same number 78666", the corresponding target text words may be "brand", "complete", "buckle" and "same number".
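The context window selection described above can be sketched as follows, assuming a symmetric window of a set number of words on each side of the keyword. The function name `context_words` and the default window of 2 are assumptions for illustration.

```python
def context_words(text_words, keyword_order, window=2):
    """Return the words within `window` positions of the keyword, excluding it."""
    return [
        word
        for word, order in text_words
        if order != keyword_order and abs(order - keyword_order) <= window
    ]


text_words = [
    ("one-to-one", 1), ("replica", 2), ("high imitation", 3), ("brand", 4),
    ("complete", 5), ("satellite", 6), ("buckle", 7), ("same number", 8),
]
neighbours = context_words(text_words, keyword_order=6, window=2)
```

For "satellite" at position 6, the window covers positions 4, 5, 7 and 8, which matches the four target text words of the example. A keyword near the start or end of the text simply gets fewer neighbours.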

105. Based on the positive feature words and the negative feature words, classify the target text words corresponding to the target category keywords on the preset category to obtain the classification results of the target category keywords.

The target text words corresponding to a target category keyword are classified in order to determine the classification result of that keyword. The classification result of a target category keyword, like that of a target text word, may take one of two values: belonging to the preset category (positive category) or not belonging to the preset category (negative category).

For example, the positive feature words of the advertisement class may include "genuine, real shooting, consultation, details, field, recognition and tiger puff", and the negative feature words of the advertisement class may include "resisting, trading, illegal, rampant, business and infringement". Classifying the target text words "brand", "complete", "buckle" and "same number" corresponding to the target category keyword "satellite" in the text "one-to-one replica high imitation brand complete satellite buckle same number 78666" may determine that the classification result of the target category keyword is the positive category.

In some embodiments of the present application, the step of classifying, based on the positive feature word and the negative feature word, the target text word corresponding to the target category keyword on the preset category to obtain the classification result of the target category keyword may include:

classifying each target text word corresponding to the target category keyword on the basis of the positive characteristic word and the negative characteristic word to obtain a classification result of each target text word; and integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword.

A target category keyword may have multiple target text words, so classifying them yields multiple classification results, and the classification result of the target category keyword is obtained by integrating these results. The integration may be a weighted calculation over the classification results, where the weight of each result can be set flexibly, for example based on the word order information of the target text word relative to the target category keyword.

For example, the target text words "brand", "complete", "buckle" and "same number" are each classified to obtain a classification result of "positive category" or "negative category", and these classification results can then be integrated to obtain the classification result "positive category" of the target category keyword "satellite" to which these target text words correspond.
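One possible weighted integration is sketched below. The text only states that the weights may depend on word order relative to the keyword; the inverse-distance weight 1/distance, the function name `integrate_weighted` and the tie-break parameter are assumptions chosen for illustration.

```python
def integrate_weighted(results, tie_break="positive"):
    """Combine (label, distance) pairs into one label.

    `distance` is the number of positions between the target text word
    and the target category keyword; nearer words weigh more (assumption).
    """
    score = 0.0
    for label, distance in results:
        weight = 1.0 / distance
        score += weight if label == "positive" else -weight
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return tie_break


# Hypothetical word-level results around one keyword.
keyword_result = integrate_weighted(
    [("positive", 2), ("positive", 1), ("negative", 1), ("positive", 2)]
)
```

With these hypothetical inputs the weighted score is 0.5 + 1 - 1 + 0.5 = 1, so the keyword is assigned the positive category.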

respectively counting the occurrence frequency of the positive characteristic words and the negative characteristic words in all text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category; and based on the positive word frequency and the negative word frequency, carrying out classification calculation on each target text word corresponding to the target category keyword to obtain a classification result of each target text word.

When texts are classified, a batch may contain one or more texts to be classified. Before each target text word is classified, the following may be counted over all the texts to be classified: the occurrence probability of every positive feature word and every negative feature word, the occurrence frequency of the target text word, and the probability that the target text word occurs together with each positive or negative feature word. A calculation result for the target text word may then be obtained through an algorithm, from which the classification result of the target text word is determined.

For example, the target text words corresponding to a target category keyword "satellite" in the text to be classified may be classified using the sentiment-orientation pointwise mutual information algorithm (SO-PMI, where PMI stands for Pointwise Mutual Information).

After the mutual information between the target text word and all the feature words has been calculated, the sentiment-orientation mutual information SO_PMI of the target text word can be obtained:

$$\mathrm{SO\_PMI}(word) = \sum_{pw \in P_{set}} \mathrm{PMI}(word, pw) - \sum_{nw \in N_{set}} \mathrm{PMI}(word, nw)$$

wherein $P_{set}$ is the set of positive feature words of the preset category, $pw$ is a positive feature word, $N_{set}$ is the set of negative feature words of the preset category, and $nw$ is a negative feature word.

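The mutual information PMI may be calculated as:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

wherein $w_1$ is a target text word, $w_2$ is a feature word, $P(w_1)$ and $P(w_2)$ are their respective occurrence probabilities, and $P(w_1, w_2)$ is the probability that they occur together.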

Specifically, the target text words of "satellite" may include "brand", "complete", "buckle" and "same number"; the positive feature words may be "original factory, consultation" and the negative feature words may be "reject, hit". For example, before the classification result of the target text word "brand" is determined, P(original factory), P(consultation), P(reject), P(hit), P(brand), and the co-occurrence probabilities P(brand, original factory) (that is, the probability that "brand" and "original factory" occur together), P(brand, consultation), P(brand, reject) and P(brand, hit) may be determined over all the texts to be classified. From these probability values the calculation result for the target text word "brand" is obtained and its classification result determined. The calculation may be as follows:

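$$\mathrm{SO\_PMI}(\text{brand}) = \mathrm{PMI}(\text{brand}, \text{original factory}) + \mathrm{PMI}(\text{brand}, \text{consultation}) - \mathrm{PMI}(\text{brand}, \text{reject}) - \mathrm{PMI}(\text{brand}, \text{hit})$$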

PMI(brand, original factory) may be calculated as:

$$\mathrm{PMI}(\text{brand}, \text{original factory}) = \log \frac{P(\text{brand}, \text{original factory})}{P(\text{brand})\,P(\text{original factory})}$$

When SO_PMI(brand) is greater than 0, the classification result of the target text word "brand" may be determined as the positive category; when SO_PMI(brand) is less than 0, the classification result of the target text word "brand" may be determined as the negative category.

After the classification calculation has been completed in turn for the target text words "brand", "complete", "buckle" and "same number", each of them has a classification result of "positive category" or "negative category".
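The SO-PMI calculation described above can be sketched as follows. The sketch assumes the standard PMI definition log(P(w1, w2) / (P(w1) P(w2))), estimates each probability as the fraction of texts containing the word or word pair, uses a base-2 logarithm (the base does not change the sign of the score), and treats a pair that never co-occurs as contributing 0. All function names and the sample texts are hypothetical.

```python
import math


def probability(texts, *words):
    """Fraction of texts that contain all of the given words."""
    hits = sum(1 for text in texts if all(w in text for w in words))
    return hits / len(texts)


def pmi(texts, w1, w2):
    joint = probability(texts, w1, w2)
    if joint == 0:
        return 0.0  # assumption: no co-occurrence is treated as neutral
    return math.log2(joint / (probability(texts, w1) * probability(texts, w2)))


def so_pmi(texts, word, positive_words, negative_words):
    return (sum(pmi(texts, word, pw) for pw in positive_words)
            - sum(pmi(texts, word, nw) for nw in negative_words))


def classify_word(texts, word, positive_words, negative_words):
    score = so_pmi(texts, word, positive_words, negative_words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undetermined"  # score == 0 is not covered by the text


# Hypothetical batch of segmented texts, each represented as a set of words.
texts = [
    {"brand", "original factory"},
    {"brand", "original factory", "consultation"},
    {"reject", "hit"},
    {"brand", "hit"},
]
label = classify_word(texts, "brand",
                      ["original factory", "consultation"], ["reject", "hit"])
```

In this toy batch "brand" co-occurs with the positive feature words more often than chance and with "hit" less often than chance, so its score is above 0 and it is assigned the positive category.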

In some embodiments of the present application, the step of integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword may include:

counting the target text words whose classification result is the positive category and those whose classification result is the negative category, to obtain a positive number and a negative number; and determining the classification result of the target category keyword based on the positive number and the negative number.

For example, the count may give a positive number of 3 and a negative number of 1, on which basis the classification result of the target category keyword "satellite" may be determined to be the positive category.

In some embodiments of the present application, the step of determining the classification result of the target category keyword based on the positive number and the negative number may include:

when the positive number is greater than the negative number, determining that the classification result of the target category keyword is the positive category; and when the positive number is less than the negative number, determining that the classification result of the target category keyword is the negative category.

In addition, when the positive number is equal to the negative number, the classification result of the target category keyword may be either the positive category or the negative category, which can be set flexibly according to actual requirements.

For example, since the positive number 3 is greater than the negative number 1, the classification result of the target category keyword may be determined to be the positive category.
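The counting rule above, including the configurable handling of ties, can be sketched as follows. The function name `integrate_by_count` and the `tie_break` parameter are assumptions for illustration.

```python
def integrate_by_count(labels, tie_break="positive"):
    """Majority vote over word-level labels; ties use the configured default."""
    positive = labels.count("positive")
    negative = labels.count("negative")
    if positive > negative:
        return "positive"
    if positive < negative:
        return "negative"
    return tie_break


# Positive number 3, negative number 1, as in the example above.
keyword_result = integrate_by_count(["positive", "positive", "negative", "positive"])
```

Passing `tie_break="negative"` makes the rule conservative when the two counts are equal, which is one way to set the tie behaviour according to actual requirements.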

106. Integrate the classification result of each target category keyword to obtain the category of the text to be classified.

The category of the text to be classified may be the positive category or the negative category. The text to be classified may contain multiple target category keywords, and integrating the classification result of each of them determines the category of the text.

For example, the classification results of the target category keywords may be counted, and the category with the largest count is taken as the category of the text to be classified.

In some embodiments of the present application, the word stock includes category keywords, positive feature words and negative feature words corresponding to each of a plurality of preset categories, and the step of classifying, based on the positive feature words and the negative feature words, the target text words on the preset category to obtain the classification result of the target category keywords may include:

based on the positive feature words and the negative feature words of each preset category, classifying the target text words corresponding to the target category keywords on each preset category to obtain the classification result of the target category keywords on each preset category.

At this time, the step of integrating the classification result of each target category keyword to obtain the category of the text to be classified may include:

and integrating classification results of the target category keywords on each preset category to determine the category of the text to be classified.

In an actual application scenario, text classification may target multiple categories. For example, text categories such as advertisements, illegal content and vulgar terms may collectively be called junk categories. When a text is classified, it is therefore necessary to judge whether it belongs to each of these categories and then determine its category (junk or non-junk).

Therefore, after the target category keywords in the text to be classified are determined, the target text words corresponding to each target category keyword need to be classified on each preset category based on the positive feature words and the negative feature words of that category, giving the classification result of the target category keyword on each preset category. The classification results of the target category keywords of the text on each preset category are then integrated to determine the category of the text to be classified.

In some embodiments of the present application, each preset category includes a positive sub-category and a negative sub-category, and the step of integrating the classification result of each target category keyword on each preset category to determine the category of the text to be classified may include:

determining the sub-category of the text to be classified on each preset category according to the sub-category of each target category keyword on each preset category; and integrating the sub-categories of the texts to be classified on all preset categories to obtain the categories of the texts to be classified.

On each preset category, the classification ultimately determines whether the text to be classified belongs to that category; that is, each preset category includes a positive sub-category (belonging to the preset category) and a negative sub-category (not belonging to the preset category). After the sub-category of the text on each preset category is determined, the category of the text may be determined from its sub-categories on all preset categories. For example, if the recognition result of the text to be classified on the advertisement category and the vulgar-term category is the negative sub-category and its recognition result on the illegal category is the positive sub-category, the text may be determined to be junk, specifically the illegal category of junk. As another example, if the recognition results of the text on the advertisement category, the vulgar-term category and the illegal category are all the negative sub-category, integrating all the sub-categories determines that the text is non-junk.
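The integration of sub-categories across preset categories can be sketched as follows. The sketch assumes that a text is junk as soon as it falls in the positive sub-category of at least one preset category, which matches the two examples above; the function name `text_category` and the category labels are illustrative.

```python
def text_category(sub_categories):
    """Map {preset category: 'positive' | 'negative'} to an overall category.

    Returns ("junk", [matched preset categories]) or ("non-junk", []).
    """
    matched = [name for name, label in sub_categories.items() if label == "positive"]
    return ("junk", matched) if matched else ("non-junk", [])


first = text_category(
    {"advertisement": "negative", "vulgar term": "negative", "illegal": "positive"}
)
second = text_category(
    {"advertisement": "negative", "vulgar term": "negative", "illegal": "negative"}
)
```

Returning the matched preset categories alongside the overall label keeps the finer-grained result (for example, "illegal") available to the caller.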

Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, fast-growing and diversified information asset that requires new processing modes to deliver stronger decision-making, insight-discovery and process-optimization capabilities. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are required to process large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.

In an actual application scenario, a large amount of data may be involved, such as a large number of texts to be classified, feature words and category keywords, and the relevant steps of the text classification method may be completed with big-data technologies so as to obtain a classification result with higher accuracy.

The method may first acquire a text to be classified and a word stock for text classification, where the word stock includes category keywords, positive feature words and negative feature words corresponding to a preset category. The text to be classified is then segmented to obtain a plurality of text words and their word order information, and the target category keywords present in the text are determined according to the text words and the category keywords. Next, based on the word order information of the text words and the target category keywords, the target text words associated with the positions of the target category keywords are determined from the text to be classified. The target text words corresponding to each target category keyword are classified on the preset category based on the positive feature words and the negative feature words to obtain the classification result of that keyword, and finally the classification results of all target category keywords are integrated to obtain the category of the text to be classified.

In this scheme, the target category keywords in the text to be classified are determined from the category keywords of the preset category in the word stock, and the target text words associated with the positions of the target category keywords are determined from the word order information obtained by segmenting the text. The target text words are then classified, the classification result of each target category keyword is determined from the classification results of its target text words, and finally the category of the text is determined from the classification results of all its target category keywords. Because the classification result of a target category keyword is determined from its corresponding target text words, the accuracy of the text classification result can be significantly improved.

The method described in the above embodiments is described in further detail below by way of example.

In this embodiment, classification of junk text is taken as an example. Junk text may be a type of text that includes a plurality of preset categories, such as an advertisement category, an illegal information category and a vulgar-term category. The text classification method of this embodiment may be widely applied to scenarios that require junk text identification, such as junk mail, junk bullet comments (barrages) and junk short messages. Referring to fig. 3, during video playback the bullet comments may include junk such as "highly imitation ball shoes brand alignment +v78666" and "poplar woodcarving brand warmth 40 West letter 1111". With this method the bullet-comment content can be classified and the junk bullet comments screened out before display, which greatly improves the viewing experience of users; referring to fig. 4, the displayed bullet comments are then normal comments related to the played content.

In this embodiment, the text classification method is described taking junk text classification performed by a server as an example; for a flowchart of this embodiment, refer to fig. 5:

201. The server receives the text to be classified and loads a word stock for text classification, where the word stock includes category keywords, positive feature words and negative feature words of a plurality of preset categories.

For example, the preset categories include a restriction category, an advertisement promotion category, and a vulgar-term category.

202. The server segments the text to be classified to obtain the text words of the text to be classified and the word order information of the text words.

203. The server determines target category keywords in the text to be classified, wherein the target category keywords are text words which are the same as the category keywords in the text to be classified.

For example, it may be determined that the target category keywords of the restriction category in the text to be classified are word 1 and word 2; the target category keywords of the advertisement promotion category are word 3 and word 4; and the target category keywords of the vulgar-term category are word 5 and word 6.

204. The server determines the target text words associated with the target category keywords in the text to be classified according to the word order information of the text words.

For example, the target text word corresponding to word 1 (target category keyword) is: word 7, word 8, word 9, word 10.

205. The server classifies each target text word of the target category keywords according to the positive characteristic word and the negative characteristic word of each preset category, and a classification result of each target text word on each preset category is obtained.

The classification results may include a positive category and a negative category, for example, the classification results of an advertisement promotion category may include a positive category (representing the word or text as an advertisement promotion category) and a negative category (representing the word or text as a non-advertisement promotion category).

For example, the classification results of word 7 (a target text word), which corresponds to word 1 (a target category keyword), on the restriction category, the advertisement promotion category and the vulgar-term category are the positive category, the negative category and the negative category respectively.

206. The server integrates the classification results of all the target text words of the target category keywords on each preset category to obtain the classification results of the target category keywords on each preset category.

For example, integrating all the target text words of word 1 (a target category keyword) gives classification results for word 1 on the restriction category, the advertisement promotion category and the vulgar-term category of the positive category, the positive category and the negative category respectively.

207. The server integrates the classification results of all target category keywords of the text to be classified on each preset category to obtain the classification result of the text to be classified on each preset category.

For example, the classification results of all the target category keywords of the text to be classified (word 1, word 2, word 3, word 4, word 5 and word 6) on the restriction category, the advertisement promotion category and the vulgar-term category are integrated, and the classification results of the text to be classified on these three categories are determined to be the positive category, the negative category and the negative category respectively.

208. The server determines the category of the text to be classified based on the classification result of the text to be classified on each preset category.

For example, based on the classification results of the text to be classified on the restriction category, the advertisement promotion category and the vulgar-term category, it may be determined that the text to be classified is restriction-category junk text.
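Steps 206 to 208 can be sketched as three nested integrations. The sketch assumes plain majority voting at each level with ties resolved to the negative category, and a nested dictionary of word-level labels as input; the function names, the data layout and the sample labels are all hypothetical.

```python
def majority(labels, tie_break="negative"):
    positive = labels.count("positive")
    negative = labels.count("negative")
    if positive == negative:
        return tie_break
    return "positive" if positive > negative else "negative"


def classify_text(word_results):
    """word_results: {keyword: {category: [labels of its target text words]}}."""
    # Step 206: one result per target category keyword and preset category.
    keyword_results = {
        keyword: {cat: majority(labels) for cat, labels in per_cat.items()}
        for keyword, per_cat in word_results.items()
    }
    # Step 207: one result per preset category for the whole text.
    categories = sorted({cat for per_cat in keyword_results.values() for cat in per_cat})
    text_results = {
        cat: majority([per_cat[cat] for per_cat in keyword_results.values() if cat in per_cat])
        for cat in categories
    }
    # Step 208: the preset categories on which the text is positive.
    matched = [cat for cat, label in text_results.items() if label == "positive"]
    return text_results, matched


P, N = "positive", "negative"
word_results = {
    "word 1": {"restriction": [P, P, P, N], "advertisement": [P, P, N], "vulgar term": [N, N]},
    "word 2": {"restriction": [P, N, P], "advertisement": [N, N, P], "vulgar term": [N]},
    "word 3": {"restriction": [N, N], "advertisement": [N], "vulgar term": [N, N, P]},
}
text_results, matched = classify_text(word_results)
```

With these hypothetical labels the text is positive only on the restriction category, so it would be reported as restriction-category junk text, mirroring the example above.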

Referring to fig. 6, the application may first construct junk text reference words (i.e. category keywords) and obtain the positive and negative samples of a classification training set (i.e. positive sample data and negative sample data of a preset category), and then mine the positive and negative context feature words of the junk words (i.e. positive feature words and negative feature words) through frequent sequence pattern mining. The text to be classified is then classified: the junk words (target category keywords) are matched, the words in an N-gram window around them are taken as context feature words (target text words), the junk classification polarity (i.e. classification result) of the window words (i.e. the target text words) is calculated with SO-PMI, and finally the text classification category (i.e. the category of the text to be classified) is obtained by integrating the classification polarities of the feature words (i.e. the classification results of the target category keywords).

In this application, after the target category keywords in the text to be classified are determined, the target text words associated with their positions are determined, and the classification result of each target category keyword is determined from the classification results of its target text words rather than simply from the keyword itself. The category of the text to be classified is then determined from these results, which can effectively improve the accuracy of text classification.

In order to facilitate better implementation of the text classification method provided by the embodiment of the application, the embodiment of the application also provides a device based on the text classification method. The terms have the same meanings as in the text classification method described above; for specific implementation details, refer to the description of the method embodiments.

As shown in fig. 7, fig. 7 is a schematic structural diagram of a text classification device according to an embodiment of the present application, where the text classification device may include an obtaining module 301, a word segmentation module 302, a first determining module 303, a second determining module 304, a classification module 305, and an integrating module 306, where,

The obtaining module 301 is configured to obtain a text to be classified and a word stock for classifying the text, where the word stock includes a category keyword, a positive feature word, and a negative feature word corresponding to a preset category;

the word segmentation module 302 is configured to segment a text to be classified to obtain a plurality of text words and word order information of the text words;

a first determining module 303, configured to determine, according to the text word and the category keyword, a target category keyword that exists in the text to be classified;

a second determining module 304, configured to determine, from the text to be classified, a target text word associated with the target category keyword based on word order information of the text word and the target category keyword;

the classification module 305 is configured to classify, on the basis of the positive feature word and the negative feature word, the target text word corresponding to the target category keyword on a preset category, so as to obtain a classification result of the target category keyword;

and the integration module 306 is used for integrating the classification result of each target category keyword to obtain the category of the text to be classified.

In some embodiments of the present application, referring to fig. 8, classification module 305 includes a classification sub-module 3051 and an integration sub-module 3052, wherein,

the classification sub-module 3051 is configured to classify each target text word corresponding to the target category keyword on the preset category based on the positive feature word and the negative feature word, so as to obtain a classification result of each target text word;

And the integration submodule 3052 is used for integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword.

the statistics unit is used for respectively counting the occurrence frequency of the positive characteristic words and the negative characteristic words in all text words to obtain the positive word frequency and the negative word frequency of the target category keywords in the preset category;

the expansion module is used for performing synonym expansion on the category reference words to obtain category keywords of a preset category;

the processing module is used for processing the sample data based on the category keywords, determining positive characteristic words and negative characteristic words of preset categories, and obtaining a word stock, wherein the word stock comprises the category keywords, the positive characteristic words and the negative characteristic words corresponding to the preset categories.

dividing sample data based on category keywords to obtain positive sample data and negative sample data;

and respectively carrying out feature word mining on the positive sample data and the negative sample data based on a preset threshold value and category keywords, and determining positive feature words and negative feature words of a preset category.
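The sample division and feature word mining described above can be illustrated with the following sketch. Several points are assumptions made for the example only: a sample counts as positive when it contains at least one category keyword, the preset threshold is applied to document frequency, and a word is kept as a feature word of one side only when it is more frequent on that side. All function names are hypothetical.

```python
from collections import Counter


def split_samples(samples, category_keywords):
    """Divide segmented samples into positive and negative sample data."""
    positive, negative = [], []
    for words in samples:
        if category_keywords & set(words):
            positive.append(words)
        else:
            negative.append(words)
    return positive, negative


def document_frequency(samples):
    """Count in how many samples each word occurs."""
    df = Counter()
    for words in samples:
        df.update(set(words))
    return df


def mine_feature_words(samples, category_keywords, threshold):
    """Return (positive feature words, negative feature words)."""
    positive, negative = split_samples(samples, category_keywords)
    pos_df, neg_df = document_frequency(positive), document_frequency(negative)
    # The category keywords themselves are not treated as feature words here.
    vocabulary = (set(pos_df) | set(neg_df)) - category_keywords
    positive_words = {w for w in vocabulary
                      if pos_df[w] >= threshold and pos_df[w] > neg_df[w]}
    negative_words = {w for w in vocabulary
                      if neg_df[w] >= threshold and neg_df[w] > pos_df[w]}
    return positive_words, negative_words
```

The resulting sets, together with the category keywords, would form the word stock for the preset category.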

In this embodiment, the obtaining module 301 may first obtain a text to be classified and a word stock for text classification, where the word stock includes category keywords, positive feature words and negative feature words corresponding to a preset category. The word segmentation module 302 may then segment the text to be classified to obtain a plurality of text words and word order information of the text words. The first determining module 303 may determine, according to the text words and the category keywords, the target category keywords present in the text to be classified, and the second determining module 304 may then determine, based on the word order information of the text words and the target category keywords, the target text words associated with the positions of the target category keywords in the text to be classified. The classification module 305 may classify, based on the positive feature words and the negative feature words, the target text words corresponding to each target category keyword on the preset category to obtain a classification result of that target category keyword. Finally, the integration module 306 may integrate the classification result of each target category keyword to obtain the category of the text to be classified.
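The cooperation of modules 301 to 306 can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: segmentation is done by whitespace splitting (a real system for Chinese text would use a word segmenter), a word is treated as associated with a keyword when it lies within a hypothetical `window` of positions around it, and both the per-keyword result and the final category are decided by simple majority. The dictionary layout of `word_stock` is likewise an assumption.

```python
from collections import Counter


def classify_text(text, word_stock, window=2):
    """Classify a text on one preset category; returns 'positive', 'negative' or None."""
    # Word segmentation: list index serves as the word order information.
    words = text.lower().split()
    keyword_results = []
    for position, word in enumerate(words):
        # Keep only category keywords that actually occur in the text.
        if word not in word_stock["category_keywords"]:
            continue
        # Target text words: neighbours within `window` positions of the keyword.
        lo, hi = max(0, position - window), position + window + 1
        neighbours = [w for i, w in enumerate(words[lo:hi], start=lo) if i != position]
        # Classify the neighbours and integrate them per keyword occurrence.
        positive = sum(w in word_stock["positive_words"] for w in neighbours)
        negative = sum(w in word_stock["negative_words"] for w in neighbours)
        if positive != negative:
            keyword_results.append("positive" if positive > negative else "negative")
    # Integrate the results of all keyword occurrences into the text category.
    tally = Counter(keyword_results)
    if tally["positive"] == tally["negative"]:
        return None
    return "positive" if tally["positive"] > tally["negative"] else "negative"
```

Restricting the evidence to words near each keyword is what distinguishes this flow from a plain keyword search over the whole text: a feature word only counts when its position ties it to a category keyword.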

In addition, an embodiment of the present application further provides a computer device, which may be a terminal or a server. Fig. 9 shows a schematic structural diagram of the computer device according to the embodiment of the present application. Specifically:

The computer device may include a processor 401 having one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the structure shown in FIG. 9 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:

The processor 401 is the control center of the computer device. It connects the various parts of the entire computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The computer device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are performed through the power management system. The power supply 403 may also include any one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:

Acquiring a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; word segmentation is carried out on the text to be classified, so that a plurality of text words and word sequence information of the text words are obtained; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining target text words associated with the target category keywords from the text to be classified based on word order information of the text words and the target category keywords; classifying the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by a computer program, or by a computer program controlling related hardware; the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present application further provides a storage medium in which a computer program is stored, the computer program being capable of being loaded by a processor to perform the steps of any of the text classification methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:

acquiring a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; word segmentation is carried out on the text to be classified, so that a plurality of text words and word sequence information of the text words are obtained; determining target category keywords existing in the text to be classified according to the text words and the category keywords; determining a target text word associated with the position of the target category keyword from the text to be classified based on word order information of the text word and the target category keyword; classifying the target text words corresponding to the target category keywords on the preset categories based on the positive characteristic words and the negative characteristic words to obtain classification results of the target category keywords; and integrating the classification result of each target category keyword to obtain the category of the text to be classified.

The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.

Because the computer program stored in the storage medium can execute the steps of any text classification method provided by the embodiments of the present application, it can achieve the beneficial effects of any such method; these are detailed in the previous embodiments and are not repeated here.

The text classification method and device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help in understanding the method and core ideas of the present application. Meanwhile, those skilled in the art may, in light of the ideas of the present application, make changes to the specific implementations and the scope of application. The content of this description should therefore not be construed as limiting the present application.

Claims

1. A method of text classification, comprising:

acquiring a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; the positive characteristic word represents the characteristic word belonging to the preset category, and the negative characteristic word represents the characteristic word not belonging to the preset category;

2. The method of claim 1, wherein the classifying, based on the positive feature word and the negative feature word, the target text word corresponding to the target category keyword on the preset category to obtain the classification result of the target category keyword includes:

classifying each target text word corresponding to the target category keyword on the preset category based on the positive characteristic word and the negative characteristic word to obtain a classification result of each target text word;

and integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword.

3. The method according to claim 2, wherein the classifying each target text word corresponding to the target category keyword on the preset category based on the positive feature word and the negative feature word to obtain a classification result of each target text word includes:

respectively counting the occurrence frequencies of the positive characteristic words and the negative characteristic words in all text words to obtain positive word frequency and negative word frequency of the target category keywords on the preset category;

and based on the positive word frequency and the negative word frequency, carrying out classification calculation on each target text word corresponding to the target category keyword to obtain a classification result of each target text word.

4. The method according to claim 2, wherein the classification result includes a positive category and a negative category, and the integrating the classification result of each target text word corresponding to the target category keyword to obtain the classification result of the target category keyword includes:

respectively counting target text words with positive and negative classification results to obtain positive and negative numbers;

and determining the classification result of the target category keyword based on the positive number and the negative number.

5. The method of claim 4, wherein the determining the classification result of the target category keyword based on the positive number and the negative number comprises:

6. The method according to claim 1, wherein the method further comprises:

acquiring category reference words and sample data of a preset category;

performing near-sense expansion on the category reference words to obtain category keywords of the preset category;

and processing the sample data based on the category keywords, and determining positive characteristic words and negative characteristic words of the preset category to obtain a word stock, wherein the word stock comprises the category keywords, the positive characteristic words and the negative characteristic words corresponding to the preset category.

7. The method of claim 6, wherein the processing the sample data based on the category keywords to determine positive and negative feature words of the preset category comprises:

8. The method according to claim 1, wherein the word stock includes category keywords corresponding to a plurality of preset categories, positive feature words and negative feature words, the classifying the target text words on the preset categories based on the positive feature words and the negative feature words to obtain classification results of the target category keywords includes:

the step of integrating the classification result of each target category keyword to obtain the category of the text to be classified comprises the following steps:

9. The method of claim 8, wherein each preset category includes a positive sub-category and a negative sub-category, wherein said integrating the classification result of each target category keyword on each preset category, determining the category of the text to be classified, comprises:

10. A text classification device, comprising:

the obtaining module is used for obtaining a text to be classified and a word stock for text classification, wherein the word stock comprises category keywords, positive characteristic words and negative characteristic words corresponding to preset categories; the positive characteristic word represents the characteristic word belonging to the preset category, and the negative characteristic word represents the characteristic word not belonging to the preset category;

11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when run on a computer, causes the computer to perform the text classification method according to any of claims 1-10.

12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the text classification method of any of claims 1-10 when the computer program is executed by the processor.