CN108491406B - Information classification method and device, computer equipment and storage medium

Info

Publication number: CN108491406B
Application number: CN201810065116.1A
Authority: CN
Inventors: 康平陆; 杨新宇; 陈钦明
Original assignee: Shenzhen Axmtec Co ltd
Current assignee: Shenzhen Axmtec Co ltd
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2021-09-24
Anticipated expiration: 2038-01-23
Also published as: CN108491406A

Abstract

The application relates to an information classification method, an information classification device, computer equipment and a storage medium. The method comprises the following steps: obtaining information to be classified, and performing word segmentation on the information to be classified to obtain a corresponding original word set; respectively obtaining synonyms corresponding to all original words in the original word set, forming an expanded word set by the original words and the corresponding synonyms, wherein each original word has a corresponding expanded word set; forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to each original word; and inputting the extended classification information set into a trained multi-classification model to obtain a target class corresponding to the information to be classified.

Description

Information classification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to an information classification method and apparatus, a computer device, and a storage medium.

Background

With the development of computer technology and the rapid growth of information content, more and more information is available, which greatly facilitates communication between users. However, because information is vast, varied and disorderly, it becomes harder for users to search for information and find content of interest. Information classification is a key technology for organizing and processing large amounts of information; it can alleviate the problem of information disorder to a certain extent and lets users select information according to their needs.

However, current information classification algorithms suffer from low precision and have difficulty determining the category that corresponds to the user's actual intent.

Disclosure of Invention

In view of the above technical problems, it is necessary to provide an information classification method, an apparatus, a computer device, and a storage medium capable of improving the accuracy of information classification.

A method of information classification, the method comprising:

obtaining information to be classified, and performing word segmentation on the information to be classified to obtain a corresponding original word set;

respectively obtaining synonyms corresponding to all original words in the original word set, forming an expanded word set by the original words and the corresponding synonyms, wherein each original word has a corresponding expanded word set;

forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to each original word;

and inputting the extended classification information set into a trained multi-classification model to obtain a target class corresponding to the information to be classified.

In one embodiment, the step of generating the trained multi-class model comprises:

acquiring training corpus data, wherein the training corpus data comprises a plurality of training corpus information, and each training corpus information has a corresponding standard category label;

performing word segmentation on each training corpus information to obtain an original training word set corresponding to each training corpus information;

respectively obtaining synonyms corresponding to all original training words in the original training word set, forming an expanded training word set by the original words and the corresponding synonyms, wherein each original training word has a corresponding expanded training word set;

forming an expanded training classification information set corresponding to each training corpus information according to the expanded training word set corresponding to each original training word;

training the multi-classification model through a support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label;

and obtaining the trained target multi-classification model.

In one embodiment, the multi-classification model includes a plurality of sub-binary classification models, and the step of inputting the extended classification information set into the trained multi-classification model to obtain the target class corresponding to the information to be classified includes:

acquiring a first sub-binary classification model in the multi-classification model as a current sub-binary classification model;

inputting the expanded classification information set into the current sub-binary classification model to obtain corresponding current sub-category information, judging according to the current sub-category information whether the expanded classification information set needs to be input into a next sub-binary classification model, if so, acquiring the next sub-binary classification model, taking it as the current sub-binary classification model, and returning to the step of inputting the expanded classification information set into the current sub-binary classification model;

and if not, taking the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.

In one embodiment, the step of training the multi-classification model through the support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label includes:

acquiring a feature item, and calculating the word frequency weight of the feature item in the expanded training classification information corresponding to a first category;

calculating the document frequency of the feature item in the whole training corpus data;

calculating the feature weight corresponding to the feature item according to the word frequency weight and the document frequency;

selecting the feature item as a feature word of the first category according to the feature weight;

and extracting the features of each expanded training classification information in the expanded training classification information set according to the feature words.

In one embodiment, before the step of inputting the extended classification information set into the trained multi-classification model to obtain the target class corresponding to the information to be classified, the method further includes:

inputting the extended classification information set into a trained binary classification model to obtain an initial class corresponding to the information to be classified, and inputting the information to be classified into a first module when the initial class is a first preset class;

and when the initial category is a second preset category, entering the step of inputting the extended classification information set into a trained multi-classification model to obtain the category corresponding to the information to be classified.

In one embodiment, the first preset category is a non-service category, the second preset category is a service category, and the step of acquiring the information to be classified includes:

and acquiring banking problems or chat information input by a user in real time.

An information classification apparatus, the apparatus comprising:

the word segmentation module is used for acquiring information to be classified and segmenting words of the information to be classified to obtain a corresponding original word set;

the expansion module is used for respectively acquiring synonyms corresponding to all original words in the original word set, forming an expansion word set by the original words and the corresponding synonyms, wherein each original word has a corresponding expansion word set, and forming an expansion classification information set corresponding to the information to be classified according to the expansion word set corresponding to all the original words;

and the category determining module is used for inputting the expanded classification information set into the trained multi-classification model to obtain a target category corresponding to the information to be classified.

In one embodiment, the apparatus further comprises:

a training module for obtaining training corpus data, wherein the training corpus data includes a plurality of training corpus information, each training corpus information has a corresponding standard class label, each training corpus information is participled to obtain an original training word set corresponding to each training corpus information, synonyms corresponding to each original training word in the original training word set are respectively obtained, the original words and the corresponding synonyms form an extended training word set, each original training word has a corresponding extended training word set, an extended training classification information set corresponding to each training corpus information is formed according to the extended training word set corresponding to each original training word, a multi-classification model is trained through a support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label, and obtaining the trained target multi-classification model.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

According to the information classification method, the device, the computer equipment and the storage medium, the information to be classified is obtained and segmented to obtain the corresponding original word set. Synonyms corresponding to the original words in the original word set are respectively obtained, and each original word together with its synonyms forms an expanded word set, so that each original word has a corresponding expanded word set. The expanded classification information set corresponding to the information to be classified is formed according to the expanded word sets corresponding to the original words, and is input into the trained multi-classification model to obtain the target class corresponding to the information to be classified. Because the expanded word set of each original word is formed first and the expanded classification information set is then formed from these expanded word sets, the degree of expansion is greatly improved. Each piece of expanded classification information expresses a meaning the same as or similar to that of the information to be classified, which improves the effective coverage of the information to be classified, so that the accuracy of the target category can be improved after the set is input into the trained multi-classification model.

Drawings

FIG. 1 is a diagram of an exemplary environment in which the information classification method may be implemented;

FIG. 2 is a flow diagram illustrating a method for information classification in one embodiment;

FIG. 3 is a schematic flow chart illustrating obtaining a trained target multi-classification model according to one embodiment;

FIG. 4 is a flow diagram illustrating obtaining a target class in one embodiment;

FIG. 5 is a schematic flow chart of feature extraction in one embodiment;

FIG. 6 is a flow chart illustrating a method of information classification in another embodiment;

FIG. 7 is a block diagram showing the structure of an information classification apparatus according to an embodiment;

FIG. 8 is a block diagram showing the construction of an information classifying apparatus according to another embodiment;

FIG. 9 is a block diagram of the structure of a category determination module;

FIG. 10 is a block diagram showing the construction of an information classifying apparatus according to still another embodiment;

FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The information classification method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 can obtain information to be classified input by a user and send it to the server 104. The server 104 performs word segmentation on the information to be classified to obtain a corresponding original word set, respectively obtains synonyms corresponding to the original words in the original word set, and forms an expanded word set from each original word and its synonyms, thereby obtaining an expanded word set corresponding to each original word. The server 104 then forms an expanded classification information set corresponding to the information to be classified according to the expanded word sets, and inputs the expanded classification information set into a trained multi-classification model to obtain a target class corresponding to the information to be classified. Because the information to be classified is expanded through synonyms before classification, and each piece of expanded classification information expresses a meaning the same as or similar to that of the information to be classified, the effective coverage of the information to be classified is improved, so the accuracy of the target class can be improved after the set is input into the trained multi-classification model. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In one embodiment, as shown in FIG. 2, an information classification method is provided. The method is described here as applied to the server in FIG. 1, and includes the following steps:

Step S210, obtaining information to be classified, and performing word segmentation on the information to be classified to obtain a corresponding original word set.

The information to be classified is information for which a target category needs to be determined. It can be text information, voice information or image information; voice or image information can first be converted into text information through speech recognition or image recognition. The information to be classified may be information stored in the server, or information submitted by a user at the terminal and received by the server in real time. It may be short text or long text. It may be declarative information or question information, where question information is a question for which a corresponding answer exists.

Specifically, the information to be classified is segmented by a word segmentation algorithm to obtain individual words, and these words form the original word set. In one embodiment, after the words are obtained, words that have little influence on classification, such as stop words, modal particles and punctuation marks, are removed, which improves the efficiency of subsequent feature extraction. Stop words are words that occur in a text more often than a predetermined threshold but carry little practical meaning, e.g., "me", "he", etc.
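The segmentation and filtering in step S210 can be sketched as follows. This is a minimal illustration under stated assumptions, not the segmentation algorithm of the application: the regular-expression tokenizer is a stand-in that only suits space-delimited English text (Chinese text would need a dedicated word segmenter), and `STOP_WORDS`, `segment` and `original_word_set` are hypothetical names.

```python
import re

# Hypothetical stop-word list; a real system would load a curated list.
STOP_WORDS = {"i", "me", "he", "the", "a", "is", "my"}

def segment(text):
    """Stand-in segmenter: lowercase the text and keep alphanumeric runs,
    which also discards punctuation marks."""
    return re.findall(r"[a-z0-9]+", text.lower())

def original_word_set(text):
    """Return the segmented words in order of appearance, minus stop words."""
    return [word for word in segment(text) if word not in STOP_WORDS]

words = original_word_set("When is my credit card repaid?")
print(words)
```

The result is kept as an ordered list rather than an unordered set because the later expansion step relies on the order in which the original words appear.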

Step S220, synonyms corresponding to all original words in the original word set are respectively obtained, the original words and the corresponding synonyms form an expanded word set, and each original word has a corresponding expanded word set.

A synonym is a word having the same or similar meaning as the original word. For example, when the original word is "when", the synonyms can be "how long", "what time", etc. The original word and its synonyms form an expanded word set; for example, the expanded word set corresponding to the original word "when" is {when, how long, what time}. If the original word set is {a, b, c}, each original word in it has a corresponding expanded word set; for example, a corresponds to the expanded word set {a, a1, a2}, b to {b, b1, b2, b3}, and c to {c, c1, c2}.
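Step S220 amounts to a dictionary lookup per original word. The sketch below assumes a small in-memory synonym dictionary; the source does not say where synonyms come from, so `SYNONYMS` and the function names are hypothetical.

```python
# Hypothetical synonym dictionary; the source does not specify its origin.
SYNONYMS = {
    "when": ["how long", "what time"],
    "repaid": ["paid back"],
}

def expand_word(word, synonyms=SYNONYMS):
    """Return the expanded word set for one original word:
    the original word first, followed by its synonyms without duplicates."""
    expanded = [word]
    for synonym in synonyms.get(word, []):
        if synonym not in expanded:
            expanded.append(synonym)
    return expanded

def expand_all(original_words, synonyms=SYNONYMS):
    """Return one expanded word set per original word, in the original order."""
    return [expand_word(word, synonyms) for word in original_words]

expanded_sets = expand_all(["when", "credit", "card", "repaid"])
print(expanded_sets)
```

A word with no known synonym still gets an expanded word set containing only itself, so every original word has a corresponding set as the method requires.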

Step S230, forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to each original word.

Specifically, following the order in which the original words appear in the information to be classified, one word is arbitrarily selected from the expanded word set corresponding to each original word, and the selected words are combined in that order to form one piece of expanded classification information. Selecting different words from the expanded word sets forms different expanded classification information, and these together form the expanded classification information set. In one embodiment, the Cartesian product of the expanded word sets corresponding to the original words is computed, and each resulting combination forms one piece of expanded classification information in the corresponding expanded classification information set. The Cartesian product (also called the direct product) of two sets X and Y, denoted X × Y, is the set of all possible ordered pairs whose first element is a member of X and whose second element is a member of Y.
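The Cartesian-product embodiment of step S230 can be illustrated with the {a, b, c} example used earlier. This is a sketch only; joining the chosen words with a space is an assumption made for readability, and `expanded_classification_set` is a hypothetical name.

```python
from itertools import product

def expanded_classification_set(expanded_word_sets):
    """Form every combination that takes one word from each expanded word set,
    preserving the order of the original words."""
    return [" ".join(choice) for choice in product(*expanded_word_sets)]

sets = [
    ["a", "a1", "a2"],
    ["b", "b1", "b2", "b3"],
    ["c", "c1", "c2"],
]
result = expanded_classification_set(sets)
print(len(result))   # 3 * 4 * 3 combinations
print(result[:3])
```

The size of the expanded classification information set is the product of the sizes of the expanded word sets, so it grows quickly with the number of words and synonyms.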

Step S240, inputting the expanded classification information set into the trained multi-classification model to obtain a target class corresponding to the information to be classified.

Specifically, the multi-classification model is used to determine, from a plurality of candidate categories, the target category corresponding to an input. The multi-classification model may be a model trained by a logistic regression algorithm, a support vector machine algorithm, or the like, and may be formed internally by connecting a plurality of sub-binary classification models. Because the input of the trained multi-classification model is the expanded classification information set, in which each piece of expanded classification information expresses a meaning the same as or similar to that of the information to be classified, the effective coverage of the information to be classified is improved, and the accuracy of the target category can therefore be improved.

In this embodiment, the information to be classified is obtained and segmented to obtain a corresponding original word set. Synonyms corresponding to each original word in the original word set are respectively obtained, and each original word together with its synonyms forms an expanded word set, so that each original word has a corresponding expanded word set. An expanded classification information set corresponding to the information to be classified is formed according to the expanded word sets, and is input into the trained multi-classification model to obtain the target class corresponding to the information to be classified. Because the expanded word set of each original word is formed first and the expanded classification information set is then formed from these expanded word sets, the degree of expansion is greatly improved. Each piece of expanded classification information expresses a meaning the same as or similar to that of the information to be classified, which improves the effective coverage of the information to be classified, so that the accuracy of the target category can be improved after the set is input into the trained multi-classification model.

In one embodiment, as shown in FIG. 3, the step of generating the trained multi-class model comprises:

Step S310, obtaining training corpus data, wherein the training corpus data comprises a plurality of training corpus information, and each training corpus information has a corresponding standard category label.

Specifically, the training corpus data may consist of a plurality of training corpus information collected by the server from users' historical behavior. Each training corpus information has a corresponding standard category label describing its actual category. For example, if "overdraft card long-time payment" belongs to the category "when the credit card is paid", then the standard category label of the training corpus information "overdraft card long-time payment" is "when the credit card is paid". The training corpus data includes training corpus information for all candidate categories, so that each category can be determined accurately. In one particular embodiment, the training corpus data includes 476 questions covering 57 standard categories.

Step S320, performing word segmentation on each training corpus information to obtain an original training word set corresponding to each training corpus information.

Specifically, each training corpus information is segmented by a word segmentation algorithm to obtain individual words, and these words form the original training word set corresponding to that training corpus information. In one embodiment, after the words are obtained, words that have little influence on classification, such as stop words, modal particles and punctuation marks, are removed, which improves the efficiency of subsequent feature extraction. Stop words are words that occur in a text more often than a predetermined threshold but carry little practical meaning, e.g., "me", "he", etc.

Step S330, synonyms corresponding to all original training words in the original training word set are respectively obtained, the original words and the corresponding synonyms form an expanded training word set, and each original training word has a corresponding expanded training word set.

A synonym is a word having the same or similar meaning as the original training word. For example, when the original training word is "when", the synonyms can be "how long", "what time", etc. The original training word and its synonyms form an extended training word set; for example, the extended training word set corresponding to the original training word "when" is {when, how long, what time}. If the original training word set corresponding to one piece of training corpus information is {a, b, c}, each original training word in it has a corresponding extended training word set; for example, the extended training word set corresponding to a is {a, a1, a2}, that corresponding to b is {b, b1, b2, b3}, and that corresponding to c is {c, c1, c2}.

Step S340, forming an extended training classification information set corresponding to each training corpus information according to the extended training word set corresponding to each original training word.

Specifically, one piece of corpus information is obtained as current corpus information to be expanded, each current original training word corresponding to the current corpus information to be expanded is obtained, a current expanded training word set corresponding to each current original training word is obtained, then, one word is selected from the current expanded training word set corresponding to each current original training word according to the sequence of appearance of each current original training word in the current corpus information, and a piece of current expanded training classification information is formed according to the sequence. And the different current extension training classification information forms a current extension training classification information set. Each training corpus information has a corresponding extended training classification information set. In one embodiment, a cartesian product is calculated for the extended training word set corresponding to each original training word to form different extended training classification information to form a corresponding extended training classification information set.
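The training-side expansion of steps S330 and S340 can be sketched in the same way as the expansion at classification time. The sketch assumes that every piece of extended training classification information inherits the standard category label of the training corpus information it was derived from; the function name, the synonym dictionary and the sample data are hypothetical.

```python
from itertools import product

def expand_corpus(corpus, synonyms):
    """corpus: list of (original_training_words, standard_label) pairs.
    Returns a list of (extended_training_text, standard_label) pairs."""
    samples = []
    for words, label in corpus:
        # One extended training word set per original training word.
        word_sets = [[w] + [s for s in synonyms.get(w, []) if s != w]
                     for w in words]
        # Cartesian product: pick one word per set, in order of appearance.
        for choice in product(*word_sets):
            samples.append((" ".join(choice), label))  # label inherited (assumption)
    return samples

synonyms = {"when": ["how long"], "credit card": ["overdraft card"]}
corpus = [(["when", "credit card", "repaid"], "when the credit card is paid")]
samples = expand_corpus(corpus, synonyms)
for text, label in samples:
    print(text, "->", label)
```

One labeled question yields four labeled training samples here, which is how the expansion enlarges the effective training set.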

Step S350, training the multi-classification model through a support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label.

Specifically, the support vector machine algorithm is a machine learning algorithm for pattern recognition and classification. Its main idea is to establish an optimal decision hyperplane that maximizes the distance between the hyperplane and the samples of the two classes closest to it on either side, which gives good generalization ability on classification problems. For a multidimensional sample set, the system randomly generates a hyperplane and moves it continuously, classifying the samples, until the training sample points belonging to different classes lie on opposite sides of the hyperplane. Several hyperplanes may satisfy this condition; the support vector machine algorithm finds, while preserving classification accuracy, the one that maximizes the blank region on both sides of the hyperplane, thereby achieving optimal classification of linearly separable samples. The support vector machine algorithm is a supervised training method. In one embodiment, the multi-classification model is formed by connecting a plurality of sub-binary classification models.
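The maximum-margin idea described above is commonly written as the following optimization problem for linearly separable samples. The symbols do not appear in the source and are introduced here only for illustration: x_i is the feature vector of the i-th training sample, y_i ∈ {−1, +1} is its class, and w and b are the normal vector and offset of the hyperplane.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

The two boundary hyperplanes w^T x + b = ±1 are a distance 2/‖w‖ apart, so minimizing ‖w‖ maximizes the blank region on both sides of the decision hyperplane.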

Step S360, obtaining the trained target multi-classification model.

Specifically, a trained target multi-classification model is obtained through the training.

In one embodiment, as shown in FIG. 4, the multi-classification model includes a plurality of sub-binary classification models, and step S240 includes:

Step S241, a first sub-binary classification model in the multi-classification model is obtained as a current sub-binary classification model.

Specifically, a sub-binary classification model divides its classification result into two classes: one is a candidate category, and the other indicates that the input does not belong to that candidate category. In the latter case, the input must be passed to the next sub-binary classification model to judge whether it belongs to another candidate category. The total number of candidate categories is the same as the number of sub-binary classification models. The expanded classification information set is input into the sub-binary classification models in sequence, and the target category is determined from their outputs.

Step S242, inputting the expanded classification information set into the current sub-binary classification model to obtain corresponding current sub-category information, determining whether to input a next sub-binary classification model according to the current sub-category information, if so, obtaining the next sub-binary classification model, and returning the next sub-binary classification model as the current sub-binary classification model to the step of inputting the expanded classification information set into the current sub-binary classification model in step S242. If not, the process proceeds to step S243.

Specifically, the current sub-binary classification model corresponds to a current candidate category. If the current sub-category information indicates that the information to be classified corresponding to the expanded classification information set belongs to the current candidate category, the next sub-binary classification model does not need to be used. If the current sub-category information indicates that the information to be classified does not belong to the current candidate category, the expanded classification information set needs to be input into the next sub-binary classification model to determine whether the information belongs to the next candidate category.

Step S243, using the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.

Specifically, when the next sub-binary classification model does not need to be used, the current category described by the current sub-category information is the target category corresponding to the information to be classified.
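Steps S241 to S243 describe a cascade that stops at the first sub-binary classification model that accepts the input. The sketch below shows only that control flow. The keyword-based stand-in models, the way a model judges a whole expanded set (accept if any member matches), and the `None` result when no model accepts are all assumptions; the source does not specify them.

```python
def classify_cascade(expanded_set, sub_models):
    """sub_models: ordered list of (candidate_category, belongs) pairs, where
    belongs(expanded_set) returns True if the input is judged to belong to
    that candidate category."""
    for category, belongs in sub_models:
        if belongs(expanded_set):
            return category      # S243: stop at the first accepting model
        # otherwise continue with the next sub-binary model (S242)
    return None                  # assumption: no candidate category matched

def keyword_model(keyword):
    """Hypothetical stand-in for a trained sub-binary classification model."""
    return lambda expanded_set: any(keyword in text for text in expanded_set)

sub_models = [
    ("loan", keyword_model("loan")),
    ("credit card", keyword_model("card")),
]
target = classify_cascade(["when card repaid", "how long card repaid"], sub_models)
print(target)
```

Because evaluation stops at the first accepting model, the order of the sub-binary classification models determines which category wins when more than one would accept the input.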

In one embodiment, as shown in fig. 5, step S350 includes:

Step S351, acquiring a feature item, and calculating the word frequency weight of the feature item in the extended training classification information corresponding to the first category.

Specifically, the feature item may be any word in the extended training classification information corresponding to the first category. The word frequency weight refers to the frequency with which the feature item occurs in the extended training classification information corresponding to the first category. It can be understood that, because the extended training classification information is built from synonyms, synonyms of the feature item may also appear in it. The word frequency weight is typically normalized and may be expressed as TF_ij, where i is the identifier of the feature item and j is the category identifier.

Step S352, calculating the document frequency of the feature item in the whole corpus data.

Specifically, the document frequency DF_i measures the general importance of a word. It can be obtained by dividing the number of pieces of expanded training classification information in which the feature item appears by the total number of training corpus information in the training corpus data.

Step S353, calculating a feature weight corresponding to the feature item according to the word frequency weight and the document frequency, and selecting the feature item as a feature word of the first category according to the feature weight.

Specifically, the more times the feature item occurs in a piece of information, the greater its influence on that information; that is, the feature weight is proportional to the word frequency weight. The more pieces of information the feature item appears in, the smaller its effect on information classification; that is, the feature weight is inversely proportional to the document frequency. In one embodiment, the feature weights

Where N represents the total number of all corpus information in the corpus data.

If the feature weight exceeds a preset threshold, the feature item is an important word of the information and can be used as a feature word of that category.

Step S354, extracting features of each expanded training classification information in the expanded training classification information set according to the feature words.

Specifically, the features of each extended training classification information in the extended training classification information set may be extracted according to each determined feature word. For one category of information, there may be one or more feature words.
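Steps S351 to S354 can be illustrated with a minimal sketch. The weighting used here, TF_ij multiplied by log(N / n_i) with n_i the number of pieces of information containing the feature item, is an assumed common TF-IDF form that matches the stated proportionality; it is not necessarily the formula of the embodiment. The function names, the count-based feature vector, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def feature_weights(category_docs, all_docs):
    """Weight each word of one category by TF_ij * log(N / n_i) (assumed form).

    category_docs: token lists of the information belonging to category j.
    all_docs: token lists of the whole training corpus.
    """
    n_total = len(all_docs)  # N: total number of pieces of corpus information
    counts = Counter(tok for doc in category_docs for tok in doc)
    total = sum(counts.values())
    weights = {}
    for word, count in counts.items():
        tf = count / total  # normalized word frequency weight TF_ij
        # n_i: number of pieces of information in which the word appears.
        n_i = max(1, sum(1 for doc in all_docs if word in doc))
        weights[word] = tf * math.log(n_total / n_i)
    return weights


def select_feature_words(weights, threshold):
    """Keep the feature items whose weight exceeds the preset threshold."""
    return sorted(w for w, v in weights.items() if v > threshold)


def extract_features(doc, feature_words):
    """Assumed feature vector: occurrence count of each feature word."""
    return [doc.count(w) for w in feature_words]


# Hypothetical toy corpus, already segmented into words.
corpus = [["open", "account"], ["open", "card"],
          ["repay", "card"], ["hello", "there"]]
first_category = corpus[:2]

weights = feature_weights(first_category, corpus)
feature_words = select_feature_words(weights, threshold=0.2)
vector = extract_features(["open", "card"], feature_words)
```

In this toy corpus "card" appears in half of all entries, so its weight stays below the threshold and it is not selected as a feature word.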

In one embodiment, as shown in fig. 6, before step S240, the method further includes:

Step S410, inputting the expanded classification information set into the trained binary classification model to obtain an initial category corresponding to the information to be classified, and inputting the information to be classified into a first module when the initial category is a first preset category.

Specifically, the trained binary classification model selects between two categories to obtain the initial category. By dividing the input between the two candidate categories, step S240 is performed only when the preset category is met. The information to be classified is thereby screened, the multi-class classification process is entered only when the condition is satisfied, invalid information is kept out of the subsequent classification process, and classification efficiency is improved. If the initial category is the first preset category, the subsequent multi-class classification process need not be entered, and the information to be classified only needs to be input into the first module. The function of the first module can be customized as required.

When the initial category is the second preset category, the process proceeds to step S240.

Specifically, only when the initial category is the second preset category, the process proceeds to step S240 to perform a subsequent multi-classification process, and determine the target category.
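The screening performed before step S240 can be sketched as follows. This is a minimal illustration only: the models are passed in as plain callables, and the names `route`, `NON_SERVICE`, `SERVICE`, and the keyword-based toy models are hypothetical stand-ins for the trained models of the embodiment.

```python
NON_SERVICE = "non-service"  # stands for the first preset category
SERVICE = "service"          # stands for the second preset category


def route(info, expanded_set, binary_model, multi_model, first_module):
    """Run the binary model first; enter multi-classification only for SERVICE."""
    initial_category = binary_model(expanded_set)
    if initial_category == NON_SERVICE:
        # The original information is handed to the customizable first module.
        first_module(info)
        return None
    # Second preset category: continue with the multi-classification step.
    return multi_model(expanded_set)


# Hypothetical toy models for demonstration.
def toy_binary_model(expanded_set):
    return SERVICE if any("card" in s for s in expanded_set) else NON_SERVICE


def toy_multi_model(expanded_set):
    return "card services"


handled_by_first_module = []

chat_result = route("nice weather", ["nice weather"],
                    toy_binary_model, toy_multi_model,
                    handled_by_first_module.append)
service_result = route("lost my card", ["lost my card", "lost my bank card"],
                       toy_binary_model, toy_multi_model,
                       handled_by_first_module.append)
```

Keeping the first module as a parameter reflects the statement that its function can be customized as required.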

In one embodiment, the first preset category is a non-service category, the second preset category is a service category, and the step of obtaining the information to be classified includes: acquiring banking questions or chat information input by a user in real time.

Specifically, the service category indicates a category that is related to a service and can be classified. The non-service category represents a category that is unrelated to any service and does not need to enter the multi-class classification process. The service may be a purchase service, a banking service, a financing service, a reservation service, a communication service, etc. The server can receive a classification request sent by the terminal, where the classification request carries the banking question or chat information input by the user in real time. The terminal can receive the banking question or chat information input by the user in real time through a search box interface. Because user input is highly random, the binary classification model is used to determine that chat information belongs to the first preset category; such information does not need the subsequent service-category classification process, which keeps information unrelated to the service out of the classification process and improves classification efficiency. A banking question is a question related to banking business, and each question has a corresponding answer that helps the user resolve difficulties encountered in handling banking business.

In a specific embodiment, the information classification method comprises the following specific steps:

1. The corpus data used is the question set of a bank from an actual project. The data volume is 479 questions, there are 74 existing standard categories, and each category comprises questions such as:

['Do you open the account', 'what you go through the card', 'when the card is paid back', ……]. The categories and their corresponding quantities are converted into a structure of the form Counter [56:36, 46:35, 42:23, 36:22, ……], where 56 in 56:36 represents the category identifier and 36 represents the number of questions under this category.

2. The original classified data is preprocessed and checked: individual similar categories are merged and categories with an extremely small amount of data are deleted, finally leaving 476 questions in 57 categories. Merging categories and deleting categories with an extremely small amount of data can improve the accuracy after model training. In the experiment, data classification is performed according to actual requirements based on a support vector machine algorithm, TF-IDF is used for feature extraction, and jieba is used as the Chinese word segmentation tool. The original data and the preprocessed data are each randomly split into a 70% training set and a 30% test set for 10 rounds of cross validation, and the average accuracy is computed. The results are as follows: the classification accuracy obtained by testing after training on the original data is 0.422222222222, and the classification accuracy obtained by testing after training on the preprocessed data is 0.467832167832.

3. The corpus data is expanded through the expansion method of the embodiment to obtain an expanded training classification information set corresponding to each corpus information, and the multi-classification model is trained through a support vector machine algorithm according to the expanded training classification information set corresponding to each corpus information and the corresponding standard class labels to obtain a trained target multi-classification model. The generated category data is checked and individual questions with ungrammatical wording are deleted, finally yielding 5225 expanded questions, so that the number of questions is greatly increased.

4. Data classification is performed according to actual requirements based on a support vector machine algorithm, with TF-IDF used for feature extraction and jieba as the Chinese word segmentation tool. The extended classification information set is randomly split into a 70% training set and a 30% test set for 10 rounds of cross validation, and the average accuracy is computed. The result is as follows: the classification accuracy obtained by testing after training on the expanded data is 0.9435275. Therefore, the classification accuracy is greatly improved after data expansion.
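The evaluation procedure in the steps above (counting questions per category, deleting categories with very little data, and averaging accuracy over 10 random 70%/30% splits) can be sketched with the standard library alone. This is a hedged illustration, not the experiment itself: the support vector machine and TF-IDF stages are replaced by a placeholder majority-class model so that the sketch stays dependency-free, and all function names, the `min_count` parameter, and the sample data are hypothetical.

```python
import random
from collections import Counter


def drop_rare_categories(samples, min_count):
    """Delete samples whose category holds fewer than min_count questions."""
    counts = Counter(label for _, label in samples)
    return [(q, y) for q, y in samples if counts[y] >= min_count]


def average_accuracy(samples, train_fn, rounds=10, train_ratio=0.7, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train, test = shuffled[:cut], shuffled[cut:]
        predict = train_fn(train)  # train_fn returns a prediction function
        correct = sum(1 for q, y in test if predict(q) == y)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def train_majority(train):
    """Placeholder model: always predicts the most frequent training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda question: majority


# Hypothetical labelled questions: (question text, category identifier).
samples = ([("q%d" % i, 56) for i in range(8)]
           + [("r%d" % i, 46) for i in range(2)]
           + [("s0", 42)])

category_counts = Counter(label for _, label in samples)
kept = drop_rare_categories(samples, min_count=2)
accuracy = average_accuracy(kept, train_majority)
```

To reproduce the setup described in the text, `train_fn` would instead fit a TF-IDF feature extractor and a support vector machine on the training split and return its prediction function.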

In a specific embodiment, the corpus data includes an original chat corpus, which is a man-machine dialogue corpus containing 500,000 (50W) questions and answers. Redundant spaces in each question are processed and the parts are joined by commas to form a complete question; only questions with a length greater than or equal to 5 are kept, and the resulting chat corpus totals 5308 sentences. The total number of business questions is 5225. The final corpus data therefore comprises 5308 chat sentences and 5225 business questions, and the binary classification model is trained on this corpus data to obtain the trained binary classification model, which is used to distinguish chat information from business questions. Data classification is performed according to actual requirements based on a support vector machine algorithm, with TF-IDF used for feature extraction and jieba as the Chinese word segmentation tool. The mixed corpus data is randomly split into a 70% training set and a 30% test set for 10 rounds of cross validation, and the average accuracy is computed. The result is as follows: the binary classification accuracy obtained by testing after training on the mixed corpus data is 0.994398734177. It can be seen that the binary classification model can accurately distinguish business questions from chat information, so that non-business information is removed.
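The cleaning of the chat corpus and the construction of the mixed corpus can be sketched as below. Several details are assumptions rather than statements of the embodiment: length is measured in characters after cleaning, an ASCII comma is used as the joining character, and the label strings and function names are hypothetical.

```python
def clean_question(text):
    """Collapse runs of whitespace and join the remaining parts with commas."""
    return ",".join(text.split())


def build_chat_corpus(questions, min_len=5):
    """Keep only cleaned questions whose length is at least min_len."""
    cleaned = (clean_question(q) for q in questions)
    return [q for q in cleaned if len(q) >= min_len]


def build_mixed_corpus(chat_questions, business_questions):
    """Label both sources so that a binary model can be trained on them."""
    return ([(q, "chat") for q in chat_questions]
            + [(q, "business") for q in business_questions])


# Hypothetical raw inputs.
raw_chat = ["how  are you", "hi", " ok   then "]
chat_corpus = build_chat_corpus(raw_chat)
mixed = build_mixed_corpus(chat_corpus, ["how do I open an account"])
```

The labelled pairs in `mixed` are the kind of input a 70%/30% random split and a binary classifier would then consume.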

It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least a portion of the steps in the above-described flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided an information classification apparatus including:

a word segmentation module 510, configured to obtain information to be classified, and perform word segmentation on the information to be classified to obtain a corresponding original word set;

an expansion module 520, configured to obtain synonyms corresponding to each original word in the original word set, form an expanded word set from the original words and the corresponding synonyms, where each original word has a corresponding expanded word set, and form an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to each original word; and

a category determining module 530, configured to input the expanded classification information set into the trained multi-classification model to obtain a target category corresponding to the information to be classified.
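The cooperation of the three modules can be sketched as follows. The sketch assumes that the expanded classification information set is formed by taking one candidate from each expanded word set in every possible combination; this is one plausible reading of the expansion step, not a confirmed detail. Word segmentation is assumed to have been done already (for example with a tool such as jieba), and the function names, the `sep` parameter, the synonym table, and the toy model are hypothetical.

```python
from itertools import product


def expand_words(words, synonyms):
    """One expanded word set per original word: the word plus its synonyms."""
    return [[w] + list(synonyms.get(w, [])) for w in words]


def expand_information(words, synonyms, sep=""):
    """Combine one candidate from each expanded word set (assumed method)."""
    expanded_sets = expand_words(words, synonyms)
    return [sep.join(combo) for combo in product(*expanded_sets)]


def classify_information(words, synonyms, model, sep=""):
    """Expansion module followed by the category determining module."""
    return model(expand_information(words, synonyms, sep))


# Hypothetical segmented input and synonym table.
original_words = ["open", "account"]
synonym_table = {"open": ["create"], "account": ["profile"]}

expanded = expand_information(original_words, synonym_table, sep=" ")


def toy_model(expanded_set):
    return "account opening" if any("account" in s for s in expanded_set) else "other"


target = classify_information(original_words, synonym_table, toy_model, sep=" ")
```

With two candidates for each of the two words, the expanded set contains four phrasings; for Chinese text the default empty separator would be used.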

In one embodiment, as shown in fig. 8, the apparatus further comprises:

a training module 540, configured to obtain corpus data, where the corpus data includes a plurality of corpus information, each corpus information has a corresponding standard category label, each corpus information is segmented into original training word sets corresponding to each corpus information, synonyms corresponding to each original training word in the original training word sets are respectively obtained, the original words and the corresponding synonyms form extended training word sets, each original training word has a corresponding extended training word set, an extended training classification information set corresponding to each corpus information is formed according to the extended training word set corresponding to each original training word, and a multi-classification model is trained through a support vector machine algorithm according to the extended training classification information sets corresponding to each corpus information and the corresponding standard category labels, and obtaining the trained target multi-classification model.

In one embodiment, as shown in fig. 9, the multi-classification model includes a plurality of sub-binary classification models, and the category determination module 530 includes:

a current sub-binary model determining unit 530a, configured to obtain a first sub-binary classification model in the multi-classification model as a current sub-binary classification model;

the current sub-category information determining unit 530b is configured to input the extended classification information set into the current sub-binary model to obtain corresponding current sub-category information, determine whether to input a next sub-binary model according to the current sub-category information, if so, obtain the next sub-binary model, use the next sub-binary model as the current sub-binary model, return to input the extended classification information set into the current sub-binary model, and otherwise, enter the target category determining unit 530 c.

a target category determining unit 530c, configured to take the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.
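The loop carried out by units 530a to 530c can be sketched as a cascade. The sketch assumes that each sub-binary classification model decides whether the input belongs to one candidate category, and that the next model is consulted only when the current one answers no. Returning `None` when no model accepts the input is an added assumption, and the names and keyword-based toy models are hypothetical.

```python
def classify_cascade(expanded_set, sub_models):
    """Try each sub-binary model in order until one accepts the input.

    sub_models: ordered list of (candidate_category, belongs) pairs, where
    belongs(expanded_set) returns True if the input is in that category.
    """
    for category, belongs in sub_models:
        if belongs(expanded_set):
            # Current sub-category information names the target category.
            return category
        # Otherwise the next sub-binary model becomes the current one.
    return None  # assumption: no candidate category matched


def contains(keyword):
    """Build a toy sub-binary model that looks for a keyword."""
    return lambda expanded_set: any(keyword in s for s in expanded_set)


# Hypothetical cascade with one sub-binary model per candidate category.
sub_models = [("account opening", contains("account")),
              ("card services", contains("card")),
              ("repayment", contains("repay"))]

first = classify_cascade(["lost my card", "lost my bank card"], sub_models)
second = classify_cascade(["nice weather"], sub_models)
```

The number of candidate categories equals the number of sub-binary models in the list, and the order of the list fixes the order in which the models are consulted.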

In an embodiment, the training module 540 is further configured to obtain a feature item, calculate a word frequency weight of the extended training classification information corresponding to the feature item in the first category, calculate a document frequency of the feature item in the entire training corpus data, calculate a feature weight corresponding to the feature item according to the word frequency weight and the document frequency, select the feature item as a feature word of the first category according to the feature weight, and extract features of each extended training classification information in the extended training classification information set according to the feature word.

In one embodiment, as shown in fig. 10, the apparatus further comprises:

a classification module 550, configured to input the expanded classification information set into a trained binary classification model to obtain an initial category corresponding to the information to be classified, input the information to be classified into the first module when the initial category is a first preset category, and enter the category determining module 530 when the initial category is a second preset category.

In one embodiment, the first preset category is a non-business category, the second preset category is a business category, and the word segmentation module 510 is further configured to obtain banking questions or chat information input by the user in real time.

The modules in the information classification device can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement the information classification method described in the above embodiments.

Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program: the method comprises the steps of obtaining information to be classified, carrying out word segmentation on the information to be classified to obtain a corresponding original word set, respectively obtaining synonyms corresponding to all original words in the original word set, enabling the original words and the corresponding synonyms to form an expanded word set, enabling each original word to have a corresponding expanded word set, forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to all the original words, and inputting the expanded classification information set into a trained multi-classification model to obtain a target class corresponding to the information to be classified.

In one embodiment, the generation of the trained multi-classification model comprises: obtaining training corpus data, wherein the training corpus data comprises a plurality of training corpus information, each training corpus information has a corresponding standard class label, segmenting each training corpus information to obtain an original training word set corresponding to each training corpus information, respectively obtaining synonyms corresponding to each original training word in the original training word set, forming an expanded training word set by the original words and the corresponding synonyms, wherein each original training word has a corresponding expanded training word set, forming an expanded training classification information set corresponding to each training corpus information according to the expanded training word set corresponding to each original training word, and training the multi-classification model through a support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label to obtain the trained target multi-classification model.

In one embodiment, the multi-classification model includes a plurality of sub-binary classification models, and inputting the extended classification information set into the trained multi-classification model to obtain the target category corresponding to the information to be classified includes: obtaining a first sub-binary classification model in the multi-classification model as a current sub-binary classification model, inputting the expanded classification information set into the current sub-binary classification model to obtain corresponding current sub-category information, judging whether to input a next sub-binary classification model according to the current sub-category information, if so, obtaining the next sub-binary classification model, using the next sub-binary classification model as the current sub-binary classification model, and returning to the step of inputting the expanded classification information set into the current sub-binary classification model, and if not, using the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.

In one embodiment, training the multi-classification model through the support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label includes: the method comprises the steps of obtaining a feature item, calculating the word frequency weight of the extension training classification information corresponding to the feature item in a first category, calculating the document frequency of the feature item in the whole training corpus data, calculating the feature weight corresponding to the feature item according to the word frequency weight and the document frequency, selecting the feature item as a feature word of the first category according to the feature weight, and extracting the feature of each extension training classification information in an extension training classification information set according to the feature word.

In one embodiment, before the processor inputs the extended classification information set into the trained multi-classification model to obtain the target class corresponding to the information to be classified, the processor further executes a computer program to implement the following steps: inputting the expanded classification information set into a trained binary classification model to obtain an initial class corresponding to the information to be classified, and inputting the information to be classified into a first module when the initial class is a first preset class; and when the initial category is a second preset category, entering a step of inputting the expanded classification information set into the trained multi-classification model to obtain a category corresponding to the information to be classified.

In one embodiment, the first preset category is a non-service category, the second preset category is a service category, and the obtaining of the information to be classified includes: acquiring banking questions or chat information input by a user in real time.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: the method comprises the steps of obtaining information to be classified, carrying out word segmentation on the information to be classified to obtain a corresponding original word set, respectively obtaining synonyms corresponding to all original words in the original word set, enabling the original words and the corresponding synonyms to form an expanded word set, enabling each original word to have a corresponding expanded word set, forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to all the original words, and inputting the expanded classification information set into a trained multi-classification model to obtain a target class corresponding to the information to be classified.

In one embodiment, the generation of the trained multi-classification model comprises the steps of: obtaining training corpus data, wherein the training corpus data comprises a plurality of training corpus information, each training corpus information has a corresponding standard class label, segmenting each training corpus information to obtain an original training word set corresponding to each training corpus information, respectively obtaining synonyms corresponding to each original training word in the original training word set, forming an expanded training word set by the original words and the corresponding synonyms, wherein each original training word has a corresponding expanded training word set, forming an expanded training classification information set corresponding to each training corpus information according to the expanded training word set corresponding to each original training word, and training the multi-classification model through a support vector machine algorithm according to the extended training classification information set corresponding to each training corpus information and the corresponding standard class label to obtain the trained target multi-classification model.

In one embodiment, before the computer program is executed by the processor to input the extended classification information set into the trained multi-classification model to obtain the target class corresponding to the information to be classified, the following steps are further implemented: inputting the expanded classification information set into a trained binary classification model to obtain an initial class corresponding to the information to be classified, and inputting the information to be classified into a first module when the initial class is a first preset class; and when the initial category is a second preset category, entering a step of inputting the expanded classification information set into the trained multi-classification model to obtain a category corresponding to the information to be classified.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of information classification, the method comprising:

acquiring information to be classified, wherein the information to be classified is banking business problems or chat information input by a user in real time, and performing word segmentation on the information to be classified to obtain a corresponding original word set;

inputting the expanded classification information set into a trained binary classification model to obtain an initial category corresponding to the information to be classified, and inputting the information to be classified into a first module when the initial category is a first preset category, wherein the first preset category is a non-service category, the second preset category is a service category, the service category represents a category which is related to service and is classified, and the service comprises purchase service, banking service, financing service, reservation service or communication service;

when the initial category is a second preset category, inputting the extended classification information set into a trained multi-classification model to obtain a target category corresponding to the information to be classified, wherein the multi-classification model comprises a plurality of sub-binary classification models, and a first sub-binary classification model in the multi-classification model is obtained and used as a current sub-binary classification model; inputting the expanded classification information set into the current sub-binary classification model to obtain corresponding current sub-category information, judging whether to input a next sub-binary classification model according to the current sub-category information, if so, acquiring the next sub-binary classification model, taking the next sub-binary classification model as the current sub-binary classification model, and returning to the step of inputting the expanded classification information set into the current sub-binary classification model; and if not, taking the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.

2. The method of claim 1, wherein the step of generating the trained multi-class model comprises:

and obtaining the trained target multi-classification model.

3. The method of claim 1, wherein the sub-binary classification model means that the classification result is two classes, one of the classes is a candidate class, the other class indicates that the input does not belong to the candidate class, and the total number of target candidate classes is the same as the number of sub-binary classification models.

4. The method according to claim 2, wherein the step of training the multi-classification model through the support vector machine algorithm according to the extended training classification information set and the corresponding standard class label corresponding to each training corpus information comprises:

5. An information classification apparatus, characterized in that the apparatus comprises:

the word segmentation module is used for acquiring information to be classified, wherein the information to be classified is banking business problems or chat information input by a user in real time, and performing word segmentation on the information to be classified to obtain a corresponding original word set;

an expansion module, configured to obtain synonyms corresponding to each original word in the original word set, respectively, form an expanded word set from the original words and the corresponding synonyms, where each original word has a corresponding expanded word set, forming an expanded classification information set corresponding to the information to be classified according to the expanded word set corresponding to each original word, inputting the expanded classification information set into a trained binary classification model to obtain an initial class corresponding to the information to be classified, when the initial category is a first preset category, inputting the information to be classified into a first module, the first preset category is a non-business category, the second preset category is a business category, the business category represents a category which is related to business and is classified, and the business comprises purchase business, banking business, financing business, reservation business or communication business;

a category determining module, configured to, when the initial category is a second preset category, input the extended classification information set into a trained multi-classification model to obtain a target category corresponding to the information to be classified, where the multi-classification model includes a plurality of sub-binary classification models, and the category determining module includes: a current sub-binary classification model determining unit, configured to obtain a first sub-binary classification model in the multi-classification model as a current sub-binary classification model; a current sub-category information determining unit, configured to input the expanded classification information set into the current sub-binary classification model to obtain corresponding current sub-category information, judge whether to input a next sub-binary classification model according to the current sub-category information, if so, acquire the next sub-binary classification model, take the next sub-binary classification model as the current sub-binary classification model, and return to inputting the expanded classification information set into the current sub-binary classification model, and otherwise, enter the target category determining unit; and a target category determining unit, configured to take the category corresponding to the current sub-category information as the target category corresponding to the information to be classified.

6. The apparatus of claim 5, further comprising:

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 4 are implemented when the computer program is executed by the processor.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.