CN107766371B - Text information classification method and device

Info

Publication number: CN107766371B
Application number: CN201610693358.6A
Authority: CN
Inventors: 周晶
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2016-08-19
Filing date: 2016-08-19
Publication date: 2023-11-17
Anticipated expiration: 2036-08-19
Also published as: CN107766371A; WO2018032937A1

Abstract

Embodiments of the invention provide a text information classification method and device. A sample keyword information set is preset for each text category, and a correspondence is established between each sample keyword information set and its text category information, which provides the matching basis for subsequently classifying text information. When text information is to be classified, keyword information is extracted from it according to a preset rule, and the corresponding text category information is matched according to the correspondence between the sample keyword information sets and the text categories. Because classification requires only automatic matching by the system, classification efficiency is greatly improved, the analysis period is shortened, manual assignment errors are reduced, and matching accuracy is improved.

Description

Text information classification method and device

Technical Field

The invention relates to the technical field of text information classification, in particular to a text information classification method and a text information classification device.

Background

With the development of information classification technology, information processing departments in enterprises receive or accumulate massive amounts of information every day. In some cases, information of a certain category needs to be extracted from it, but because no direct correspondence has been established between the information and its category, the information cannot be retrieved directly with a search engine. Existing classification methods usually analyze the information manually, piece by piece, which consumes a great deal of labor. Moreover, as the volume of exchanged information and accumulated work grows every day, processing the information within the same time and at high quality requires either faster staff or more staff. A manual approach can hardly meet efficiency and quality requirements at the same time: because classification is done by people, there is no guarantee that every staff member understands the information categories in the same way, and the recall ratio also varies from person to person, so classification accuracy is low.

Disclosure of Invention

The text information classification method and the text information classification device provided by the embodiment of the invention aim to solve the technical problems of long analysis period, low working efficiency and low recall ratio caused by classifying and processing text information mainly by a manual mode in the prior art.

In order to solve the above technical problems, an embodiment of the present invention provides a text information classification method, including:

acquiring text information to be classified;

extracting a keyword information set from the text information to be classified according to a preset rule, wherein the keyword information set comprises at least one keyword information;

matching text category information corresponding to the keyword information set according to the keyword information set and the corresponding relation between the preset sample keyword information set and the text category information;

and classifying the text information to be classified according to the matched text category information.
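The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the claimed implementation: the whitespace tokenizer, the `SAMPLE_SETS` table, and the rule of choosing the category with the most keyword hits are all hypothetical choices made for the example.

```python
import re

# Hypothetical sample keyword sets, one per text category (illustrative only).
SAMPLE_SETS = {
    "banking transaction": ["bank", "consumption", "account"],
    "meal office": ["hotel", "box", "come"],
}

def extract_keywords(text):
    """Step 2: drop punctuation, then split on whitespace (stand-in for a real segmenter)."""
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def match_category(keywords, sample_sets):
    """Step 3: return the category whose sample set shares the most keywords, or None."""
    best, best_hits = None, 0
    for category, sample in sample_sets.items():
        hits = sum(1 for word in sample if word in keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best

def classify(text, sample_sets=SAMPLE_SETS):
    """Steps 1 to 4: take the text, extract keywords, match, and return (category, text)."""
    keywords = extract_keywords(text)
    return match_category(keywords, sample_sets), text

category, _ = classify("Evening, Xinxin hotel, box No. 2, 18:00, Wang Zong also come")
print(category)
```

Text that shares no keyword with any sample set is returned with category `None`, which leaves the handling of unmatched text to the caller.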

The embodiment of the invention also provides a text information classification device, which comprises an acquisition module, an extraction module, a matching module and a classification module;

the acquisition module is used for acquiring text information to be classified;

the extraction module is used for extracting a keyword information set from the text information to be classified according to a preset rule, wherein the keyword information set comprises at least one piece of keyword information;

the matching module is used for matching text category information corresponding to the keyword information set according to the keyword information set and the preset correspondence between sample keyword information sets and text category information;

the classification module is used for classifying the text information to be classified according to the matched text category information.

The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium is stored with computer executable instructions, and the computer executable instructions are used for executing the text information classification method.

The beneficial effects of the invention are as follows:

According to the text information classification method, device and computer storage medium provided by the embodiments of the invention, a sample keyword information set is preset for each text category and a correspondence is established between each sample keyword information set and the text category information. This provides the matching basis for subsequently classifying text information and makes automatic matching classification possible. Furthermore, when text information is classified, a keyword information set is extracted from it according to the preset rule; the extracted information, which reflects the text category, is matched against the preset sample keyword information sets to obtain the corresponding text category information. The system thus identifies and matches the text information automatically, which greatly improves classification efficiency and shortens the analysis period. Because matching relies on a fixed correspondence between sample keyword information sets and categories, manual assignment errors are reduced and matching accuracy is improved.

Drawings

FIG. 1 is a flowchart of a text information classification method according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a process in which a user classifies text information through a client using the text information classification method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of expanding the keyword information set of a text category according to the second embodiment of the present invention;

FIG. 4 is a flowchart of the learning process of a classification model according to the second embodiment of the present invention;

FIG. 5 is a schematic diagram of implementing text information classification through interaction between a browser and a server according to the second embodiment of the present invention;

FIG. 6 is a flowchart of a process for classifying a single piece of text information according to a third embodiment of the present invention;

FIG. 7 is a flowchart of a process for classifying batch text information according to the third embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a text information classification device according to a fourth embodiment of the present invention.

Detailed Description

The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

First embodiment:

To solve the prior-art problems of low working efficiency and low accuracy caused by classifying information manually, this embodiment discloses a text information classification method and system. A keyword information set is extracted from the acquired text information to be classified according to a preset rule; the text category information corresponding to that keyword information set is matched according to the extracted keyword information set and the preset correspondence between sample keyword information sets and text category information; and the text information is finally classified according to the matched text category information. Text information is thereby classified automatically, which greatly improves working efficiency and classification accuracy.

Referring to FIG. 1, FIG. 1 is a flowchart of the text information classification method of this embodiment.

The text information classification method provided in this embodiment includes the following steps:

S101, obtaining text information to be classified.

Preferably, the acquired text information to be classified comprises at least one piece of text information, and the pieces may belong to the same text category or to different text categories.

In this embodiment, the text information to be classified may also be obtained by converting other types of information, such as voice or video. When the acquired information is voice, the corresponding text information is obtained during classification by converting the voice into text through a speech-to-text recognition plug-in; similarly, other types of non-text information are converted into the corresponding text information by a conversion plug-in.
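One way to picture the conversion plug-ins is a small registry keyed by media type. The sketch below is an assumption for illustration only: the `(kind, payload)` input shape, the `converters` dictionary, and `fake_speech_to_text` are hypothetical, and a real system would call an actual speech or video recognition engine.

```python
def to_text(item, converters):
    """Return text for an input item, converting non-text media via a plug-in."""
    kind, payload = item
    if kind == "text":
        return payload
    if kind not in converters:
        raise ValueError(f"no conversion plug-in registered for {kind!r}")
    return converters[kind](payload)

# Hypothetical plug-in: stands in for a real speech recognition engine.
def fake_speech_to_text(audio_bytes):
    return audio_bytes.decode("utf-8")

converters = {"voice": fake_speech_to_text}
print(to_text(("voice", b"box No. 2 at 18:00"), converters))
```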

S102, extracting a keyword information set from the text information to be classified according to a preset rule, wherein the keyword information set comprises at least one keyword information.

In this embodiment, after the text information to be classified is obtained, it is processed according to the preset rule. Specifically, the text information is segmented by a word segmentation technique into at least one piece of keyword information, and the keyword information obtained by segmentation is collected to form the keyword information set of the text information to be classified.

Preferably, when the text information is segmented, punctuation marks in the text information to be classified are removed first, and the keywords are then segmented in the original order of the text.

In this embodiment, at least one keyword is obtained after segmentation, but not every keyword contributes substantially to classification. Some words occur in all categories, such as modal particles, numerals, measure words and time words, which appear in almost all information exchanges.

For example, take the text information "Evening, Xinxin Hotel, box No. 2, 18:00, Wang Zong is also coming". Before segmentation, useless symbols such as punctuation marks and abnormal symbols are removed, giving "Evening Xinxin Hotel box No. 2 18:00 Wang Zong is also coming". Segmentation then yields "evening/Xinxin Hotel/No. 2/box/18:00/o'clock/Wang Zong/also/come". In this text information, the keyword information that can represent the category is "hotel" and "box", which are extracted to form the keyword information set.
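The extraction step can be sketched roughly as follows. Whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and `CATEGORY_WORDS` is a hypothetical list of category-bearing words chosen for this example.

```python
import re

# Hypothetical list of category-bearing words; everything else is treated as noise.
CATEGORY_WORDS = {"hotel", "box", "bank", "consumption"}

def remove_punctuation(text):
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)

def segment(text):
    """Split in original order; whitespace splitting stands in for a word segmenter."""
    return remove_punctuation(text).lower().split()

def keyword_set(text, category_words=CATEGORY_WORDS):
    """Keep only tokens that can indicate a category, preserving first-seen order."""
    keywords = []
    for token in segment(text):
        if token in category_words and token not in keywords:
            keywords.append(token)
    return keywords

print(keyword_set("Evening, Xinxin hotel, box No. 2, 18:00, Wang Zong also come"))
# ['hotel', 'box']
```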

S103, matching text category information corresponding to the keyword information set according to the keyword information set and the preset correspondence between sample keyword information sets and text category information.

After the keyword information set of the text information to be classified is extracted, the text information is classified according to it. Preferably, the preset sample keyword information set of each text category is obtained, and the keyword information set extracted from the text information to be classified is matched against each of them; the text category information is then obtained from the preset correspondence between sample keyword information sets and text categories. Specifically, each sample keyword information set is queried for the keyword information of the text information to be classified; if the keyword information is present, the corresponding sample keyword information set is marked, and the text category information is finally identified from the marks.

In this embodiment, the preset correspondence between keyword information and text categories is obtained as follows: a plurality of pieces of sample text information obtained in advance are classified; keyword information is extracted from each piece of sample text information in each text category to form the sample keyword information sets; and a correspondence is established between the sample keyword information set extracted from sample text information of the same text category and the text category information.

Specifically, the sample text information available in the system is obtained first. It may be historical text information in the system or classified text information templates that the system downloads from the Internet.

When the sample text information is historical text information in the system, staff first label each piece of sample text information with a category according to its content. The labeled sample text information is then counted by category and stored separately by category. Next, the sample text information of each category is segmented and a keyword information set is extracted for that category. Finally, the correspondence between each extracted keyword information set and its text category information is established.
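Building the sample keyword information sets from labeled samples might look like the sketch below. The sample texts, the `STOP_WORDS` list, and the whitespace tokenizer are assumptions for illustration; the description does not specify how common words are filtered.

```python
import re
from collections import defaultdict

# Hypothetical labeled samples: (text, category) pairs marked by staff.
LABELED_SAMPLES = [
    ("Your credit card consumption 19,089.59 yuan [Construction Bank]", "banking transaction"),
    ("Xinxin hotel box No. 2 at 18:00, Wang Zong also come", "meal office"),
]
# Assumed stop list of words that occur in every category.
STOP_WORDS = {"your", "no", "at", "also", "yuan"}

def segment(text):
    """Whitespace splitting after punctuation removal; a stand-in for a real segmenter."""
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def build_sample_sets(samples, stop_words=STOP_WORDS):
    """Group samples by label and collect an ordered keyword list per category."""
    sets = defaultdict(list)
    for text, category in samples:
        for token in segment(text):
            if token.isdigit() or token in stop_words:
                continue  # skip numerals and words common to all categories
            if token not in sets[category]:
                sets[category].append(token)
    return dict(sets)

sample_sets = build_sample_sets(LABELED_SAMPLES)
print(sample_sets["banking transaction"])
```

The resulting dictionary plays the role of the correspondence between each sample keyword information set and its text category information.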

In this embodiment, step S103 includes: matching each piece of keyword information in the keyword information set against the sample keyword information set of each preset text category, to obtain original first character strings in one-to-one correspondence with the sample keyword information sets, or an original second character string formed by arranging the original first character strings in a preset order. An original first character string consists of the characters 0 and/or 1, whose positions correspond one to one with the positions of the keyword information in the corresponding sample keyword information set: character 0 indicates that the sample keyword at that position does not occur in the keyword information of the text information to be classified, and character 1 indicates that it does. The text category information corresponding to the keyword information set is then identified from the obtained original character string.

Specifically, each preset sample keyword information set is queried for the keyword information of the text information to be classified. Where a match is found, the matching keyword in the sample keyword information set currently being queried is marked "exists", and the remaining keywords are marked "does not exist". When the query is complete, an original character string composed of the characters 0 and 1 is output, with the characters ordered according to the original order of the keyword information in the sample keyword information set. For example, suppose the sample keyword information set of the "banking transaction" category is ordered as [transfer-in, income, bank, expenditure, consumption, withdrawal, account number] and the keyword information set of the text information to be classified is "consumption", "bank", "account number". The 0s and 1s are output in the order of the sample set, giving the character string [0 0 1 0 1 0 1].
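The construction of an original first character string can be illustrated with a few lines of Python. The keyword order in `BANK_SET` is an assumption, reconstructed from the banking keyword list given later in the description so that it reproduces the seven-character string of the example.

```python
def match_string(keywords, sample_set):
    """Return one character per sample keyword: '1' if present in `keywords`, else '0'."""
    present = set(keywords)
    return "".join("1" if word in present else "0" for word in sample_set)

# Assumed keyword order for the "banking transaction" category.
BANK_SET = ["transfer-in", "income", "bank", "expenditure",
            "consumption", "withdrawal", "account number"]

print(match_string(["consumption", "bank", "account number"], BANK_SET))  # 0010101
```

The order of the input keywords does not matter; only the order of the sample set determines the character positions.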

Further, when the text category information is identified from the marks, the output character strings are analyzed and the category is obtained from the analysis result. Preferably, the character strings are analyzed by counting the "exists" marks in the keyword information set of each text category: the more marks a category receives, the more likely the text information belongs to that category.
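A minimal sketch of this count-based analysis follows. The tie-breaking rule (the first category seen wins) and the `None` result when nothing matches are assumptions; the description only says that more marks make a category more likely.

```python
def classify_by_hits(strings_by_category):
    """Pick the category whose match string contains the most '1' characters."""
    best_category, best_hits = None, 0
    for category, bits in strings_by_category.items():
        hits = bits.count("1")
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category

# Hypothetical first character strings, one per category.
strings = {
    "banking transaction": "0000000",
    "meal office": "000101000001",
    "engineering project": "000000",
}
print(classify_by_hits(strings))  # meal office
```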

Further, in this embodiment, the original character string is either an original first character string or an original second character string. An original first character string is the string output by matching against the sample keyword information set of a single text category; an original second character string is the string output by matching against the sample keyword information sets of a plurality of text categories. When original first character strings in one-to-one correspondence with the sample keyword information sets are obtained, the original character string is the original first character string, and classification analysis is performed on it.

When an original second character string formed by arranging the original first character strings in the preset order is obtained, the original character string is the original second character string, and classification analysis is performed on it.

For example, three text categories, "banking transaction", "meal office" and "engineering project", are preset, and their keyword information sets are shown in Table 1 below:

Table 1 Correspondence between sample keyword information sets and text category information

Take the example above, "evening/Xinxin Hotel/No. 2/box/18:00/o'clock/Wang Zong/also/come". After the sentence is segmented, keyword matching is performed against the sample keyword information sets described above, for example against the large sample keyword information set formed from the three text categories "banking transaction", "meal office" and "engineering project": [transfer-in, income, bank, expenditure, consumption, withdrawal, account, credit card, transfer-out …… dining, box, hotel, restaurant, private room, evening party, wine shop, dining hall, ordering, gathering …… engineering payment, payment, accounting, repayment, loan, ten thousand ……].

Analysis of the text information "evening/Xinxin Hotel/No. 2/box/18:00/o'clock/Wang Zong/also/come" shows that the keyword information "hotel", "box" and "come" extracted from it is contained in the preset sample keyword information set.

The character strings output after the matching is completed are as follows:

【0 0 0 0 0 0 0 0 0……0 0 0 1 0 1 0 0 0 0 0 1……0 0 0 0 0 0 0 0 0……】

By analyzing the character string, the text category to which the text information belongs is obtained as "meal office".
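The concatenated (second) character string and its analysis can be sketched as below. The tiny `SAMPLE_SETS` table and the `ORDER` list are hypothetical stand-ins for the much larger sets of the example, and picking the slice with the most 1s is one possible analysis, not necessarily the one intended by the description.

```python
# Hypothetical, heavily shortened sample keyword sets.
SAMPLE_SETS = {
    "banking transaction": ["bank", "consumption", "account"],
    "meal office": ["dining", "hotel", "box", "come"],
    "engineering project": ["payment", "loan"],
}
ORDER = ["banking transaction", "meal office", "engineering project"]  # preset order

def second_string(keywords, sample_sets, order):
    """Concatenate per-category match strings in the preset category order."""
    present = set(keywords)
    return "".join(
        "1" if word in present else "0"
        for category in order
        for word in sample_sets[category]
    )

def split_by_category(bits, sample_sets, order):
    """Cut the concatenated string back into one slice per category."""
    slices, start = {}, 0
    for category in order:
        end = start + len(sample_sets[category])
        slices[category] = bits[start:end]
        start = end
    return slices

bits = second_string(["hotel", "box", "come"], SAMPLE_SETS, ORDER)
slices = split_by_category(bits, SAMPLE_SETS, ORDER)
best = max(ORDER, key=lambda category: slices[category].count("1"))
print(bits, best)  # 000011100 meal office
```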

To further improve the accuracy of the text category information matched to the text information to be classified, this embodiment further includes, after the keyword information set has been matched:

correcting the obtained original character string according to a classification model obtained by pre-learning to obtain a final character string, and replacing the original character string with the final character string.

That is, the keyword information set is first matched against the preset sample keyword information sets to obtain the character string of the text information to be classified; the character string is then processed by the classification model obtained by pre-learning; and the corresponding text category information is obtained from the model's result.

In this embodiment, the classification model is obtained as follows: acquiring the sample keyword information set of the sample text information of each text category; matching each keyword information set against the sample keyword information set of the corresponding text category and outputting the corresponding character string; learning the character strings with a preset training algorithm to obtain the classification model; and establishing a correspondence between the classification model and the text category information. Preferably, the training algorithm is a random forest classification learning algorithm, and the resulting classification model is a random forest classification model.
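Before any learner is trained, the sample character strings have to be arranged as feature rows with labels. The sketch below covers only that preparation step; `VOCABULARY` and `SAMPLES` are hypothetical, and no learner is fitted here.

```python
def to_features(keywords, vocabulary):
    """Encode a keyword set as a 0/1 list aligned with the combined vocabulary."""
    present = set(keywords)
    return [1 if word in present else 0 for word in vocabulary]

def build_training_set(samples, vocabulary):
    """Turn (keywords, category) pairs into feature rows X and labels y."""
    X = [to_features(keywords, vocabulary) for keywords, _ in samples]
    y = [category for _, category in samples]
    return X, y

# Hypothetical combined vocabulary and labeled keyword sets.
VOCABULARY = ["bank", "consumption", "hotel", "box", "loan"]
SAMPLES = [
    (["bank", "consumption"], "banking transaction"),
    (["hotel", "box"], "meal office"),
    (["loan"], "engineering project"),
]
X, y = build_training_set(SAMPLES, VOCABULARY)
print(X)
```

Rows in this shape can be passed to an off-the-shelf learner; with scikit-learn installed, for instance, `RandomForestClassifier().fit(X, y)` would train the preferred random forest model.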

S104, classifying the text information to be classified according to the matched text category information.

In this embodiment, when the keywords in the sample keyword information set of a text category are not comprehensive, matching may suffer from a low recall ratio and produce classification errors. To address this, the embodiment further includes, when the sample keyword information set of each text category is created: expanding the keyword information sets with domain vocabulary, so that the sample keyword information set of each text category covers the keywords of that category more comprehensively.

In this embodiment, the above steps may be implemented by a processor of a mobile terminal. Specifically, program code implementing the functions of the steps may be written into a memory and read and executed by the processor.

According to the text information classification method provided by this embodiment, a sample keyword information set is created for each text category from sample text information, and a correspondence between the sample keyword information set and the text category is established. When text information is classified, keyword information is extracted according to the preset rule and the corresponding text category information is matched according to the preset correspondence. Matching text category information through keywords simplifies the classification procedure and thereby solves the problems of long analysis period and low working efficiency caused by classifying text information manually.

Furthermore, this embodiment also classifies text information by means of a classification model, using training and learning to identify and classify the text information automatically.

Second embodiment:

Referring to FIG. 2, FIG. 2 is a flowchart of a process in which a user classifies text information through a client using the text information classification method of this embodiment.

This embodiment is a text information classification method that combines a client with a specific application scenario. The processing steps are as follows:

S201, obtaining sample text information on the client, labeling it by category, and establishing the correspondence between sample keyword information sets and text category information.

In this step, text information is collected and used as sample text information for creating the keyword information set of each text category. The text information may be historical text information previously received by the client, chat information from applications such as WeChat or QQ, or previously classified chat information on other applications or terminals.

In this embodiment, it is assumed that the acquired sample text information is as follows:

"Engineering payment of 1,000,000."

"Your credit card ending in 44XX: consumption of 19,089.59 yuan on 02-21 at 13:56 [Construction Bank]."

"Shangyuan Festival greetings: the snacks are tasty, and hearts and people are reunited."

"China Mobile reminds you: overcast with rain tonight, gradually stopping; overcast turning cloudy tomorrow."

"Evening, Xinxin Hotel, box No. 2, 18:00, Wang Zong is also coming."

The obtained sample text information is then labeled in order to distinguish its categories; the labeling may be manual or automatic.

In essence, labeling solidifies the knowledge of industry experts and fixes the correspondence between categories and keywords, so that text information can later be classified more accurately. In this embodiment, the above sample text information is labeled with three categories, "banking transaction", "engineering project" and "meal office"; the specific labels are shown in Table 2 below.

Table 2 Correspondence between sample text information and category labels

The labels in Table 2 may also be recorded as numbers, by presetting a correspondence between each category and a number and entering that number in the table.

In this embodiment, sample text information can be selected only within the category range required by the service personnel. In principle, the more sample text information is used to create a sample keyword information set, the more keywords the final set contains and the better. In practice, however, workload limits the selection to a small number of samples, so a sample keyword information set created this way cannot fully represent the features of its category.

Therefore, to obtain a more complete sample keyword information set for each category, the keywords are expanded after the sample keyword information set of each category has been obtained from the sample text information.

S202, expanding the sample keyword information set classified according to the step S201.

In this step, the sample text information of each category is analyzed further. Preferably, keywords are extracted from the sample text information of each category, and external text information containing those keywords is obtained. The processing steps for expanding the keyword information set are shown in FIG. 3:

S301, extracting keywords from the sample information of each category.

For example, words such as "consumption" and "bank" are extracted as keywords from the sample text information "Your credit card ending in 44XX: consumption of 19,089.59 yuan on 02-21 at 13:56 [Construction Bank]" in the banking transaction category.

Words such as "hotel", "box" and "come" are extracted as keywords from the sample text information "Evening, Xinxin Hotel, box No. 2, 18:00, Wang Zong is also coming" in the meal office category.

S302, collecting external text information that contains the keywords extracted from the sample information of each category.

In this example, three topics are set: "banking transaction", "meal office" and "engineering project". In practice, text information of each category is expressed in many different ways, so each category must be mined in depth. For example, "banking transaction" short messages differ in format from bank to bank; in this case, the keywords "consumption" and "bank" extracted from the sample information should be used to search the Internet for short messages from each major bank that contain them.

S303, extracting the other keywords from the text information obtained in step S302 and adding them to the keyword information set of the corresponding category. The keywords obtained by expanding each category are summarized, and the resulting sample keyword information sets are shown in Table 3 below.

Table 3 Sample keyword information sets of each category after expansion
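Steps S301 to S303 can be sketched as a simple expansion loop. The external texts, the `stop_words` set, and the whitespace tokenizer are assumptions for illustration; how external text is actually collected from the Internet is outside this sketch.

```python
import re

def segment(text):
    """Whitespace splitting after punctuation removal; a stand-in for a real segmenter."""
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def expand_keywords(seed_keywords, external_texts, stop_words):
    """S302 and S303: pull new words from external texts that contain a seed keyword."""
    expanded = list(seed_keywords)
    for text in external_texts:
        tokens = segment(text)
        if not any(token in seed_keywords for token in tokens):
            continue  # S302: keep only texts that mention a seed keyword
        for token in tokens:
            if token not in expanded and token not in stop_words and not token.isdigit():
                expanded.append(token)  # S303: add the other keywords
    return expanded

seeds = ["consumption", "bank"]  # S301: keywords extracted from the category's samples
external = [
    "Bank notice: withdrawal of 500 from your account",
    "Weather: cloudy tomorrow",
]
print(expand_keywords(seeds, external, stop_words={"of", "from", "your", "notice"}))
# ['consumption', 'bank', 'withdrawal', 'account']
```

Only the original seed keywords decide whether an external text is relevant, so newly added words do not pull in unrelated texts during the same pass.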

S203, obtaining text information to be classified, segmenting the text information to be classified, and extracting keyword information. Specifically, after punctuation marks in text information are removed, keyword segmentation is carried out according to the solar calendar sequence of the content of the text information, and at least one keyword information is obtained through segmentation.

S204, matching each keyword information set with the corresponding keyword information set of the text category, and outputting the corresponding character string.

In the step, matching inquiry is carried out according to the keywords of the text information after word segmentation, and a matching result of the text information is output, and the specific steps are as follows:

and step A, after the text information is segmented, the keywords which appear are expressed by 1, and the keywords which do not appear are expressed by 0.

For example: "evening/euphoria hotel/number 2/box/18:00/dot/Wang Zong/also/come/go". After the text information word segmentation technology is carried out on the sentence, the keyword information set is adopted to carry out keyword matching, for example, the keyword information sets of the three types of topic classification "banking transaction", "meal bureau" and "engineering project" are 500, and the output result of the keyword information matching of the text information is a 500-dimensional vector expression mode.

[ transfer into income bank and pay out and consume and withdraw account credit card transfer out and transfer to … … dining small-sized empty packet box and head-of-head hotel, restaurant, package house and evening party wine bank hall to eat food-sized … … engineering money and pay out and pay back loan borrow … … ]

By analyzing the text information, namely 'evening/Xinxin hotel/No. 2/box/18:00/dot/Wang Zong/also/come', the keywords 'hotel', 'box', 'come' in the text information can be identified to be the content in the keyword information set with the 'meal bureau' category.

The resulting 500-dimensional string is: 【0 0 0 0 0 0 0 0 0……0 0 0 1 0 1 0 0 0 0 0 1……0 0 0 0……】

Step B, forming a keyword character string of the text information, whose content is the 500-dimensional character string:

【0 0 0 0 0 0 0 0 0……0 0 0 1 0 1 0 0 0 0 0 1……0 0 0 0……】

The actual dimension of the string depends on the actual text category keyword information set.
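
Steps A and B can be sketched as follows. This is a minimal illustration rather than the patented implementation: the six-entry `KEYWORDS` list is a hypothetical stand-in for the roughly 500 keywords of the example, the token list is simplified so that "hotel" is its own token, and the function name `keyword_string` is an assumption.

```python
# Hypothetical, heavily shortened keyword list. The example in the text uses
# about 500 keywords across "banking transaction", "meal bureau" and
# "engineering project"; only six placeholder keywords are used here.
KEYWORDS = ["transfer", "withdraw", "hotel", "box", "come", "loan"]


def keyword_string(tokens, keywords=KEYWORDS):
    """Return one 0/1 flag per keyword, in keyword-list order.

    1 means the keyword occurs among the segmented tokens, 0 means it does not.
    """
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]


# Simplified token list (assumption: "hotel" is emitted as a separate token).
tokens = ["evening", "hotel", "No. 2", "box", "18:00", "Wang Zong", "also", "come"]
vector = keyword_string(tokens)
print(" ".join(str(flag) for flag in vector))
```

The length of the resulting list always equals the length of the keyword list, which is why the dimension of the string follows the size of the category keyword information sets.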

S205, analyzing according to the output character strings to obtain text category information corresponding to the text information to be classified, and performing classification processing.

In this embodiment, the classifying processing for the text information in step S205 specifically includes: performing classification processing according to a classification model obtained by pre-learning, wherein the classification model is obtained by learning sample text information, and the processing steps are as follows:

Step one, obtaining character strings which are output according to matching of each keyword information set and the corresponding text type keyword information set.

Step two, processing the character strings with the classification model obtained by pre-learning.

Step three, acquiring the corresponding text category information according to the result of the model.

In this embodiment, the classification model is obtained by learning in the following manner, and the processing steps are shown in fig. 4.

S401, acquiring a keyword information set of sample text information of each text category.

S402, matching each keyword information set with the corresponding keyword information set of the text category, and outputting the corresponding character string.

S403, taking the character strings of the text information of each sample as learning input of the model, and learning to obtain a classification model of the corresponding category.

S404, establishing a corresponding relation between the classification model and the text classification information.
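
Steps S401 to S404 can be sketched as follows, under stated assumptions. The sample keyword sets, the labelled samples and all function names are hypothetical, and `threshold_learner` is a deliberately trivial stand-in for the learning algorithm, not a random forest. The sketch only shows how one model per category is learned from 0/1 strings and then registered against its category.

```python
# Hypothetical sample keyword sets per text category (S401).
SAMPLE_KEYWORDS = {
    "banking transaction": ["transfer", "withdraw", "balance"],
    "meal bureau": ["hotel", "box", "eat"],
}

# Hypothetical, already segmented sample texts with their known categories.
SAMPLES = [
    (["card", "withdraw", "balance"], "banking transaction"),
    (["evening", "hotel", "box"], "meal bureau"),
]


def to_string(tokens, keywords):
    """S402: match tokens against one keyword set and emit a 0/1 string."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]


def build_training_sets(samples, sample_keywords):
    """For each category, pair every sample's string with a yes/no label."""
    return {
        category: [(to_string(tokens, keywords), label == category)
                   for tokens, label in samples]
        for category, keywords in sample_keywords.items()
    }


def threshold_learner(pairs):
    """Stand-in learner: accept strings with at least as many 1s as the
    weakest positive sample. A real system would train a classifier here."""
    positives = [sum(string) for string, is_positive in pairs if is_positive]
    threshold = min(positives) if positives else 1
    return lambda string: sum(string) >= threshold


def train_models(training, learner=threshold_learner):
    """S403 and S404: learn one model per category, keyed by category name."""
    return {category: learner(pairs) for category, pairs in training.items()}


training = build_training_sets(SAMPLES, SAMPLE_KEYWORDS)
models = train_models(training)
```

The returned dictionary plays the role of the correspondence between classification model and text category information: looking up a category name yields the model trained for it.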

In this embodiment, a random forest model is preferably learned. In the model training stage, the character strings and text category information of the input text information are received as input samples, the model is trained by a stochastic gradient method, and the random forest model parameters are output and stored once the training error reaches a certain threshold value.

The random forest is a common machine learning model that performs relatively well. A plurality of relatively simple decision trees are trained simultaneously, the classification results of all the decision trees are put to a majority vote, and the voting result is used as the final output of the model.
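
The majority-vote principle can be illustrated with a minimal sketch. The "trees" below are hand-written one-feature stumps rather than trained decision trees, and the names `make_stump`, `majority_vote` and `forest_predict` are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def make_stump(index, label_if_one, label_if_zero):
    """Hypothetical one-feature 'tree': looks at a single position of the
    0/1 keyword string and returns one of two labels."""
    return lambda string: label_if_one if string[index] == 1 else label_if_zero


# Three hand-written stumps standing in for trained decision trees.
forest = [
    make_stump(2, "meal bureau", "other"),
    make_stump(3, "meal bureau", "other"),
    make_stump(0, "banking transaction", "other"),
]


def forest_predict(trees, string):
    """Collect one prediction per tree and return the majority label."""
    return majority_vote([tree(string) for tree in trees])


print(forest_predict(forest, [0, 0, 1, 1, 1, 0]))
```

In a real random forest each tree is trained on a different random subset of samples and features, so individual trees may disagree; the vote is what makes the combined output more stable than any single tree.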

In this embodiment, the text information classification method is applied to a client, so as to implement information classification of the client, and may also be applied to an interface access system based on a browser mode, as shown in fig. 5:

S501, the user accesses the user interface through the browser.

S502, the user interface and the WEB server perform message interaction and corresponding commands are issued; the commands include creating the keyword information set of a text category and expanding the keyword information set, and the analysis processing result is visually displayed through the interface.

S503, the WEB server actually issues instructions through the REST service, including algorithm training, single text analysis, batch text analysis and the like.

S504, carrying out algorithm training by using REST service and machine learning algorithm processing, and training by using a random forest algorithm model.

S505, training a machine learning algorithm model into a plurality of classifiers according to topic classification, so that the text information classification processing is convenient to use later.

S506, the REST service adopts different classifiers to analyze the text to be analyzed.

S507, finally, through the above cooperative operation and internal algorithm learning, text analysis is carried out to classify the information.

Third embodiment:

Referring to fig. 6, fig. 6 is a flowchart of the process for classifying a single text message according to the present embodiment. The process steps are as follows:

S601, a single text message is input, for example, the single text "Tomorrow evening, eat in the president package at the Jinling restaurant."

S602, punctuation marks are removed from the input text information and word segmentation is performed: "tomorrow/evening/Jinling/restaurant/president/package/eat".

S603, the text is thereby split into a plurality of words: "tomorrow", "evening", "Jinling", "restaurant", "president", "package" and "eat".

S604, extracting keywords from the text information and performing vectorization analysis with the category keyword information sets in the system, wherein a keyword that is present is represented by 1 and a keyword that is absent is represented by 0.

Using the above category keyword information sets, the keywords "restaurant", "package" and "eat" are found to be present. The keyword string is, for example:

【0 0 0 0 0 0 0 0 0……0 0 0 0 0 0 1 0 1 0 0 0 1 0……0 0 0 0……】

S605, forming a character string of the keywords, which can subsequently be used for classification analysis.

The character string is as follows:

【0 0 0 0 0 0 0 0 0……0 0 0 0 0 0 1 0 1 0 0 0 1 0……0 0 0 0……】

S606, performing classification analysis on the single text.

For example, when the single text "Tomorrow evening, eat in the president package at the Jinling restaurant." is entered, it can be identified as information related to the case, and its text category is "meal bureau".

Fig. 7 is a flowchart of the process for classifying batch text information provided in this embodiment. Batch text analysis repeats the single-text process in a loop and additionally requires a batch text upload step. The process includes the following steps:

S701, uploading the batch text information to an analysis system.

Here, a simple example is:

1. "Your card ending in 65XX: at 16:13 on the 2nd, outlet expenditure (cash withdrawal) of 130,000 yuan; balance 3,125.97 yuan, available balance 53,125.97 yuan. [Industrial and Commercial Bank]"

2. "A payment of 1,000,000 has arrived in the account."

3. "you play at home, in the morning and evening: Apricot Forest wine home, No. 1 package." ……

S702, word segmentation is carried out on each piece of text information.

S703, forming a keyword vector table, which can be used as a classification analysis.

And S704, performing classification analysis on each piece of text information.

For the batch text exemplified above, classification into the three categories "banking transaction", "meal bureau" and "engineering project" can be performed, as shown in Table 4 below.

Table 4. Text information correspondence table after classification

The system automatically classifies the newly appeared text information based on automatic machine learning.
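
Since batch analysis is described as a loop over the single-text process, steps S701 to S704 can be sketched as below. The function `toy_classifier` is a hypothetical stand-in for the full single-text pipeline (segmentation, keyword string, trained model), and the three messages are simplified placeholders, not the data of Table 4.

```python
from collections import defaultdict


def classify_batch(texts, classify_single):
    """Run the single-text classifier over every uploaded text and group
    the texts by the category each one was assigned."""
    grouped = defaultdict(list)
    for text in texts:
        grouped[classify_single(text)].append(text)
    return dict(grouped)


def toy_classifier(text):
    """Hypothetical single-text classifier based on a few substring checks.
    A real system would apply segmentation, keyword matching and a model."""
    lowered = text.lower()
    if "balance" in lowered or "withdrawal" in lowered:
        return "banking transaction"
    if "package" in lowered or "restaurant" in lowered:
        return "meal bureau"
    return "unclassified"


# Simplified placeholder messages (assumptions, not the patent's examples).
batch = [
    "Card 65XX: withdrawal 130,000 yuan, balance 3,125.97 yuan.",
    "Apricot Forest wine home, No. 1 package.",
    "See you later.",
]
report = classify_batch(batch, toy_classifier)
```

Passing the single-text classifier in as a parameter keeps the batch loop independent of how an individual text is analyzed, which mirrors the description of batch analysis as a repetition of the single-text flow.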

By implementing the scheme provided by this embodiment, the category keyword information set is established automatically rather than manually in practical application, and the text information classification analysis of the system is realized with a machine learning training method, so that the working efficiency is greatly improved.

Fourth embodiment:

Referring to fig. 8, fig. 8 is a schematic structural diagram of the text information classification apparatus according to the present embodiment. The text information classifying apparatus 8 provided in the present embodiment includes: an acquisition module 81, an extraction module 82, a matching module 83 and a classification module 84, wherein:

The obtaining module 81 is configured to obtain text information to be classified. Preferably, the obtained text information to be classified includes at least one text information, where the text information may belong to the same text category or to different text categories. In addition, the text information may be converted from other types of information, such as voice or video information. When the acquired information is voice, the text information corresponding to the voice is acquired during the information classification process; specifically, the voice is converted into text information through a speech-to-text recognition plug-in.

The extracting module 82 is configured to extract a keyword information set from the text information to be classified according to a preset rule, where the keyword information set includes at least one keyword information.

The matching module 83 is configured to match text category information corresponding to the keyword information set according to the keyword information set and a preset correspondence between the sample keyword information set and the text category information.

The classification module 84 is configured to classify the text information to be classified according to the matched text category information.

In this embodiment, when extracting the keyword information set of the text information to be classified, the extracting module 82 performs word segmentation on the text information to be classified according to the word segmentation processing technology, segments it into at least one keyword information, and collects the segmented keyword information to form the keyword information set of the text information to be classified. Preferably, punctuation marks in the text information to be classified are removed first, and keyword segmentation is then performed according to the original sequence of the text information to be classified.
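
The preferred extraction order (remove punctuation, then segment in the original sequence) can be sketched as follows. Whitespace splitting is used here purely as a stand-in for a real word segmenter, which the source does not specify; the function names are assumptions.

```python
import re


def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def segment(text):
    """Strip punctuation, then split into tokens while keeping the original
    order. Whitespace splitting stands in for a real word segmenter, which
    would be needed for text written without spaces between words."""
    return strip_punctuation(text).split()


tokens = segment("tomorrow evening, Jinling restaurant: president package, eat.")
print("/".join(tokens))
```

Because the tokens are returned as a list, the original sequence of the text is preserved. Note that this simple pattern also strips the colon from a time such as "18:00", so a production rule set would need to treat such tokens separately.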

In this embodiment, the above modules may be integrated with a processor on the mobile terminal, and the processor is divided into modules having the above functions by software.

In this embodiment, the apparatus further includes a correspondence establishing module, configured to classify a plurality of sample text information obtained in advance, and extract keyword information of the sample text information of each text category after classification, to form a sample keyword information set; and establishing a corresponding relation between the sample keyword information set extracted from the sample text information of the same text category and the text category information.

In this embodiment, when matching text category information corresponding to the keyword information of the text information to be classified, the matching module 83 queries, according to the keyword information, whether each preset sample keyword information set contains the keyword information of the text information to be classified. If so, the corresponding keyword in the preset sample keyword information set currently being queried is marked as "present", and the other keywords are marked as "absent". After the query is completed, a character string is output, and the corresponding text category information is identified according to the marks; specifically, the output character string is analyzed, and the corresponding text category information is obtained according to the analysis result. Preferably, the character string is analyzed and classified according to the number of "present" marks in the keyword information set of each text category: the more "present" marks a text category has, the more likely the text information belongs to that category.
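
The count-based preference described above can be sketched as follows. The keyword sets are hypothetical placeholders, and returning `None` when no keyword matches is an assumption of this sketch, not something the source specifies.

```python
# Hypothetical sample keyword sets for the three example categories.
SAMPLE_KEYWORDS = {
    "banking transaction": ["transfer", "withdraw", "balance"],
    "meal bureau": ["restaurant", "package", "eat", "hotel"],
    "engineering project": ["engineering", "loan", "payment"],
}


def count_present(tokens, sample_keywords):
    """Count, per category, how many of its keywords are marked 'present'."""
    present = set(tokens)
    return {category: sum(kw in present for kw in keywords)
            for category, keywords in sample_keywords.items()}


def most_likely_category(tokens, sample_keywords):
    """Pick the category with the most 'present' marks.

    Returns None when no keyword matches at all (an assumption of this sketch).
    """
    counts = count_present(tokens, sample_keywords)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


tokens = ["tomorrow", "evening", "Jinling", "restaurant", "president", "package", "eat"]
print(most_likely_category(tokens, SAMPLE_KEYWORDS))
```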

In this embodiment, the matching module 83 is specifically configured to match each keyword information in the keyword information set with a sample keyword information set corresponding to each preset text category information, so as to obtain an original first character string corresponding to each sample keyword information set one to one or obtain an original second character string formed by arranging each original first character string according to a preset sequence; the original first character string comprises characters 0 and/or characters 1, the position sequence of each character 0 and 1 corresponds to the position sequence of each keyword information of each text category in a corresponding sample keyword information set one by one, the character 0 indicates that the keyword information of the text information to be classified does not exist in the sample keyword information set, and the character 1 indicates that the keyword information of the text information to be classified exists in the sample keyword information set; and identifying text category information corresponding to the keyword information set according to the obtained original character string.
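
The relationship between the original first character strings and the original second character string can be sketched as below. The keyword sets and the category order are hypothetical; the only property taken from the text is that each first string has one character per keyword of its sample set, in the set's order, and that the second string arranges the first strings in a preset sequence.

```python
# Hypothetical sample keyword sets; the position of each keyword is fixed.
SAMPLE_KEYWORDS = {
    "banking transaction": ["transfer", "withdraw"],
    "meal bureau": ["hotel", "box", "come"],
    "engineering project": ["loan"],
}

# Hypothetical preset order in which first strings are concatenated.
CATEGORY_ORDER = ["banking transaction", "meal bureau", "engineering project"]


def first_strings(tokens, sample_keywords):
    """One 0/1 string per sample keyword set; character i refers to keyword i."""
    present = set(tokens)
    return {category: "".join("1" if kw in present else "0" for kw in keywords)
            for category, keywords in sample_keywords.items()}


def second_string(firsts, order):
    """Concatenate the first strings in the preset category order."""
    return "".join(firsts[category] for category in order)


firsts = first_strings(["evening", "hotel", "box", "come"], SAMPLE_KEYWORDS)
combined = second_string(firsts, CATEGORY_ORDER)
print(firsts)
print(combined)
```

Keeping the per-category strings separate makes it easy to inspect one category at a time, while the concatenated form gives a single fixed-length input for a classification model.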

In this embodiment, the system further comprises a correction module;

the matching module 83 is configured to match the keyword information with a preset sample keyword information set, and obtain an original character string of the keyword information set of the text information to be classified, where the original character string includes an original first character string or an original second character string;

The correction module is used for correcting the obtained original character string according to the classification model obtained by pre-learning to obtain a final character string, and replacing the original character string by the final character string;

the matching module 83 obtains corresponding text category information according to the final character string.

In this embodiment, the correction module is further configured to acquire a classification model through training and learning of a model in advance, specifically, the correction module acquires a keyword information set of sample text information of each text category; matching each keyword information set with a sample keyword information set of a corresponding text category, and outputting a corresponding character string; performing model training learning on the character strings according to a preset training learning algorithm to obtain a classification model; and establishing a corresponding relation between the classification model and the text classification information.

In this embodiment, the correction module generates different classification models according to the output character strings of each category. For example, three categories, "banking transaction", "meal bureau" and "engineering project", are provided in this embodiment, so the correction module forms three random forest classification models, one per category, when performing model learning. In this way, the classification models can be used to analyze new text information and case information, and the system automatically identifies information related to the case and information belonging to these categories.

In the text information classification analysis mode, single text analysis can be performed according to classification models of various categories. In analyzing a single text, a user may input a single piece of text information, and analyze the single piece of text information.

In this embodiment, batch text analysis may be performed, and when batch text is analyzed, a batch text file to be analyzed is uploaded, and classification analysis and category correlation analysis are performed on the text.

Various classification models can be used for various classification analyses, for example, determining whether a text belongs to the "banking transaction" category, so the approach is suitable for a variety of application scenarios.

The batch-analyzed data are stored and analysis reports can be downloaded, so that the user can conveniently obtain the classification results of the analyzed batch text.

In this embodiment, each function implemented by each module of the above text information classification system may be implemented by means of program codes, specifically, a processor on a terminal reads a pre-stored code for implementing text information classification from a memory, and compiles and executes the code to implement information acquisition and classification.

In summary, according to the text information classification method and the device thereof provided by the embodiment of the invention, through presetting the sample keyword information set of the text category and establishing the corresponding relation between the sample keyword information set of the text category and the text category information, a matching basis is provided for classifying the text information to be classified later, and the possibility is provided for realizing automatic matching classification; when the text information to be classified is classified, keyword information is extracted from the text information to be classified according to a preset rule, text category information corresponding to the text information to be classified is matched according to the corresponding relation between a preset sample keyword information set and the text category, and the text category information is matched through keywords, so that the operation steps of classifying the information are simplified, the working efficiency of classifying is greatly improved, and meanwhile, the classifying accuracy is also improved.

It will be appreciated by those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented in a general purpose computing device, they may be centralized on a single computing device, or distributed over a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they are stored in a computer storage medium (ROM/RAM, magnetic or optical disk) and, in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Therefore, the present invention is not limited to any specific combination of hardware and software.

The foregoing is a further detailed description of embodiments of the invention in connection with the specific embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A text information classification method, comprising:

acquiring text information to be classified;

extracting a keyword information set from the text information to be classified according to a preset rule, wherein the keyword information set comprises at least one keyword information;

matching text category information corresponding to the keyword information set according to the keyword information set and a correspondence between a preset sample keyword information set and text category information;

classifying the text information to be classified according to the matched text category information;

the extracting the keyword information set from the text information to be classified according to the preset rule comprises the following steps:

dividing words of the text information to be classified according to a word segmentation processing technology;

after determining that the segmentation is completed, obtaining a plurality of keyword information;

Extracting and screening the plurality of keyword information, and selecting keywords which can best represent the category of the text information to be classified to form the keyword information set;

the matching of the text category information corresponding to the keyword information set according to the keyword information set and the corresponding relation between the preset sample keyword information set and the text category information comprises the following steps:

matching each keyword information in the keyword information set with a sample keyword information set corresponding to each preset text category information to obtain an original first character string corresponding to each sample keyword information set one by one; the original first character string comprises characters 0 and/or characters 1, the position sequence of each character 0 and 1 corresponds to the position sequence of each keyword information of each text category in a corresponding sample keyword information set one by one, the character 0 indicates that the keyword information of the text information to be classified does not exist in the sample keyword information set, and the character 1 indicates that the keyword information of the text information to be classified exists in the sample keyword information set;

and identifying text category information corresponding to the keyword information set according to the obtained original character string.

2. The text information classification method according to claim 1, wherein the extracting the keyword information set from the text information to be classified according to the preset rule comprises: after eliminating punctuation marks in the text information to be classified, carrying out keyword segmentation according to the original sequence of the content of the text information to be classified, so as to obtain at least one keyword information.

3. The text information classification method according to claim 1, further comprising acquiring correspondence of the sample keyword information set and text category information by:

classifying a plurality of sample text information obtained in advance, extracting keyword information of each sample text information in each text category after classification, and forming the sample keyword information set;

and establishing a corresponding relation between a sample keyword information set extracted from sample text information of the same text category and the text category information.

4. A method for classifying text information according to any one of claims 1 to 3, wherein the matching text category information corresponding to the keyword information set according to the keyword information set and a correspondence between a preset sample keyword information set and text category information includes:

Matching each keyword information in the keyword information set with a sample keyword information set corresponding to each preset text category information to obtain an original second character string formed by arranging each original first character string according to a preset sequence;

5. The method for classifying text information according to claim 4, further comprising, after said obtaining said original first character string or said original second character string, before said identifying text category information corresponding to said keyword information set based on said obtained original character string:

6. A text information classification apparatus, comprising: an acquisition module, an extraction module, a matching module and a classification module;

the acquisition module is used for acquiring text information to be classified;

the classification module is used for classifying the text information to be classified according to the matched text category information;

the extraction module is specifically configured to:

the matching module is used for matching each keyword information in the keyword information set with a sample keyword information set corresponding to each preset text category information to obtain an original first character string corresponding to each sample keyword information set one by one; the original first character string comprises characters 0 and/or characters 1, the position sequence of each character 0 and 1 corresponds to the position sequence of each keyword information of each text category in a corresponding sample keyword information set one by one, the character 0 indicates that the keyword information of the text information to be classified does not exist in the sample keyword information set, and the character 1 indicates that the keyword information of the text information to be classified exists in the sample keyword information set; and identifying text category information corresponding to the keyword information set according to the obtained original character string.

7. The text information classification apparatus according to claim 6, wherein the extraction module is configured to perform keyword segmentation according to an original sequence of contents of the text information to be classified after removing punctuation marks in the text information to be classified, so as to obtain at least one keyword information.

8. The text information classification apparatus of claim 6, further comprising: a correspondence establishing module, configured to classify a plurality of sample text information obtained in advance, extract keyword information of the sample text information of each text category after classification, and form a sample keyword information set; and to establish a corresponding relation between the sample keyword information set extracted from the sample text information of the same text category and the text category information.

9. The text information classification apparatus according to any one of claims 6 to 8, wherein the matching module is configured to match each keyword information in the keyword information set with a sample keyword information set corresponding to each preset text category information, so as to obtain an original second character string formed by arranging each original first character string according to a preset sequence; and to identify text category information corresponding to the keyword information set according to the obtained original character string.

10. The text information classification apparatus as claimed in claim 9, further comprising: a correction module, configured to correct the obtained original character string according to the classification model obtained by pre-learning to obtain a final character string, and to replace the original character string with the final character string.