CN110888977A - Text classification method and device, computer equipment and storage medium

Info

Publication number: CN110888977A
Application number: CN201811031765.6A
Authority: CN
Inventors: 方建生
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2020-03-17

Abstract

The application relates to a text classification method, a text classification device, a computer device and a storage medium. The method comprises the following steps: acquiring a text to be classified, and extracting entries of the text to be classified; searching a preset classification table according to the entries of the text to be classified, acquiring the frequency corresponding to those entries in the preset classification table, and determining the type of the text to be classified according to that frequency; and searching in a preset classification model according to the type of the text to be classified, and classifying the text to be classified under the corresponding entry of the preset classification model. Searching the preset classification table with the extracted entries determines the type of the text to be classified and thereby narrows the retrieval range; searching the preset classification model with that type then classifies the text accurately, so that texts can be qualitatively determined and classified automatically and classification efficiency is improved.

Description

Text classification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of text processing technologies, and in particular, to a text classification method, apparatus, computer device, and storage medium.

Background

With the development of text processing technology, more and more enterprises use it to qualitatively determine and classify large amounts of text data. In the course of providing products and services over a long period, an enterprise accumulates a large number of complaint work orders. The text on a complaint work order is the most direct feedback from customers using the products and services, and qualitatively determining and classifying complaint work orders by analyzing their text content provides a valuable reference for improving product and service quality.

However, the conventional text classification method is implemented by retrieving manually defined keywords from text contents, and the maintenance and selection of the keywords require manual operation, which causes problems of large workload, low classification efficiency, and the like.

Disclosure of Invention

In view of the above, it is necessary to provide a text classification method, apparatus, computer device and storage medium capable of improving classification efficiency.

A method of text classification, the method comprising:

acquiring a text to be classified, analyzing the text to be classified, and extracting entries of the text to be classified;

searching a preset classification table according to the entries of the text to be classified, acquiring the frequency corresponding to the entries of the text to be classified in the preset classification table, and determining the type of the text to be classified according to the frequency corresponding to the entries of the text to be classified;

searching in a preset classification model according to the type of the text to be classified, and classifying the text to be classified under corresponding entries of the preset classification model;

wherein the preset classification model is designed according to the frequency corresponding to the entries of historical classified texts and is used for representing the classification standard of the entries.

In one embodiment, before the steps of obtaining a text to be classified, analyzing the text to be classified, and extracting entries of the text to be classified, the method further includes:

acquiring a historical classified text, analyzing the historical classified text, and extracting entries of the historical classified text and corresponding frequency of the entries of the historical classified text;

constructing a preset classification table according to the entries of the historical classified texts and the frequency corresponding to the entries of the historical classified texts;

and designing a preset classification model according to the frequency corresponding to the entries of the historical classified text.

In one embodiment, the preset classification table is a secondary hash table, and the constructing of the preset classification table according to the entries of the historical classified texts and the frequency corresponding to the entries of the historical classified texts includes:

extracting word length corresponding to the entries and frequency levels corresponding to the entries according to the entries of the historical classified texts and the frequency corresponding to the entries of the historical classified texts;

and establishing primary hash function mapping by taking the word length corresponding to the entry as an independent variable of a first-stage function, and establishing secondary hash function mapping by taking the frequency level corresponding to the entry as an independent variable of a second-stage function to obtain a secondary hash table.

In one embodiment, the preset classification model is a B-tree, and designing the preset classification model according to the frequency corresponding to the entries of the historical classified text includes:

extracting the number of the entries according to the corresponding frequency of the entries of the historical classified text;

determining the number of keywords on the node of the B tree according to the number of the entries;

and designing to obtain the B tree according to the number of the keywords.

In one embodiment, acquiring a historical classified text, analyzing the historical classified text, and extracting the entries of the historical classified text and their corresponding frequencies includes:

acquiring a historical classified text, analyzing the historical classified text to obtain initial terms of the historical classified text and initial frequency corresponding to the initial terms of the historical classified text;

performing semantic recognition on the extracted initial entries of the historical classified texts to obtain recognition results, and determining an experience threshold of initial frequency according to the recognition results;

removing, from the historical classified text, the initial entries whose initial frequency is lower than the experience threshold, together with their initial frequencies;

and obtaining the entries of the historical classified text and their corresponding frequencies from the initial entries and initial frequencies that remain after the removal.

In one embodiment, obtaining the entries of the historical classified text and their corresponding frequencies from the initial entries and initial frequencies that remain after the removal includes:

judging whether entries having an inclusion relationship exist among the entries remaining after the removal, wherein the inclusion relationship means that the characters of a first entry contain the characters of a second entry, the first entry being a parent entry and the second entry being a child entry;

if yes, when the characters of the child entry include the first character and the last character of the parent entry, removing the child entry and its corresponding frequency to obtain the entries of the historical classified text and their corresponding frequencies.

In one embodiment, analyzing the historical classified text to obtain the initial entries of the historical classified text and the initial frequencies corresponding to the initial entries includes:

obtaining the sentence length corresponding to the historical classified text;

extracting original entries and corresponding original frequency of the historical classified text according to the sentence length;

and removing, from the original entries, those that meet preset removal conditions, together with their original frequencies, to obtain the initial entries of the historical classified text and the initial frequencies corresponding to the initial entries.

An apparatus for text classification, the apparatus comprising:

the acquisition module is used for acquiring the text to be classified, analyzing the text to be classified and extracting entries of the text to be classified;

the qualitative module is used for searching a preset classification table according to the entries of the texts to be classified, acquiring the frequency corresponding to the entries of the texts to be classified in the preset classification table, and determining the types of the texts to be classified according to the frequency corresponding to the entries of the texts to be classified;

the classification module is used for searching in a preset classification model according to the type of the text to be classified and classifying the text to be classified under corresponding entries of the preset classification model;

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

With the above text classification method, device, computer equipment and storage medium, a text to be classified is acquired and its entries are extracted; a preset classification table is searched according to those entries, the frequency corresponding to the entries in the preset classification table is acquired, and the type of the text to be classified is determined according to that frequency; and a preset classification model is searched according to the type of the text to be classified, and the text is classified under the corresponding entry of the preset classification model. Searching the preset classification table with the extracted entries determines the type of the text to be classified and thereby narrows the retrieval range; searching the preset classification model with that type then classifies the text accurately, so that texts can be qualitatively determined and classified automatically and classification efficiency is improved.

Drawings

FIG. 1 is a flow diagram that illustrates a method for text classification in one embodiment;

FIG. 2 is a flowchart illustrating a text classification method according to another embodiment;

FIG. 3 is a flowchart illustrating step S100 of the text classification method according to an embodiment;

FIG. 4 is a flowchart illustrating step S100 of the text classification method according to another embodiment;

FIG. 5 is a flowchart illustrating step S100 of the text classification method according to another embodiment;

FIG. 6 is a flowchart illustrating a text classification method according to another embodiment;

FIG. 7 is a diagram of a secondary hash table of the text classification method in one embodiment;

FIG. 8 is a flowchart illustrating a text classification method according to still another embodiment;

FIG. 9 is a diagram of the B-tree of the text classification method in one embodiment;

FIG. 10 is a block diagram showing the structure of a text classification device in one embodiment;

FIG. 11 is a block diagram showing the construction of a text classification device in another embodiment;

FIG. 12 is a block diagram of a first parsing module of the text classification device in one embodiment;

FIG. 13 is a block diagram of a first parsing module of the text classification apparatus in another embodiment;

FIG. 14 is a block diagram showing the structure of a first parsing module of the text classification device in another embodiment;

FIG. 15 is a block diagram showing the structure of a construction module of the text classification device in one embodiment;

FIG. 16 is a block diagram showing the structure of a design block of the text classification device in one embodiment;

FIG. 17 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The text classification method provided by the embodiments of the application can be applied to classifying various texts, in particular the complaint work orders that enterprises accumulate while providing products and services over a long period. The entries used in the method can be stored on a server, which can be implemented as an independent server or as a server cluster consisting of multiple servers. The text classification itself can be performed by a terminal, which can be, but is not limited to, a personal computer, notebook computer, smart phone or tablet computer.

In one embodiment, as shown in FIG. 1, there is provided a text classification method, including the steps of:

Step S400, acquiring a text to be classified, analyzing the text to be classified, and extracting entries of the text to be classified.

The text to be classified can be any text, such as sentences or paragraphs, from which entries can be extracted. Its entries are extracted by analyzing the text; specifically, the text can be analyzed on a Hadoop platform or a real-time Storm platform to extract entries that allow it to be accurately qualitatively determined and classified.

Step S500, searching a preset classification table according to the entries of the texts to be classified, acquiring the frequency corresponding to the entries of the texts to be classified in the preset classification table, and determining the types of the texts to be classified according to the frequency corresponding to the entries of the texts to be classified.

After the text to be classified is analyzed and its entries are extracted, the preset classification table is searched according to the extracted entries. The preset classification table is constructed from the entries of historical classified texts and their corresponding frequencies, and represents the correspondence between entries and frequencies. When an entry of the text to be classified cannot be found in the preset classification table, that entry cannot be used as a basis for the qualitative determination of the text.
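The lookup in step S500 can be sketched as follows. The description does not state how the retrieved frequencies are turned into a type, so this sketch assumes, purely for illustration, that the found entry with the highest frequency is taken as the type. The function name `determine_type` and the flat dictionary standing in for the preset classification table are hypothetical.

```python
from typing import Dict, Iterable, Optional


def determine_type(entries: Iterable[str], table: Dict[str, int]) -> Optional[str]:
    """Return the entry chosen as the text type, or None if nothing is found.

    Assumption (not from the source): the type is the found entry
    with the highest frequency in the table.
    """
    best_entry, best_freq = None, -1
    for entry in entries:
        freq = table.get(entry)
        if freq is None:
            # An entry absent from the table cannot serve as a basis
            # for qualitative determination, so it is skipped.
            continue
        if freq > best_freq:
            best_entry, best_freq = entry, freq
    return best_entry


# Hypothetical table contents, for illustration only.
table = {"broadband fault": 1200, "unable to connect": 800}
print(determine_type(["unable to connect", "broadband fault", "xyz"], table))
```

Entries that are missing from the table are ignored rather than treated as errors, which mirrors the statement that such entries cannot be used as a basis for qualitative determination.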

Step S600, searching in a preset classification model according to the type of the text to be classified, and classifying the text to be classified under corresponding entries of the preset classification model.

After the type of the text to be classified is determined by searching the preset classification table, the preset classification model is searched according to that type. The preset classification model is designed according to the frequency corresponding to the entries of the historical classified text and represents the classification standard of the entries. When the type of the text to be classified is found in the preset classification model, the text is classified under the corresponding entry. In one embodiment, the preset classification model is further used to count the number of texts of the same type: when the type of a text to be classified is found in the model, the text is classified under the corresponding entry and the count of texts of that type is updated, so that the user can obtain the total number of texts of that type. For example, if the texts to be classified are complaint work orders accumulated by an enterprise while providing products and services over a long period, the user can take corresponding measures for the products and services according to the number of texts of each type.

In the above text classification method, a text to be classified is acquired and its entries are extracted; a preset classification table is searched according to those entries, the frequency corresponding to the entries in the preset classification table is acquired, and the type of the text to be classified is determined according to that frequency; and a preset classification model is searched according to the type of the text to be classified, and the text is classified under the corresponding entry of the preset classification model. Searching the preset classification table with the extracted entries determines the type of the text to be classified and thereby narrows the retrieval range; searching the preset classification model with that type then classifies the text accurately, so that texts can be qualitatively determined and classified automatically and classification efficiency is improved.

In one embodiment, as shown in FIG. 2, before step S400, step S100 to step S300 are further included.

Step S100, acquiring the historical classified text, analyzing the historical classified text, and extracting the entries of the historical classified text and their corresponding frequencies.

The historical classified text is the text data used for constructing the preset classification table and designing the preset classification model. It is analyzed to extract its entries and their corresponding frequencies. As with the text to be classified, the historical classified text can be analyzed on a big data platform such as a Hadoop platform or a real-time Storm platform. The core of the Hadoop platform is the HDFS (Hadoop Distributed File System) storage system and the MapReduce computing framework: HDFS provides storage for massive data, and MapReduce provides computation over massive data. Using the HDFS storage capacity and the MapReduce computing framework, the entries for qualitative determination and classification, and their frequencies, are trained from the historical classified texts. The Hadoop platform is well suited to offline analysis, and the entries and frequencies can be updated and maintained on the basis of its distributed processing capacity. The real-time Storm platform is mainly used for real-time analysis in scenarios with strict timeliness requirements, and its core code is written in the functional programming language Clojure. It can likewise train the entries for qualitative determination and classification and their frequencies from the historical classified texts, and can dynamically maintain and update the entries and frequencies in real time as texts are classified. Because the big data platform provides distributed storage and computing capacity, analyzing the historical classified text on it to extract the entries and their frequencies allows massive text data to be processed quickly.

Step S200, constructing a preset classification table according to the entries of the historical classified text and their corresponding frequencies.

After the historical classified text is analyzed and its entries and their corresponding frequencies are extracted, the preset classification table is constructed from those entries and frequencies. The form of the preset classification table is not unique, but it must represent the correspondence between entries and frequencies; in one embodiment, the preset classification table is a secondary (two-level) hash table. Because a large number of entries are extracted from the historical classified text, constructing the preset classification table improves search efficiency and allows the text to be classified to be qualitatively determined quickly.
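A minimal sketch of a two-level hash table of this kind is given below. It follows the disclosure in using the word length of an entry as the first-level key and the frequency level as the second-level key. The description does not define "frequency level", so the sketch assumes it is the number of decimal digits of the frequency; `frequency_level`, `build_table` and `lookup` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Optional


def frequency_level(freq: int) -> int:
    """Assumed definition: the number of decimal digits of the frequency."""
    return len(str(freq))


def build_table(entry_freqs: Dict[str, int]):
    # First level: word length. Second level: frequency level.
    # Innermost dict: entry -> frequency.
    table = defaultdict(lambda: defaultdict(dict))
    for entry, freq in entry_freqs.items():
        table[len(entry)][frequency_level(freq)][entry] = freq
    return table


def lookup(table, entry: str) -> Optional[int]:
    # Only the buckets for this word length are searched, which is
    # what narrows the search compared with one flat list of entries.
    for bucket in table.get(len(entry), {}).values():
        if entry in bucket:
            return bucket[entry]
    return None


table = build_table({"ab": 15, "abc": 150, "cd": 7})
print(lookup(table, "ab"), lookup(table, "zz"))
```

Because the frequency of a query entry is not known in advance, the lookup walks the second-level buckets under the matching word length; other designs are possible and the source does not prescribe one.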

Step S300, designing a preset classification model according to the frequency corresponding to the entries of the historical classified text.

After the historical classified text is analyzed and its entries and their corresponding frequencies are extracted, the preset classification model is designed according to those frequencies. The type of the preset classification model is not unique, but it must represent the classification standard of the entries; in one embodiment, the preset classification model is a B-tree. The preset classification model is also used to count the number of texts corresponding to entries of the same type: each text to be classified is counted as it is classified, so that the user can learn the total number of texts of each type from the model.
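The following is a minimal sketch of a B-tree used as a classification model, in which each key is an entry and carries a count of the texts classified under it. It is a textbook B-tree, not the specific design of the application: the description does not say how the number of keywords per node is derived from the number of entries, so the minimum degree `t` is simply a parameter here, and all class and method names are hypothetical.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []      # sorted entries (the "keywords" of the node)
        self.counts = []    # number of texts classified under each key
        self.children = []  # child nodes; empty for a leaf
        self.leaf = leaf


class BTree:
    def __init__(self, t=2):
        self.t = t  # minimum degree: a node holds at most 2t - 1 keys
        self.root = BTreeNode()

    def search(self, key, node=None):
        """Return (node, index) for key, or None if key is absent."""
        if node is None:
            node = self.root
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if node.leaf:
            return None
        return self.search(key, node.children[i])

    def _split_child(self, parent, i):
        # Split the full child at index i around its median key.
        t, child = self.t, parent.children[i]
        right = BTreeNode(leaf=child.leaf)
        mid_key, mid_count = child.keys[t - 1], child.counts[t - 1]
        right.keys, right.counts = child.keys[t:], child.counts[t:]
        child.keys, child.counts = child.keys[:t - 1], child.counts[:t - 1]
        if not child.leaf:
            right.children = child.children[t:]
            child.children = child.children[:t]
        parent.keys.insert(i, mid_key)
        parent.counts.insert(i, mid_count)
        parent.children.insert(i + 1, right)

    def insert(self, key):
        if self.search(key) is not None:
            return  # entry already present
        if len(self.root.keys) == 2 * self.t - 1:
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        i = bisect_left(node.keys, key)
        if node.leaf:
            node.keys.insert(i, key)
            node.counts.insert(i, 0)
            return
        if len(node.children[i].keys) == 2 * self.t - 1:
            self._split_child(node, i)
            if key > node.keys[i]:
                i += 1
        self._insert_nonfull(node.children[i], key)

    def classify(self, text_type):
        """Count one text under text_type; return the new count or None."""
        hit = self.search(text_type)
        if hit is None:
            return None
        node, i = hit
        node.counts[i] += 1
        return node.counts[i]
```

A model would be populated by calling `insert` for each entry obtained from the historical classified text, after which `classify` both locates the entry and updates its text count.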

In one embodiment, as shown in FIG. 3, step S100 includes steps S110 to S140.

Step S110, acquiring the historical classified text, and analyzing the historical classified text to obtain the initial entries of the historical classified text and the initial frequencies corresponding to the initial entries.

After the historical classified text is obtained, it is analyzed. The analysis is a training process that mainly consists of finding meaningful words and sentences in the character strings of the historical classified text and counting the frequency of the corresponding entries. Analyzing the historical classified text screens out part of the meaningless entries and yields the initial entries of the historical classified text and the initial frequencies corresponding to the initial entries.

Step S120, performing semantic recognition on the extracted initial entries of the historical classified text to obtain recognition results, and determining an experience threshold of the initial frequency according to the recognition results.

After the initial entries of the historical classified text are extracted, an open-source Chinese analyzer is used to recognize the semantics of the initial entries. Based on insight into low-frequency entries, the initial entries with the lowest frequency are identified, and the experience threshold of the initial frequency is determined according to the recognition results. The experience threshold serves as the boundary above which an entry is considered meaningful, so the initial entries can be further screened, which improves the accuracy of qualitative determination with the preset classification table and of classification with the preset classification model.

Step S130, removing, from the historical classified text, the initial entries whose initial frequency is lower than the experience threshold, together with their initial frequencies.

After the semantics of the initial entries are recognized and the experience threshold is determined, the initial entries and their initial frequencies are further screened with the experience threshold. For each initial entry, it is judged whether its initial frequency is lower than the experience threshold: if it is lower, the initial entry and its initial frequency are removed; if it is greater than or equal to the experience threshold, the initial entry and its initial frequency are kept.
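The screening in step S130 amounts to a simple filter, sketched below. The function name `filter_by_threshold` and the sample frequencies are hypothetical; how the threshold itself is chosen is covered by step S120 and is not modeled here.

```python
from typing import Dict


def filter_by_threshold(initial: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Keep the initial entries whose frequency is >= threshold."""
    return {entry: freq for entry, freq in initial.items() if freq >= threshold}


# Hypothetical initial entries and frequencies, for illustration only.
kept = filter_by_threshold({"a b": 1, "c d": 5, "e f": 3}, threshold=3)
print(kept)
```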

Step S140, obtaining the entries of the historical classified text and their corresponding frequencies from the initial entries and initial frequencies that remain after the removal.

After the initial entries and their initial frequencies are further screened with the determined experience threshold, the entries of the historical classified text and their corresponding frequencies are obtained directly from the initial entries and initial frequencies that remain. In one embodiment, the user may further screen the remaining initial entries and initial frequencies as needed, so as to further improve the accuracy of qualitative determination with the preset classification table and of classification with the preset classification model.

The process of obtaining the entries of the historical classified text and their corresponding frequencies from the initial entries and initial frequencies can be understood as training entries from the historical classified text, which mainly consists of finding meaningful words and sentences in character strings. When the historical classified text is trained on a Hadoop platform, the frequency of the entries is counted through the <key, value> mechanism of MapReduce. The algorithm for finding meaningful entries in a character string mainly extracts ordered substrings, that is, substrings of different lengths extracted in the order in which the words appear in the sentence, without reversing the order or changing which words are adjacent. By training on historical classified texts, a library of classification entries can be established through automatic machine learning and optimized in real time without manual intervention, which makes data processing more convenient.
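The extraction of ordered substrings and the counting of their frequency can be sketched as follows. This is a single-process stand-in for the MapReduce <key, value> counting, not Hadoop code: `collections.Counter` plays the role of the reduce step. The length bounds of 2 to n-1 follow the minimum and maximum entry lengths stated in the description; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def ordered_substrings(sentence: str) -> Iterator[str]:
    """Yield every contiguous substring of length 2..n-1, in original order."""
    n = len(sentence)
    for length in range(2, n):
        for start in range(n - length + 1):
            yield sentence[start:start + length]


def count_entries(texts: Iterable[str]) -> Counter:
    counts = Counter()
    for text in texts:
        # Each substring acts like an emitted <entry, 1> pair;
        # Counter sums the values per key.
        counts.update(ordered_substrings(text))
    return counts


print(list(ordered_substrings("abcd")))
print(count_entries(["abcd", "abce"])["abc"])
```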

In one embodiment, as shown in FIG. 4, step S110 includes step S112, step S114, and step S116.

In step S112, the sentence length corresponding to the history classified text is acquired.

After the history classified text is obtained, the corresponding sentence length, namely the total number n of all characters including punctuation marks in the history classified text content, is obtained from the history classified text.

Step S114, extracting the original entries and corresponding original frequencies of the historical classified text according to the sentence length.

The original entries and corresponding original frequencies of the historical classified text are extracted according to the obtained sentence length n; the original entries and original frequencies are those that have not undergone any removal processing. In one embodiment, for a character string of length n, the ordered substrings are: 1 substring of length n, 2 substrings of length n-1, and so on, up to n-1 substrings of length 2 and n substrings of length 1. Since the minimum entry length is 2 and the maximum is n-1, the maximum number of entries extracted from the content of one historical classified text is:

$$\sum_{k=2}^{n-1} (n-k+1) = 2 + 3 + \cdots + (n-1) = \frac{(n-2)(n+1)}{2}$$

Step S116, removing, from the original entries, those that meet preset removal conditions, together with their original frequencies, to obtain the initial entries of the historical classified text and the initial frequencies corresponding to the initial entries.

After the original entries and corresponding original frequencies of the historical classified text are extracted according to the sentence length, the original entries that meet the preset removal conditions are removed together with their original frequencies. The preset removal conditions include at least one of the following: entries whose word length is 1 or equals the sentence length n, entries containing punctuation marks, entries containing dummy (function) words, and repeated entries.
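The removal conditions of step S116 can be sketched as a filter over the ordered substrings of one sentence. The punctuation set and the function-word set below are assumptions, since the description does not list them, and the sample sentence is a hypothetical 16-character Chinese sentence chosen for illustration rather than text taken from the source. The name `initial_entries` is likewise hypothetical.

```python
import string
from typing import List, Set

# Assumed punctuation and single-character function words (not from the source).
PUNCTUATION: Set[str] = set(string.punctuation) | set(",。、;:!?")
FUNCTION_WORDS: Set[str] = {"的", "了", "吗"}


def initial_entries(sentence: str) -> List[str]:
    """Apply the four removal conditions to all ordered substrings."""
    n = len(sentence)
    seen: Set[str] = set()
    kept: List[str] = []
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            entry = sentence[start:start + length]
            if length == 1 or length == n:
                continue  # word length 1 or equal to the sentence length
            if any(ch in PUNCTUATION for ch in entry):
                continue  # contains a punctuation mark
            if any(ch in FUNCTION_WORDS for ch in entry):
                continue  # contains a function word
            if entry in seen:
                continue  # repeated entry
            seen.add(entry)
            kept.append(entry)
    return kept


SAMPLE = "宽带故障,无法连接,要求尽快修复"  # hypothetical, 16 characters
entries = initial_entries(SAMPLE)
print(len(entries))  # 27
```

For this sample, every substring longer than 6 characters spans a comma, so only entries of length 2 to 6 survive.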

In one embodiment, as shown in FIG. 5, step S140 includes step S142 and step S144.

Step S142, judging whether entries having an inclusion relationship exist among the entries remaining after the removal in the historical classified text.

To ensure that the entries used for qualitative determination and classification are valid, the entries remaining after the removal are further screened by judging whether entries having an inclusion relationship exist among them. The inclusion relationship means that the characters of a first entry contain the characters of a second entry; the first entry is the parent entry and the second entry is the child entry.

Step S144, when the characters of the child entry include the first character and the last character of the parent entry, removing the child entry and its corresponding frequency to obtain the entries of the historical classified text and their corresponding frequencies.

When entries having an inclusion relationship exist among the remaining entries, it is judged whether the characters of the child entry contain the first character and the last character of the parent entry. If they do, the child entry and its corresponding frequency are removed, and the remaining entries of the historical classified text and their frequencies are used as the entries and frequencies for qualitative determination and classification. If the characters of the child entry do not contain, or do not completely contain, the first and last characters of the parent entry, the child entry and the parent entry are both kept together with their frequencies. When no entries having an inclusion relationship exist among the remaining entries, the preset classification table and the preset classification model are constructed directly from the remaining entries and their corresponding frequencies.

Entry extraction is illustrated with a sentence from a certain historical classified text, 'broadband fault, unable to connect, requiring repair as soon as possible'. The length of the sentence is 16, and the extracted entries are as follows:

The entry with the word length of 2 comprises: broadband, fault, failure, legal, connection, requirement, exhaustion, quick repair and repair

The entry with the word length of 3 comprises: broadband fault, belt fault, failure connection, legal connection, requirement exhaustion, quick repair and quick repair

The entry with the word length of 4 comprises: broadband failure, failure to connect, demand, repair, or repair as soon as possible

The entry with the word length of 5 comprises: require repair as soon as possible

The entry with the word length of 6 comprises: require repair as soon as possible

In the extraction result, entries with a word length of 1 and the entry whose length equals the sentence length of 16 have already been removed; entries with word lengths of 7 to 15 all include punctuation marks and are therefore removed as well. Assuming that entries are extracted from 6 million complaint work order texts with an average length of 200 words, there will be about 120 billion entries in the worst case (every text being 200 words long), requiring nearly 45 TB of storage. For the content of each text, the complexity of the entry extraction algorithm is O(n²), where n is the text length; since n is bounded by a constant, it does not become a performance constraint. The extracted entries are then trained into meaningful entries, the training being a semi-supervised mechanism. The main purpose of the training is to remove pseudo entries whose frequencies are of the same order of magnitude, ensuring that the entries used for qualitative determination and classification are valid. Take the entry "with the cause" extracted in the above example: it is a meaningless pseudo entry, but because "broadband fault" includes it, it is removed in the training.
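The worst-case figure can be checked with a short calculation. The sketch below assumes that every contiguous substring of a text counts as one entry, so a text of length n yields n(n+1)/2 entries; this counting rule and the function name `worst_case_entries` are assumptions made for illustration.

```python
def worst_case_entries(num_texts, text_length):
    """Upper bound on extracted entries if every substring counts as an entry."""
    substrings_per_text = text_length * (text_length + 1) // 2
    return num_texts * substrings_per_text


total = worst_case_entries(6_000_000, 200)
print(f"{total:,}")  # 120,600,000,000
```

With 6 million texts of 200 words each, this gives roughly 1.2 × 10¹¹ entries, that is, on the order of 120 billion.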

In one embodiment, steps S112, S114 and S116 may be implemented by the following algorithmic pseudo code:
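The pseudo code itself is not reproduced in this text. As a stand-in, the following is a minimal Python sketch of the three steps, assuming that the preset elimination conditions are the ones named in the example above (word length 1, word length equal to the sentence length, and entries containing punctuation). The function names and the sample sentence are hypothetical.

```python
from collections import Counter
import unicodedata


def has_punctuation(text):
    """True if any character belongs to a Unicode punctuation category."""
    return any(unicodedata.category(ch).startswith("P") for ch in text)


def extract_initial_entries(sentence):
    # S112: obtain the sentence length n.
    n = len(sentence)
    # S114: extract every contiguous substring as an original entry and
    # count how often it occurs (O(n^2) substrings).
    original = Counter(
        sentence[i:j] for i in range(n) for j in range(i + 1, n + 1)
    )
    # S116: remove entries of length 1, entries as long as the sentence,
    # and entries that contain punctuation.
    return {
        entry: freq
        for entry, freq in original.items()
        if 1 < len(entry) < n and not has_punctuation(entry)
    }


entries = extract_initial_entries("abc,de")
```

For the sample sentence "abc,de", only "ab", "bc", "abc" and "de" survive, each with frequency 1.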

In one embodiment, when the text to be classified is obtained and analyzed in step S400, the entries extracted from the text to be classified can be screened with the entry extraction method described above. The text to be classified is then qualitatively determined and classified using the entries that remain after the meaningless entries are removed, which improves the efficiency of qualitative determination and classification.

In one embodiment, the preset classification table is a two-level hash table and, as shown in fig. 6, step S200 includes step S210 and step S220.

Step S210, extracting word length corresponding to the entries and frequency levels corresponding to the entries according to the entries of the historical classified texts and the frequency corresponding to the entries of the historical classified texts.

After the entries of the historical classified text and the corresponding frequency of the entries of the historical classified text are extracted and obtained by analyzing the historical classified text, the word length corresponding to the entries and the frequency level corresponding to the entries are extracted according to the entries of the historical classified text and the corresponding frequency of the entries of the historical classified text, wherein the word length corresponding to the entries is the number of characters contained in each entry, for example, if a certain entry contains 5 characters, the word length of the entry is 5. The frequency level corresponding to the vocabulary entry is a level established according to the frequency corresponding to the vocabulary entry, namely a frequency range set is established from the lowest frequency to the highest frequency, and the number of the frequency range sets can be flexibly set according to the lowest frequency and the highest frequency.
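One possible way to assign frequency levels is sketched below. The equal-width ranges, the convention that level 0 holds the highest frequencies, and the name `frequency_level` are all assumptions; the application only states that a set of frequency ranges is established between the lowest and highest frequency.

```python
def frequency_level(freq, lowest, highest, num_levels):
    """Map a frequency to a level index; level 0 holds the highest frequencies."""
    if highest == lowest:
        return 0
    width = (highest - lowest) / num_levels
    # Distance from the top of the range, clamped so that `lowest`
    # falls into the last level instead of one past it.
    return min(int((highest - freq) / width), num_levels - 1)


levels = [frequency_level(f, 1, 100, 4) for f in (100, 80, 50, 26, 1)]
```

With frequencies between 1 and 100 split into four levels, the sample frequencies 100, 80, 50, 26 and 1 fall into levels 0, 0, 2, 2 and 3.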

Step S220, establishing a primary hash function mapping by taking the word length corresponding to the entry as the independent variable of the first-stage function, and establishing a secondary hash function mapping by taking the frequency level corresponding to the entry as the independent variable of the second-stage function to obtain a secondary hash table.

After the word lengths and frequency levels corresponding to the entries have been extracted from the entries of the historical classified text and their corresponding frequencies, a primary hash function mapping is established with the word length of the entry as the independent variable of the first-level function, and a secondary hash function mapping is established with the frequency level of the entry as the independent variable of the second-level function. The entries with their counted frequencies are stored in an ordered linked list, from high frequency to low, to obtain the two-level hash table. In one embodiment, the two-level hash table constructed from the word lengths and frequency levels of the entries is shown in fig. 7. Retrieval at the first level of the hash table takes constant time, O(1); locating the frequency level and searching the ordered linked list at the second level amount, in the worst case, to a linear search over all n keys, namely O(n).

In one embodiment, when the type of the text to be classified needs to be determined, the entries of the text to be classified are extracted first. The first level of the hash table is located according to the word length of the entry, and retrieval in the second level starts from the highest frequency. If an entry matching the entry of the text to be classified is found in the corresponding frequency level set, the entry found in the hash table is used as the type of the text to be classified, and the frequency of that entry in the hash table is obtained.
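The construction and lookup described for the two-level hash table can be sketched as follows. This is an illustration under stated assumptions: nested Python dictionaries stand in for the two hash levels, a frequency-sorted list stands in for the ordered linked list, frequency levels are equal-width ranges with level 0 highest, and the names `build_table`, `lookup` and the sample entries are hypothetical.

```python
from collections import defaultdict


def build_table(freqs, num_levels=3):
    """Build {word_length: {frequency_level: [(entry, freq), ...]}}."""
    lowest, highest = min(freqs.values()), max(freqs.values())
    width = (highest - lowest) / num_levels or 1  # avoid a zero-width range

    def level(freq):
        return min(int((highest - freq) / width), num_levels - 1)

    table = defaultdict(lambda: defaultdict(list))
    for entry, freq in freqs.items():
        table[len(entry)][level(freq)].append((entry, freq))
    # Keep each bucket ordered from high frequency to low.
    for by_level in table.values():
        for bucket in by_level.values():
            bucket.sort(key=lambda pair: pair[1], reverse=True)
    return table


def lookup(table, entry):
    """Return the stored frequency of entry, or None if it is not in the table."""
    by_level = table.get(len(entry))       # first level: word length
    if by_level is None:
        return None
    for lvl in sorted(by_level):           # second level: highest frequencies first
        for stored, freq in by_level[lvl]:
            if stored == entry:
                return freq
    return None


freqs = {"broadband fault": 120, "unable to connect": 45, "repair": 300, "router": 10}
table = build_table(freqs)
repair_freq = lookup(table, "repair")
missing = lookup(table, "modem")
```

An entry that is absent from the table, such as "modem" here, yields `None` and so cannot serve as a basis for qualitative determination.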

In one embodiment, the preset classification model is a B-tree and, as shown in fig. 8, step S300 includes steps S310 to S330.

Step S310, extracting the number of the entries according to the corresponding frequency of the entries of the history classified text.

According to the frequencies corresponding to the entries of the historical classified text, a B tree is designed by sorting the entries by frequency, so as to reduce the impact on disk IO performance caused by page swapping. Specifically, the number of entries is extracted first; the total number of entries extracted from all historical classified texts is denoted by n.

Step S320, determining the number of keywords on the node of the B tree according to the number of the entries.

The number of keywords on a node of the B tree, namely the degree t, is determined according to the number n of entries and the sizes of the memory and the page file of the computer device, wherein one node of the B tree is stored in one page file; the height of the tree is h = log_t n.
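The relation h = log_t n can be evaluated with integer arithmetic, as in the sketch below. Rounding up to the smallest h with t^h ≥ n is an assumption added here so that the result is a whole number of levels; the function name `btree_height` is hypothetical.

```python
def btree_height(n, t):
    """Smallest h with t**h >= n, i.e. ceil(log_t n) without float rounding."""
    if n < 1 or t < 2:
        raise ValueError("need n >= 1 and t >= 2")
    h, capacity = 0, 1
    while capacity < n:
        capacity *= t
        h += 1
    return h


height = btree_height(1_000_000, 10)  # 6
```

For example, one million entries with degree t = 10 give a height of 6, so a search touches about six nodes (page files).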

And step S330, designing a B tree according to the number of the keywords.

After the number of keywords on a node of the B tree is determined, the B tree can be designed. In one embodiment, as shown in fig. 9, considering that the size of the page file is 4 KB, if the average length of each entry is 200 words, t = 10 can be set as the degree of a node. Except for the root node and the leaf nodes, the number of keywords on a node is t-1, and the key values of a child node lie strictly between the two adjacent key values i and j of its parent node. Setting a smaller number of keywords per node makes retrieval of higher-frequency entries faster. A B-tree search proceeds from the root node downwards; its time complexity is O(h) disk accesses and O(t × h) CPU running time.
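The downward search can be illustrated with the sketch below. It is a minimal model, not the design of fig. 9: the tree is built by hand rather than by insertion, keys are ordered lexicographically (an assumption, since the key ordering is not spelled out here), and the names `Node`, `search` and `classify` are hypothetical. Each node is scanned linearly, matching the stated O(t × h) CPU cost, and one node is visited per level, matching the O(h) disk accesses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                    # sorted entries on this node
    counts: list                                  # counts[i] = texts under keys[i]
    children: list = field(default_factory=list)  # empty for a leaf


def search(node, entry, visited=1):
    """Return (node, index, nodes_visited), or None if entry is absent."""
    i = 0
    while i < len(node.keys) and entry > node.keys[i]:
        i += 1                                    # linear scan within the node
    if i < len(node.keys) and node.keys[i] == entry:
        return node, i, visited
    if not node.children:
        return None
    return search(node.children[i], entry, visited + 1)


def classify(root, entry):
    """Count one more text under entry; return the new total or None."""
    hit = search(root, entry)
    if hit is None:
        return None
    node, i, _ = hit
    node.counts[i] += 1
    return node.counts[i]


# Hand-built two-level tree with hypothetical single-letter entries.
root = Node(
    keys=["m"],
    counts=[4],
    children=[Node(["c", "f"], [1, 2]), Node(["r", "x"], [7, 0])],
)
found = search(root, "r")
new_total = classify(root, "r")
```

In this example the entry "r" is found after visiting two nodes, and classifying one more text under it raises its count from 7 to 8.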

In one embodiment, when the text to be classified needs to be classified, the entries of the text to be classified are extracted first, and the nodes of the B tree are then searched according to those entries. When a corresponding entry is found in a node of the B tree, the text to be classified is classified under that entry and the frequency corresponding to the entry is obtained; the obtained frequency is the current total number of texts under that entry. Automatically qualifying and classifying the texts to be classified through the hash table and the B tree improves the efficiency of text classification while keeping the classification accuracy high.

It should be understood that although the steps in the flowcharts of fig. 1-6 and fig. 8 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in fig. 1-6 and fig. 8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 10, there is provided a text classification apparatus including: an acquisition module 400, a qualitative module 500, and a classification module 600, wherein:

The acquisition module 400 is configured to obtain a text to be classified, analyze the text to be classified, and extract entries of the text to be classified.

The text to be classified can be text content with entries extracted from sentences, paragraphs and the like, and the entries of the text to be classified can be extracted by analyzing the text to be classified.

The qualitative module 500 is configured to search a preset classification table according to the entries of the text to be classified, obtain the frequency corresponding to the entries of the text to be classified in the preset classification table, and determine the type of the text to be classified according to the frequency corresponding to the entries of the text to be classified.

When the entry corresponding to the entry of the text to be classified can be found in the preset classification table, acquiring the corresponding frequency of the entry of the text to be classified in the preset classification table, and determining the type of the text to be classified according to the acquired frequency; when the entry corresponding to the entry of the text to be classified cannot be found in the preset classification table, the entry cannot be used as a basis for qualitative determination of the text to be classified.

The classification module 600 is configured to search in a preset classification model according to the type of the text to be classified, and classify the text to be classified under a corresponding entry of the preset classification model.

After the type of the text to be classified is determined, a search is carried out in the preset classification model according to that type, and when the type is found in the preset classification model, the text to be classified is classified under the corresponding entry. In an embodiment, the preset classification model is further configured to count the number of texts to be classified of the same type: when the type of the text to be classified is found in the preset classification model, the text is classified under the corresponding entry and the number of texts of that type is counted at the same time, so that the user can obtain the total number of texts of that type.

In one embodiment, as shown in fig. 11, the text classification apparatus further includes: a first parsing module 100, a building module 200 and a design module 300.

The first parsing module 100 is configured to obtain a history classified text, parse the history classified text, and extract entries of the history classified text and corresponding frequencies of the entries of the history classified text.

The historical classified text is the text data used for constructing the preset classification table and designing the preset classification model. The historical classified text is analyzed, and its entries and the frequencies corresponding to those entries are extracted. Likewise, the historical classified text can be analyzed on a big data platform such as a Hadoop platform or a real-time Storm platform.

The building module 200 is configured to build a preset classification table according to the entries of the history classification text and the frequency corresponding to the entries of the history classification text.

And constructing a preset classification table according to the entries of the historical classified texts and the corresponding frequency of the entries of the historical classified texts, wherein the form of the preset classification table is not unique, but the preset classification table is required to be used for representing the corresponding relation between the entries and the frequency. Due to the fact that the number of the entries of the extracted historical classified text is large, the search efficiency can be improved by constructing the preset classification table, and the text to be classified can be rapidly qualified.

The design module 300 is configured to design a preset classification model according to the frequency corresponding to the vocabulary entry of the history classification text.

The preset classification model is designed according to the frequencies corresponding to the entries of the historical classified text. The type of the preset classification model is not unique, but the model must represent the classification standard of the entries. The preset classification model is also used for counting the number of texts corresponding to entries of the same type; the counting is performed while the texts to be classified are classified, so that a user can learn the total number of texts of the same type from the preset classification model.

In one embodiment, as shown in FIG. 12, first parsing module 100 includes a second parsing module 110, a semantic recognition module 120, a first culling module 130, and a filtering module 140.

The second parsing module 110 is configured to obtain the historical classified text, parse the historical classified text to obtain an initial entry of the historical classified text and an initial frequency corresponding to the initial entry of the historical classified text.

After the history classified text is obtained, analyzing the history classified text, wherein the analyzing process is a training process, and mainly comprises the steps of finding meaningful words and sentences from character strings of the history classified text, and screening a part of meaningless entries by analyzing the history classified text to obtain initial entries of the history classified text and initial frequency corresponding to the initial entries of the history classified text.

The semantic recognition module 120 is configured to perform semantic recognition on the extracted initial terms of the historical classified text to obtain a recognition result, and determine an experience threshold of an initial frequency according to the recognition result.

The first culling module 130 is configured to eliminate, from the historical classified text, the initial entries whose initial frequency is lower than the experience threshold, together with their corresponding initial frequencies.

It is judged whether the initial frequency corresponding to an initial entry is lower than the experience threshold. When it is lower, the initial entry and its corresponding initial frequency are removed; when it is greater than or equal to the experience threshold, the initial entry and its corresponding initial frequency are kept.
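The threshold rule above amounts to a simple filter, sketched here in Python. The function name `cull_below_threshold`, the sample entries and the threshold value are hypothetical; in the application the experience threshold is determined from the semantic recognition result.

```python
def cull_below_threshold(initial_freqs, threshold):
    """Keep only entries whose frequency is at least the experience threshold."""
    return {
        entry: freq
        for entry, freq in initial_freqs.items()
        if freq >= threshold
    }


initial = {"broadband": 40, "dband": 3, "fault": 25}
kept = cull_below_threshold(initial, threshold=25)
```

An entry whose frequency equals the threshold, such as "fault" here, is kept, since only frequencies strictly below the threshold are removed.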

The filtering module 140 is configured to obtain the entries of the historical classified text and the frequencies corresponding to those entries according to the initial entries and corresponding initial frequencies that remain in the historical classified text after the removal.

After the initial entries and their initial frequencies have been screened against the determined experience threshold, the entries of the historical classified text and their corresponding frequencies can be obtained directly from the initial entries and initial frequencies that remain.

In one embodiment, as shown in fig. 13, the second parsing module 110 further includes a length obtaining module 112, a first extracting module 114, and a second culling module 116.

The length obtaining module 112 is configured to obtain the sentence length corresponding to the historical classified text.

The first extraction module 114 is configured to extract the original entries of the historical classified text and their corresponding original frequencies according to the sentence length.

The original entries and their original frequencies are extracted according to the obtained sentence length n of the historical classified text; they are the entries and frequencies before any eliminating treatment has been applied.

The second culling module 116 is configured to eliminate the original entries that meet the preset elimination conditions, together with their corresponding original frequencies, to obtain the initial entries of the historical classified text and the initial frequencies corresponding to those initial entries.

In one embodiment, as shown in FIG. 14, the filtering module 140 further includes a determining module 142 and a third culling module 144.

The judging module 142 is configured to judge whether entries with an inclusion relation exist among the entries remaining in the historical classified text after the removal.

The third culling module 144 is configured to eliminate the sub-entry and its corresponding frequency when the characters in the sub-entry include the first character and the last character of the parent entry, so as to obtain the entries of the historical classified text and the frequencies corresponding to those entries.

When entries with an inclusion relation exist among the remaining entries of the historical classified text and the characters in the sub-entry include the first character and the last character of the parent entry, the sub-entry and its corresponding frequency are removed, and the remaining entries of the historical classified text and their corresponding frequencies are used as the entries and frequencies for qualitative determination and classification.

In one embodiment, the building module 200 includes a second extraction module 210 and a hash table creation module 220.

The second extraction module 210 is configured to extract word lengths corresponding to the terms and frequency levels corresponding to the terms according to the terms of the history classified text and the frequency corresponding to the terms of the history classified text.

And extracting the word length corresponding to the entries and the frequency level corresponding to the entries according to the entries of the historical classified text and the frequency corresponding to the entries, wherein the word length corresponding to the entries is the number of characters contained in each entry, and the frequency level corresponding to the entries is a level established according to the frequency corresponding to the entries, namely establishing a frequency range set from the lowest frequency to the highest frequency.

The hash table establishing module 220 is configured to establish a primary hash function mapping by using the word length corresponding to the entry as an argument of the first-level function, and establish a secondary hash function mapping by using the frequency level corresponding to the entry as an argument of the second-level function, so as to obtain a secondary hash table.

After the word lengths and frequency levels corresponding to the entries are extracted, a primary hash function mapping is established with the word length of the entry as the independent variable of the first-level function, a secondary hash function mapping is established with the frequency level of the entry as the independent variable of the second-level function, and the entries with their counted frequencies are stored in an ordered linked list, from high frequency to low, to obtain the two-level hash table.

In one embodiment, the design module 300 includes a third extraction module 310, a determination module 320, and a B-tree design module 330.

The third extraction module 310 is configured to extract the number of the entries according to the frequency corresponding to the entries of the history classified text.

The determining module 320 is configured to determine the number of keywords on the node of the B-tree according to the number of the entries.

The B-tree design module 330 is configured to design a B-tree according to the number of the keywords.

After the number of the keywords on the node of the B tree is determined, the B tree can be designed.

For the specific definition of the text classification device, reference may be made to the above definition of the text classification method, which is not repeated here. The modules in the text classification device can be realized wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 17. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing entries and corresponding frequency data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of text classification.

Those skilled in the art will appreciate that the architecture shown in fig. 17 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

The text classification device, the computer equipment and the storage medium acquire the text to be classified, and extract the entries of the text to be classified; searching a preset classification table according to the entries of the text to be classified, acquiring the frequency corresponding to the entries of the text to be classified in the preset classification table, and determining the type of the text to be classified according to the frequency corresponding to the entries of the text to be classified; and searching in a preset classification model according to the type of the text to be classified, and classifying the text to be classified under the corresponding entries of the preset classification model. The method comprises the steps of searching a preset classification table according to the extracted entries of the texts to be classified, determining the types of the texts to be classified, determining a retrieval range, searching in a preset classification model according to the types of the texts to be classified, and accurately classifying the texts to be classified, so that the texts can be automatically qualified and classified, and the classification efficiency is improved.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of text classification, the method comprising:

2. The method according to claim 1, wherein before the steps of obtaining the text to be classified, parsing the text to be classified, and extracting the entry of the text to be classified, the method further comprises:

constructing the preset classification table according to the entries of the historical classified texts and the frequency corresponding to the entries of the historical classified texts;

and designing the preset classification model according to the frequency corresponding to the entries of the historical classification text.

3. The method of claim 2, wherein the preset classification table is a secondary hash table, and the constructing the preset classification table according to the entries of the historical classified texts and the corresponding frequencies of the entries of the historical classified texts comprises:

4. The method according to claim 2, wherein the preset classification model is a B-tree, and the designing the preset classification model according to the frequency corresponding to the entries of the historical classified text comprises:

and designing to obtain a B tree according to the number of the keywords.

5. The method of claim 2, wherein the obtaining the historical classified text, analyzing the historical classified text, and extracting the entries of the historical classified text and the corresponding frequency of the entries of the historical classified text comprises:

performing semantic recognition on the extracted initial terms of the historical classified text to obtain a recognition result, and determining an experience threshold of the initial frequency according to the recognition result;

removing initial entries with initial frequency lower than the experience threshold value and corresponding initial frequency in the historical classified text;

6. The method of claim 5, wherein obtaining the entries of the historical classified text and the corresponding frequencies of the entries of the historical classified text according to the removed initial entries and the corresponding initial frequencies in the historical classified text comprises:

judging whether the entry with an inclusion relation exists in the removed entries in the historical classified text, wherein the inclusion relation is that the characters of the first entry comprise the characters of the second entry, the first entry is a parent entry, and the second entry is a child entry;

if so, when the characters in the sub-entry comprise the first character and the last character of the parent entry, eliminating the sub-entry and its corresponding frequency to obtain the entries of the historical classified text and the frequencies corresponding to those entries.

7. The method of claim 5, wherein the parsing the history classified text to obtain initial terms of the history classified text and initial frequencies corresponding to the initial terms of the history classified text comprises:

obtaining the sentence length corresponding to the historical classified text;

and removing the original entries meeting preset removal conditions in the original entries and the corresponding original frequency to obtain the initial entries of the historical classified texts and the initial frequency corresponding to the initial entries of the historical classified texts.

8. An apparatus for classifying text, the apparatus comprising:

the acquisition module is used for acquiring a text to be classified, analyzing the text to be classified and extracting entries of the text to be classified;

the classification module is used for searching in a preset classification model according to the type of the text to be classified and classifying the text to be classified under a corresponding entry of the preset classification model;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.