CN112579781A

CN112579781A - Text classification method and device, electronic equipment and medium

Info

Publication number: CN112579781A
Application number: CN202011581244.5A
Authority: CN
Inventors: 钱辉娟
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-03-30
Anticipated expiration: 2040-12-28
Also published as: CN112579781B

Abstract

The invention relates to an intelligent decision making technology, and discloses a text classification method, which comprises the following steps: obtaining a plurality of text keyword subsets identifying a plurality of text categories; acquiring a target keyword set based on a text to be classified; judging whether a text keyword subset matched with the target keyword set exists or not; if so, determining the text category corresponding to the matched text keyword subset as the category of the text to be classified; if not, calculating first attribution probability values of the target keyword set corresponding to the plurality of text categories respectively, calculating a second attribution probability value set according to the plurality of calculated first attribution probability values, and determining the text category corresponding to the maximum second attribution probability value as the category of the text to be classified. The invention also relates to a block chain technology, and a target keyword set and the like can be stored in the block chain nodes. The invention also discloses a text classification device, an electronic device and a storage medium. The method and the device can solve the problem of low accuracy of text classification.

Description

Text classification method and device, electronic equipment and medium

Technical Field

The invention relates to the technical field of intelligent decision making, in particular to a text classification method, a text classification device, electronic equipment and a computer readable storage medium.

Background

With the development of the internet, more and more information, such as person-to-person communication and records of people's interactions with information, is stored in electronic form, for example as text information. Classifying this text information facilitates the subsequent search and processing of the files that contain it. For example, when a chat conversation about a financial product takes place between a bank's financial manager and a user, classifying the chat conversation allows the sale of the financial product to be correctly attributed to the corresponding financial manager.

Existing text classification methods generally extract key information based on word frequency and classify the text according to that key information. However, this way of extracting key information does not take the contextually associated text into account, so the extracted key information is inaccurate and the accuracy of text classification is low.

Disclosure of Invention

The invention provides a text classification method, a text classification device, electronic equipment and a computer readable storage medium, and mainly aims to solve the problem of low accuracy of text classification.

In order to achieve the above object, the present invention provides a text classification method, including:

acquiring a historical text set, and extracting a text keyword set of the historical text set, wherein the text keyword set comprises a plurality of text keyword subsets for identifying a plurality of text categories;

acquiring a text to be classified;

preprocessing the text to be classified to obtain a standard text to be classified;

screening keywords with preset parts of speech in the standard text to be classified to obtain a candidate keyword set, and extracting a target keyword set from the candidate keyword set based on a graph sorting algorithm;

judging whether a text keyword subset matched with the target keyword set exists in the plurality of text keyword subsets;

when a text keyword subset matched with the target keyword set exists in the plurality of text keyword subsets, determining a text category corresponding to the text keyword subset matched with the target keyword set as the category of the text to be classified;

when the text keyword subsets which are matched with the target keyword set do not exist in the text keyword subsets, calculating first attribution probability values of the target keyword set corresponding to the text categories respectively by using a preset attribution probability model to obtain a first attribution probability value set, calculating a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula, and determining the text category corresponding to the maximum second attribution probability value in the second attribution probability value set as the category of the text to be classified.

Optionally, the extracting a set of text keywords of the historical text set includes:

carrying out sentence segmentation processing on the historical text set, using the period as the delimiter, to obtain an initial sentence set;

carrying out stop word removal processing on each sentence in the initial sentence set to obtain a stop sentence set;

performing word segmentation processing on each sentence in the stop sentence set to obtain a word segmentation data set;

performing part-of-speech tagging on each word in the word segmentation data set to obtain a standard text set;

and extracting a text keyword set from the standard text set.

Optionally, the extracting a target keyword set from the candidate keyword set based on a graph sorting algorithm includes:

constructing a directed weighted graph according to the candidate keyword set;

calculating the weights of a plurality of nodes in the directed weighted graph according to a preset weight calculation formula;

and summarizing nodes with the weight exceeding a preset threshold value in the directed weighted graph as target keywords of the candidate keyword set to obtain a target keyword set.

Optionally, the preset weight calculation formula includes:

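$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

wherein WS(V_i) represents the weight of node V_i, d is the damping coefficient, In(V_i) is the first set of nodes, which point to node V_i, Out(V_j) is the second set of nodes, to which node V_j points, and w_ji is the weight of the connection between nodes V_j and V_i.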

Optionally, the calculating a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula includes:

acquiring a preset time multiplier factor and a preset link frequency factor, and respectively carrying out normalization processing on the time multiplier factor and the link frequency factor to obtain a time normalization factor and a frequency normalization factor;

calculating a second attribution probability value corresponding to each first attribution probability value in the first attribution probability value set according to the time normalization factor, the frequency normalization factor and a preset attribution probability formula;

and summarizing the calculated second attribution probability value to obtain a second attribution probability value set.

Optionally, the preset attribution probability formula includes:

wherein P_final is the second attribution probability value, P is the first attribution probability value, F^* is the time normalization factor, and url_i^* is the frequency normalization factor.

Optionally, the preprocessing the text to be classified includes:

and performing text error correction processing on the text to be classified.

In order to solve the above problem, the present invention also provides a text classification apparatus, including:

the text keyword set extraction module is used for acquiring a historical text set and extracting a text keyword set of the historical text set, wherein the text keyword set comprises a plurality of text keyword subsets for identifying a plurality of text categories;

the to-be-classified text preprocessing module is used for acquiring a text to be classified and preprocessing the text to be classified to obtain a standard text to be classified;

the target keyword set extraction module is used for screening keywords with preset parts of speech in the standard text to be classified to obtain a candidate keyword set, and extracting the target keyword set from the candidate keyword set based on a graph sorting algorithm;

a category determination module, configured to determine whether a text keyword subset matching the target keyword set exists in the plurality of text keyword subsets; if so, determine the text category corresponding to the text keyword subset matching the target keyword set as the category of the text to be classified; if not, calculate, by using a preset attribution probability model, first attribution probability values of the target keyword set corresponding to the plurality of text categories respectively to obtain a first attribution probability value set, calculate a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula, and determine the text category corresponding to the maximum second attribution probability value in the second attribution probability value set as the category of the text to be classified.

In order to solve the above problem, the present invention also provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the text classification method described above.

In order to solve the above problem, the present invention also provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the text classification method described above.

According to the embodiment of the invention, the target keyword set is extracted from the candidate keyword set based on the graph sorting algorithm, and the graph sorting algorithm is combined with the associated text information of specific context in the candidate keyword set, so that the accuracy of extracting the target keyword set is improved, and the accuracy of text classification is further improved; and judging whether the text keyword subsets matched with the target keyword set exist in the text keyword subsets, if not, calculating a probability value by using a preset attribution probability model to classify, and solving the problem of how to classify when the text keyword subsets are not matched through the attribution probability model, thereby further improving the accuracy of classification. Therefore, the text classification method, the text classification device and the computer readable storage medium provided by the invention can solve the problem of low accuracy of text classification.

Drawings

Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present invention;

Fig. 2 is a block diagram of a text classification apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic internal structural diagram of an electronic device implementing a text classification method according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment of the invention provides a text classification method, and an execution subject of the text classification method includes but is not limited to at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiment of the application. In other words, the text classification method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a block chain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.

Fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the present invention. In this embodiment, the text classification method includes:

S1, obtaining a historical text set, and extracting a text keyword set of the historical text set, wherein the text keyword set comprises a plurality of text keyword subsets for identifying a plurality of text categories.

In an embodiment of the present invention, the history text set may include a chat conversation in which the user consults different financial managers about a financial product.

In an alternative embodiment of the present invention, the historical text set may be text information obtained from the server for a period of time, for example, a chat conversation between the user and a different financial manager in the past half year obtained from the server.

In another optional embodiment of the present invention, the obtaining the historical text set includes:

acquiring an original image and carrying out main body interception processing on the original image to obtain a main body image;

and performing text extraction on the main body image by using a preset text extraction model to obtain a historical text set.

The original image is a chat screenshot of a user and a financial manager about a financial product discussion, and the preset text extraction model can be an NLP language model.

In detail, in the embodiment of the present invention, the main body intercepting process performed on the original image is to intercept an area related to text information for classification (for example, in a chat between a user and a financial manager, only an area of a chat conversation related to a financial product is intercepted), so as to avoid extracting data that does not belong to a chat conversation text, and reduce redundancy of information.

Specifically, the extracting a text keyword set of the historical text set includes:

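carrying out sentence segmentation processing on the historical text set, using the period as the delimiter, to obtain an initial sentence set;

carrying out stop word removal processing on each sentence in the initial sentence set to obtain a stop sentence set;

performing word segmentation processing on each sentence in the stop sentence set to obtain a word segmentation data set;

performing part-of-speech tagging on each word in the word segmentation data set to obtain a standard text set;

and extracting a text keyword set from the standard text set.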

In detail, the stop word processing is to remove words without actual meanings in the initial sentence set, such as "a", "d", and the like, by using a preset stop word table.

The stop word list may be an existing list obtained directly, such as the Harbin Institute of Technology stop word list or the stop word list of the Machine Intelligence Laboratory of Sichuan University, or it may be a custom preset list.

Further, in one embodiment of the present invention, a Jieba tool may be used to perform word segmentation on the stop sentence set, and each sentence in the stop sentence set is divided into words to obtain a word segmentation data set.

Specifically, the part-of-speech tagging is to tag words in the word data set with parts-of-speech such as verbs, nouns, and adjectives.
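The four preprocessing steps described above (sentence segmentation, stop word removal, word segmentation and part-of-speech tagging) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: it works on whitespace-delimited English text instead of Chinese, and the stop word table `STOP_WORDS`, the lexicon `POS_LEXICON` and all function names are hypothetical stand-ins for a real stop word list and a real segmentation and tagging tool such as Jieba.

```python
import re

# Hypothetical stop word table and part-of-speech lexicon, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}
POS_LEXICON = {"fund": "n", "product": "n", "link": "n", "buy": "v", "safe": "a"}


def split_sentences(text):
    """Split text into sentences at full stops (Western '.' or Chinese '。')."""
    return [s.strip() for s in re.split(r"[.。]", text) if s.strip()]


def remove_stop_words(sentence):
    """Drop stop words from a whitespace-delimited sentence."""
    return " ".join(w for w in sentence.split() if w.lower() not in STOP_WORDS)


def segment(sentence):
    """Toy word segmentation: lowercase and split on whitespace."""
    return sentence.lower().split()


def tag(words):
    """Attach a part-of-speech tag to each word; 'x' marks unknown words."""
    return [(w, POS_LEXICON.get(w, "x")) for w in words]


def preprocess(text):
    """Run the four steps in order and return one tagged word list per sentence."""
    sentences = split_sentences(text)
    stopped = [remove_stop_words(s) for s in sentences]
    return [tag(segment(s)) for s in stopped]


standard_text = preprocess("The fund is safe. Buy the product.")
```

Each inner list of `standard_text` corresponds to one sentence, and each entry pairs a word with its tag, which is the form the later part-of-speech screening step would consume.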

In the embodiment of the invention, the extracted text keyword set also comprises a plurality of text keyword subsets, and different text keyword subsets belong to different text categories. Specifically, when extracting the historical text set, the extracted different text keyword sets may be pre-classified, or the text keywords may be extracted according to the pre-classification. That is, the extracted text keywords may be extracted first and then classified, each category corresponds to one text keyword subset, or a preset category may be obtained first and then the text keyword subsets corresponding to the categories are extracted according to the preset category.

For example, when extracting according to time, the text keywords extracted in each time period are grouped into their own category, or text keywords with identical or similar content extracted at different times are grouped into one category.

For another example, when extracting according to chat partner, the text keywords are classified into different categories according to the chat partner. For example, when the chat contents of three financial managers A, B and C are extracted, the text keywords extracted from the chat records of the user and financial manager A are classified into one category, the text keywords extracted from the chat records of the user and financial manager B are classified into another category, and the text keywords extracted from the chat records of the user and financial manager C are classified into a third category.
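Grouping keywords by chat partner, so that each financial manager corresponds to one text category and one text keyword subset, might look like the sketch below. The record layout (a list of `(manager, keywords)` pairs) and the function name `build_keyword_subsets` are assumptions made for illustration and do not come from the source.

```python
from collections import defaultdict


def build_keyword_subsets(records):
    """Group keywords by financial manager; each manager is one text category."""
    subsets = defaultdict(set)
    for manager, keywords in records:
        subsets[manager].update(keywords)
    return dict(subsets)


# Hypothetical keywords already extracted from individual chat records.
records = [
    ("A", ["fund", "purchase"]),
    ("B", ["insurance"]),
    ("A", ["preferred link"]),
    ("C", ["deposit", "rate"]),
]
subsets = build_keyword_subsets(records)
```

Here `subsets["A"]` collects the keywords from both chat records with manager A, so later matching only needs to compare a target keyword set against one subset per category.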

S2, acquiring the text to be classified.

In the embodiment of the invention, the text to be classified may be a chat text about a financial product between a user and a financial manager, for which the corresponding financial manager needs to be determined.

S3, preprocessing the text to be classified to obtain a standard text to be classified.

In the embodiment of the present invention, the preprocessing the text to be classified includes:

and performing text error correction processing on the text to be classified.

In detail, the text error correction processing includes deleting wrongly written words in the text to be classified; alternatively, it may include correcting erroneous text, for example homophone errors in which a character has been replaced by a different character with the same pronunciation.

S4, screening keywords with preset parts of speech in the standard text to be classified to obtain a candidate keyword set, and extracting a target keyword set from the candidate keyword set based on a graph sorting algorithm.

In the embodiment of the invention, the step of screening out the keywords with the preset parts of speech from the standard text to be classified is to reserve the words corresponding to the preset parts of speech in the standard text to be classified, delete the words with the other parts of speech, and summarize to obtain the candidate keyword set.

In the embodiment of the present invention, the predetermined part of speech is a noun.
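As a rough illustration of the part-of-speech screening, the sketch below keeps only the words tagged as nouns. The tag name `"n"`, the `(word, tag)` input format and the function name `candidate_keywords` are assumptions; they mirror common tagger output but are not prescribed by the source.

```python
def candidate_keywords(tagged_words, keep_pos=("n",)):
    """Keep words whose tag is in keep_pos, preserving order and dropping repeats."""
    seen = set()
    candidates = []
    for word, pos in tagged_words:
        if pos in keep_pos and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates


tagged = [("buy", "v"), ("fund", "n"), ("safe", "a"), ("product", "n"), ("fund", "n")]
candidates = candidate_keywords(tagged)
```

Passing a different `keep_pos` tuple would retain other parts of speech, should the preset part of speech be configured differently.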

Specifically, the extracting a target keyword set from the candidate keyword set based on a graph sorting algorithm includes:

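constructing a directed weighted graph according to the candidate keyword set, wherein a node in the directed weighted graph represents a candidate keyword in the candidate keyword set;

calculating the weights of a plurality of nodes in the directed weighted graph according to a preset weight calculation formula;

and summarizing nodes with the weight exceeding a preset threshold value in the directed weighted graph as target keywords of the candidate keyword set to obtain a target keyword set.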

In detail, the preset weight calculation formula includes:

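$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

wherein WS(V_i) represents the weight of node V_i, d is the damping coefficient, In(V_i) is the first set of nodes, which point to node V_i, Out(V_j) is the second set of nodes, to which node V_j points, and w_ji is the weight of the connection between nodes V_j and V_i.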

The damping coefficient d represents the probability of jumping from a given node to any other node in the directed weighted graph; preferably, the value of the damping coefficient is 0.85.
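A minimal sketch of a graph ranking computation of this kind is given below, assuming a TextRank-style iteration with damping coefficient d = 0.85. The way the graph is built (a co-occurrence window of two words, with an edge added in both directions), the fixed iteration count, the threshold of 1.0 and all function names are assumptions chosen for illustration; the source does not specify them.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Link each word to the words that follow it within `window` positions.

    Returns {source: {target: weight}}; an edge is added in both directions,
    and the weight counts how often the two words co-occur.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, source in enumerate(words):
        for target in words[i + 1 : i + 1 + window]:
            if source != target:
                graph[source][target] += 1.0
                graph[target][source] += 1.0
    return graph


def textrank(graph, d=0.85, iterations=50):
    """Iterate WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)."""
    scores = {node: 1.0 for node in graph}
    out_weight = {node: sum(edges.values()) for node, edges in graph.items()}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            incoming = sum(
                graph[other][node] / out_weight[other] * scores[other]
                for other in graph
                if node in graph[other]
            )
            updated[node] = (1 - d) + d * incoming
        scores = updated
    return scores


def target_keywords(words, threshold=1.0):
    """Keep the candidates whose weight exceeds the threshold."""
    scores = textrank(build_graph(words))
    return {word for word, score in scores.items() if score > threshold}


words = ["fund", "product", "link", "fund", "purchase"]
scores = textrank(build_graph(words))
keywords = target_keywords(words)
```

In this toy input "fund" co-occurs with every other word and therefore receives the largest weight, while "purchase", which appears once at the end, receives the smallest.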

S5, judging whether a text keyword subset matched with the target keyword set exists in the plurality of text keyword subsets.

In the embodiment of the invention, the target keyword set can be matched against the text keyword subsets one by one in a preset order, or any one of the text keyword subsets can be selected at random and matched against the target keyword set.

Specifically, when the target keyword set is matched with the text keyword subset, one or more keywords in the target keyword set are respectively matched with one or more keywords in the text keyword subset.

In an optional embodiment of the present invention, when all keywords in the target keyword set exist in the text keyword subset, it is determined that the text keyword subset matches the target keyword set; otherwise, it is determined that the text keyword subset does not match the target keyword set. Alternatively, when several of the keywords in the target keyword set exist in the text keyword subset, it is determined that the text keyword subset matches the target keyword set; otherwise, it is determined that the text keyword subset does not match the target keyword set.

In another optional embodiment of the present invention, when all keywords in the text keyword subset exist in the target keyword set, it is determined that the text keyword subset matches the target keyword set; otherwise, it is determined that the text keyword subset does not match the target keyword set. Alternatively, when several of the keywords in the text keyword subset exist in the target keyword set, it is determined that the text keyword subset matches the target keyword set; otherwise, it is determined that the text keyword subset does not match the target keyword set.

When a text keyword subset matching the target keyword set is found, it is determined that a matching text keyword subset exists in the plurality of text keyword subsets; when no such subset is found, it is determined that no matching text keyword subset exists in the plurality of text keyword subsets.

S6, if so, determining the text category corresponding to the text keyword subset matched with the target keyword set as the category of the text to be classified.

For example, a certain text keyword subset M in the text keyword set comprises the keywords "fund", "product", "preferred attack combination", "preferred link" and "purchase". The target keyword set comprises the keywords "preferred link", "20,000 yuan" and "preferred attack combination". Because the target keywords "preferred link" and "preferred attack combination" exist in the text keyword subset M, the text keyword subset M is determined to match the target keyword set, so a text keyword subset matching the target keyword set exists in the plurality of text keyword subsets, and the text category corresponding to the text keyword subset M is determined to be the category of the text to be classified.
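The two matching rules described for step S5 (all target keywords present, or at least several present) can be sketched as below. The parameter `min_shared`, the function names and the keyword strings, which are modelled loosely on the example above, are illustrative assumptions.

```python
def matches(target, subset, min_shared=None):
    """Return True when the target keyword set matches the subset.

    With min_shared=None every target keyword must appear in the subset;
    otherwise at least `min_shared` keywords must be shared.
    """
    shared = set(target) & set(subset)
    if min_shared is None:
        return shared == set(target)
    return len(shared) >= min_shared


def find_matching_category(target, subsets, min_shared=None):
    """Check the subsets in their stored order; return the first matching category or None."""
    for category, subset in subsets.items():
        if matches(target, subset, min_shared):
            return category
    return None


subset_m = {"product", "preferred attack combination", "preferred link", "purchase"}
target = {"preferred link", "20,000 yuan", "preferred attack combination"}

strict = matches(target, subset_m)                # "20,000 yuan" is missing from M
relaxed = matches(target, subset_m, min_shared=2)  # two keywords are shared
category = find_matching_category(target, {"M": subset_m}, min_shared=2)
```

Under the strict rule the example does not match, because one target keyword is absent from M; under the relaxed rule with two shared keywords it does, which corresponds to the outcome described in the example.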

S7, if not, calculating first attribution probability values of the target keyword set corresponding to the plurality of text categories respectively by using a preset attribution probability model to obtain a first attribution probability value set, calculating a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula, and determining the text category corresponding to the maximum second attribution probability value in the second attribution probability value set as the category of the text to be classified.

The attribution probability model may be a BERT model (Bidirectional Encoder Representations from Transformers, a deep bidirectional encoding model).
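The overall branch across steps S5 to S7 might be sketched as follows. Both callables are placeholders: `first_probability` stands in for the attribution probability model and `adjust` stands in for the preset attribution probability formula, whose exact form is not reproduced here. The fixed scores and the per-category multipliers in the usage lines are invented purely to exercise the code.

```python
def classify(target, subsets, first_probability, adjust):
    """Return the category of a text given its target keyword set.

    subsets: {category: set of keywords}
    first_probability: callable(target, category) -> float (stands in for the model)
    adjust: callable(probability, category) -> float (stands in for the preset formula)
    """
    # S5/S6: direct keyword match (here: all target keywords are in the subset).
    for category, subset in subsets.items():
        if set(target) <= set(subset):
            return category
    # S7: no match, so score every category and take the largest adjusted value.
    first = {c: first_probability(target, c) for c in subsets}
    second = {c: adjust(p, c) for c, p in first.items()}
    return max(second, key=second.get)


subsets = {"A": {"fund", "purchase"}, "B": {"insurance", "claim"}}
model_scores = {"A": 0.4, "B": 0.6}   # hypothetical first attribution probabilities
multipliers = {"A": 0.9, "B": 0.5}    # hypothetical combined adjustment per category

matched = classify({"fund"}, subsets, lambda t, c: model_scores[c], lambda p, c: p)
fallback = classify(
    {"loan"},
    subsets,
    lambda t, c: model_scores[c],
    lambda p, c: p * multipliers[c],
)
```

In the fallback call the model alone would favour category B, but the adjusted values favour A, which illustrates why the maximum is taken over the second attribution probability values and not over the first.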

Specifically, the calculating of the second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula includes:

In detail, the normalizing the time multiplier factor and the link frequency factor respectively to obtain a time normalization factor and a frequency normalization factor includes:

the following calculation is performed by using a preset normalization formula:

wherein F^* is the time normalization factor, url_i^* is the frequency normalization factor, F is the time multiplier factor, T is the time interval between the keywords in the target keyword set and the keywords in the text keyword subset, i is a keyword in the target keyword set, and url_i is the number of times keyword i appears in the target keyword set.

For example, T is the time interval between the user's conversation with the financial manager about the financial product and the user's eventual purchase of the financial product, i is the link to the financial product, and url_i is the number of times the link to the financial product occurs.

Then

Further, the calculating, according to the time normalization factor, the frequency normalization factor and a preset attribution probability formula, a second attribution probability value corresponding to each first attribution probability value in the first attribution probability value set includes:

the preset attribution probability formula comprises:

According to the embodiment of the invention, the target keyword set is extracted from the candidate keyword set based on the graph sorting algorithm, and the graph sorting algorithm is combined with the associated text information of specific context in the candidate keyword set, so that the accuracy of extracting the target keyword set is improved, and the accuracy of text classification is further improved; and judging whether the text keyword subsets matched with the target keyword set exist in the text keyword subsets, if not, calculating a probability value by using a preset attribution probability model to classify, and solving the problem of how to classify when the text keyword subsets are not matched through the attribution probability model, thereby further improving the accuracy of classification. Therefore, the text classification method provided by the invention can solve the problem of low accuracy of text classification.

Fig. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present invention.

The text classification device 100 of the present invention can be installed in an electronic device. According to the realized functions, the text classification device 100 can comprise a text keyword set extraction module 101, a text preprocessing module 102 to be classified, a target keyword set extraction module 103 and a category determination module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.

In the present embodiment, the functions regarding the respective modules/units are as follows:

the text keyword set extraction module 101 is configured to obtain a historical text set, and extract a text keyword set of the historical text set, where the text keyword set includes a plurality of text keyword subsets that identify a plurality of text categories;

the text to be classified preprocessing module 102 is configured to obtain a text to be classified, and preprocess the text to be classified to obtain a standard text to be classified;

the target keyword set extraction module 103 is configured to filter keywords of a preset part of speech in the standard text to be classified to obtain a candidate keyword set, and extract the target keyword set from the candidate keyword set based on a graph sorting algorithm;

the category determining module 104 is configured to determine whether a text keyword subset matching the target keyword set exists in the plurality of text keyword subsets; if so, determine the text category corresponding to the text keyword subset matching the target keyword set as the category of the text to be classified; if not, calculate, by using a preset attribution probability model, first attribution probability values of the target keyword set corresponding to the plurality of text categories respectively to obtain a first attribution probability value set, calculate a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula, and determine the text category corresponding to the maximum second attribution probability value in the second attribution probability value set as the category of the text to be classified.

The text keyword set extraction module 101 is configured to obtain a historical text set, and extract a text keyword set of the historical text set, where the text keyword set includes a plurality of text keyword subsets that identify a plurality of text categories.

Specifically, the text keyword set extraction module 101 includes:

a history text acquisition unit for acquiring a history text set;

and the historical text processing unit is used for extracting the text keyword set of the historical text set.

The history text acquisition unit is specifically configured to:

Specifically, the history text processing unit is specifically configured to:

and extracting a text keyword word set of the standard text set.

The text to be classified preprocessing module 102 is configured to obtain a text to be classified.

The text to be classified preprocessing module 102 is further configured to preprocess the text to be classified to obtain a standard text to be classified.

and performing text error correction processing on the text to be classified.

In detail, the text error correction processing comprises deleting obviously wrongly written words in the text to be classified; alternatively, it may include correcting erroneous text, for example homophone errors in which a character has been replaced by a different character with the same pronunciation.

The target keyword set extraction module 103 is configured to filter keywords of a preset part of speech in the standard text to be classified to obtain a candidate keyword set, and extract the target keyword set from the candidate keyword set based on a graph sorting algorithm.

constructing a directed weighted graph according to the candidate keyword set;

In detail, the preset weight calculation formula includes:

The category determining module 104 is configured to determine whether a text keyword subset matching the target keyword set exists in the plurality of text keyword subsets.

The category determining module 104 is further configured to determine, if there is a text keyword subset matching the target keyword set, a text category corresponding to the text keyword subset matching the target keyword set as the category of the text to be classified.

The category determining module 104 is further configured to, if there is no text keyword subset matching the target keyword set, calculate, by using a preset attribution probability model, first attribution probability values of the target keyword set corresponding to the plurality of text categories, respectively, to obtain a first attribution probability value set, calculate, according to the first attribution probability value set and a preset attribution probability formula, a second attribution probability value set, and determine that a text category corresponding to a maximum second attribution probability value in the second attribution probability value set is the category of the text to be categorized.

the following calculation is performed by using a preset normalization formula:

Then

the preset attribution probability formula comprises:

According to the embodiment of the invention, the target keyword set is extracted from the candidate keyword set based on the graph sorting algorithm, and the graph sorting algorithm is combined with the associated text information of specific context in the candidate keyword set, so that the accuracy of extracting the target keyword set is improved, and the accuracy of text classification is further improved; and judging whether the text keyword subsets matched with the target keyword set exist in the text keyword subsets, if not, calculating a probability value by using a preset attribution probability model to classify, and solving the problem of how to classify when the text keyword subsets are not matched through the attribution probability model, thereby further improving the accuracy of classification. Therefore, the text classification device provided by the invention can solve the problem of low accuracy of text classification.

Fig. 3 is a schematic structural diagram of an electronic device implementing the text classification method according to the present invention.

The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a text classification program 12, stored in the memory 11 and executable on the processor 10.

The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the text classification program 12, but also to temporarily store data that has been output or is to be output.

The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device 1: it connects the various components of the whole electronic device by using various interfaces and lines, and executes the various functions of the electronic device 1 and processes its data by running or executing programs or modules (e.g., the text classification program 12) stored in the memory 11 and calling data stored in the memory 11.

The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.

Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.

For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

Further, the electronic device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.

Optionally, the electronic device 1 may further comprise a user interface, which may include a display and an input unit (such as a keyboard), and optionally a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

The memory 11 of the electronic device 1 stores a text classification program 12 that is a combination of instructions that, when executed in the processor 10, implement:

acquiring a text to be classified;

if a text keyword subset matching the target keyword set exists, determining the text category corresponding to the text keyword subset matched with the target keyword set as the category of the text to be classified;

if no text keyword subset matching the target keyword set exists, calculating, by using a preset attribution probability model, first attribution probability values of the target keyword set corresponding to the plurality of text categories respectively to obtain a first attribution probability value set, calculating a second attribution probability value set according to the first attribution probability value set and a preset attribution probability formula, and determining the text category corresponding to the maximum second attribution probability value in the second attribution probability value set as the category of the text to be classified.
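The instructions above can be sketched as the following decision flow. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `classify`, the interpretation of "matching" as the target set being contained in a category's keyword subset, the stand-in scoring function `overlap`, and the use of plain sum-normalization for the second attribution probability values are all hypothetical, since the preset model and formulas are not reproduced in this passage.

```python
def classify(target_keywords, category_keywords, first_probability):
    """Return the category for a target keyword set (illustrative sketch).

    category_keywords: dict mapping category name -> set of keywords.
    first_probability: callable(target, category) -> float, a stand-in
        for the preset attribution probability model.
    """
    target = set(target_keywords)

    # Step 1: direct match against a category's keyword subset.
    # "Match" is assumed to mean the target set is contained in the subset.
    for category, subset in category_keywords.items():
        if target and target <= subset:
            return category

    # Step 2: no match, so score every category with the stand-in model.
    first = {c: first_probability(target, c) for c in category_keywords}

    # Step 3: derive the second values; sum-normalization is an assumption.
    total = sum(first.values())
    second = {c: (v / total if total else 0.0) for c, v in first.items()}

    # Step 4: pick the category with the largest second value.
    return max(second, key=second.get)


# Hypothetical example data, for illustration only.
CATEGORIES = {
    "loans": {"loan", "interest", "repayment"},
    "cards": {"card", "limit", "annual", "fee"},
}


def overlap(target, category):
    # Hypothetical first-stage score: share of target keywords in the category.
    return len(target & CATEGORIES[category]) / max(len(target), 1)
```

Calling `classify(["loan", "interest"], CATEGORIES, overlap)` takes the direct-match branch, while `classify(["card", "fee", "loan"], CATEGORIES, overlap)` falls through to the probability branch because no single subset contains all three keywords.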

Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile, and may include, for example: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).

The present invention also provides a computer-readable storage medium, which stores a computer program that, when executed by a processor of an electronic device, can implement:

acquiring a text to be classified;

Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims should not be construed as limiting the claims concerned.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A method of text categorization, the method comprising:

acquiring a text to be classified;

2. The method of text categorization as claimed in claim 1, wherein said extracting a set of text keywords of said set of historical text comprises:

and extracting a text keyword word set of the standard text set.

3. The method of text categorization as claimed in claim 1, wherein the extracting a set of target keywords from the set of candidate keywords based on a graph ordering algorithm comprises:

constructing a directed weighted graph according to the candidate keyword set;

4. The text classification method according to claim 3, characterized in that the preset weight calculation formula comprises:

5. The method of text classification according to claim 1, wherein said calculating a second attribution probability value set according to said first attribution probability value set and a preset attribution probability formula comprises:

6. The text classification method according to claim 5, characterized in that the preset attribution probability formula comprises:

7. The text classification method according to any one of claims 1 to 6, characterized in that the preprocessing the text to be classified comprises:

and performing text error correction processing on the text to be classified.

8. A text categorization apparatus, characterized in that the apparatus comprises:

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the method of text categorization according to any of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method of text categorization according to any of the claims 1 to 7.