CN113254643A - Text classification method and device, electronic equipment and - Google Patents

Text classification method and device, electronic equipment and

Info

Publication number
CN113254643A
CN113254643A
Authority
CN
China
Prior art keywords
text
weight
classified
classification
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591719.7A
Other languages
Chinese (zh)
Other versions
CN113254643B (en)
Inventor
张启坤
吴瑧志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202110591719.7A priority Critical patent/CN113254643B/en
Publication of CN113254643A publication Critical patent/CN113254643A/en
Application granted granted Critical
Publication of CN113254643B publication Critical patent/CN113254643B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 Clustering; Classification (G06F16/00 Information retrieval; database structures therefor; file system structures therefor; G06F16/30 of unstructured textual data)
    • G06F16/367 Ontology (G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
    • G06F40/194 Calculation of difference between files (G06F40/00 Handling natural language data; G06F40/10 Text processing)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
    • G06F40/30 Semantic analysis (G06F40/00 Handling natural language data)

Abstract

The application discloses a text classification method and apparatus, and an electronic device. The text classification method includes the following steps: extracting key phrases from the text to be classified; when there are, among the key phrases in the text to be classified, N key phrases that match character strings of vocabulary nodes in a preset knowledge graph, determining the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are positive integers; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, determining that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications. The method and apparatus can improve the efficiency of text classification.

Description

Text classification method and device, electronic equipment and
Technical Field
The application belongs to the technical field of computers, and particularly relates to a text classification method and device and electronic equipment.
Background
With the proliferation of mobile communication devices, more and more unstructured or semi-structured data resources (e.g., text information) are emerging. How to accurately and efficiently determine the category of such unstructured or semi-structured text information has become an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present application aim to provide a text classification method, a text classification apparatus, and an electronic device that can accurately and efficiently determine the category of text information.
In order to solve the technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a text classification method, including:
extracting key phrases from the text to be classified;
when N of the key phrases in the text to be classified match character strings of vocabulary nodes in a preset knowledge graph, determining the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1;
and when the confidence of a target text classification among the M text classifications is greater than a first threshold, determining that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
a first extraction module, configured to extract key phrases from the text to be classified;
a first determining module, configured to, when N of the key phrases in the text to be classified match character strings of vocabulary nodes in a preset knowledge graph, determine the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1;
a second determining module, configured to determine that the text to be classified belongs to a target text classification when the confidence of the target text classification among the M text classifications is greater than a first threshold, the target text classification being the text classification with the highest confidence among the M text classifications.
In a third aspect, an embodiment of the present application provides an electronic device including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions that, when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip including a processor and a communication interface coupled to the processor, where the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the present application, key phrases are extracted from the text to be classified; when N of those key phrases match character strings of vocabulary nodes in a preset knowledge graph, the confidences corresponding to M text classifications are determined according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, the text to be classified is determined to belong to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications. In this way, the efficiency of text classification can be improved.
Drawings
Fig. 1 is a flowchart of a text classification method provided in an embodiment of the present application;
fig. 2a is an architecture diagram of a text classification method provided in an embodiment of the present application;
fig. 2b is a flowchart of extracting a keyword group from a text to be classified in a text classification method according to an embodiment of the present application;
fig. 2c is a schematic diagram illustrating a relationship between word vectors of keywords and word vectors of keyword groups in a text classification method according to an embodiment of the present application;
fig. 2d is a structural diagram of a preset knowledge graph in a text classification method according to an embodiment of the present application;
fig. 2e is a flowchart of extracting a keyword group from historical classified text in a text classification method according to an embodiment of the present application;
fig. 2f is a schematic diagram illustrating a relationship between a keyword and a keyword group in a text classification method according to an embodiment of the present application;
FIG. 3 is a flowchart of another text classification method provided by an embodiment of the present application;
fig. 4 is a structural diagram of a text classification apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements, and do not necessarily describe a particular sequence or chronological order. It should be appreciated that data so termed may be interchanged under appropriate circumstances, so that embodiments of the application can be practiced in sequences other than those illustrated or described herein. Moreover, the terms "first", "second", and the like do not limit the number of objects; for example, the first object may be one or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The embodiments of the present application are applied to text classification, where the text to be classified may be unstructured or semi-structured text information, for example: short messages, speech recognition results from telephone calls, WeChat messages, e-mails, and the like. After the text is classified, staff or electronic equipment can conveniently perform processing targeted to the classification result of the text, so that useful information can be acquired from the text promptly and accurately.
In the related art, unstructured or semi-structured text information may be classified so that the classified text can be processed in a targeted manner. For example, the classification of events mostly depends on the empirical analysis of staff. The general classification process is as follows: 1) summarize, from experience, the high-frequency vocabularies of different classifications; 2) search for the high-frequency words in the text to be classified to roughly screen it; 3) finely classify the text to be classified by comparing it with the keywords of already-classified texts, dividing different texts into specific classification results; 4) route the texts to be classified according to the fine classification results, so that the texts corresponding to each classification result can be analyzed for key problems in a targeted manner.
It can be seen that the processing flow of the text classification method in the related art is cumbersome and inefficient.
For convenience of description, the embodiments of the present application are described below by taking event text as an example of the text to be classified:
For example: with the popularization of mobile communication devices, more and more events reach information processing departments in unstructured or semi-structured form, via telephone calls, short messages, WeChat messages, e-mails, and the like. Information processing departments therefore receive a large amount of text data every day, and these colloquial text data resources may contain much valuable event information, such as: person names, license plate numbers, addresses, and so on. However, it is often difficult for a worker to obtain accurate and effective event information from such a large amount of unstructured data, so analyzing and extracting key information from large volumes of text data is a topic of great research significance and application value.
In current event classification, the high-frequency words of various events are summarized based on the working experience of staff, so that when a certain high-frequency word is found in a text to be classified, the event classification of that text is determined to be the event classification corresponding to the high-frequency word.
However, many empirically summarized high-frequency words are colloquial; it is difficult to exhaust the various similar expressions of a high-frequency word, and such words cannot represent the event information well, resulting in incomplete event screening. For example: "electric tricycle stolen" and "electric bicycle stolen" describe similar event content, but their colloquial expressions differ, so a worker cannot exhaust the many possible expressions of an event from experience, and the event content in a colloquial event text is easily missed.
As can be seen from the above, in the related art it is difficult for staff to accurately and timely judge the nature (category) of the large number of specific events generated every day, and therefore difficult to take measures corresponding to the event category. The text classification method provided by the embodiments of the present application can be applied to the classification of such events, and can improve the efficiency and reliability of event classification.
The text classification method, the text classification device, the electronic device, and the readable storage medium according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text classification method according to an embodiment of the present application. As shown in fig. 1, the method may include the following steps:
Step 101, extract key phrases from the text to be classified.
In an implementation, the keyword group refers to a combination of at least two keywords whose positions in the text to be classified are close to each other. For example: the keyword group includes a noun and a verb, and the position of the noun in the text to be classified lies within the Z characters following the position of the verb.
Compared with a single keyword, a keyword group can express more accurate and complete semantics. For example: assume that only keyword strings are matched. Event A, "found wallet stolen on the subway", and event B, "apples stolen from the bicycle basket", both match the string of the keyword "stolen", yet event A concerns a stolen wallet while event B concerns stolen fruit, and the two are completely different in nature. In the embodiments of the present application, the keyword group wallet/handbag/money + stolen can characterize event A, and the keyword group apple/fruit basket/food + stolen can characterize event B, so the matching result obtained by string matching based on keyword groups is more accurate.
In practice, the above-mentioned key phrases may be determined by the following process:
performing word segmentation and stop-word removal on the text to be classified to obtain a plurality of segmented words, and determining the frequency of occurrence of each segmented word in the text through word-frequency statistics;
determining the segmented words whose frequency of occurrence is greater than a preset frequency as keywords;
grouping the keywords by part of speech, for example, dividing the keywords into verbs and nouns;
combining pairs of keywords that have different parts of speech and similar frequencies of occurrence to obtain a plurality of keyword combinations;
searching the text to be classified for each keyword combination, and determining a keyword combination to be a keyword group when the position information of its two keywords satisfies a preset condition. The position information of the two keywords satisfies the preset condition when: the order of the two keywords is as expected (for example, noun before verb, or verb before noun), and the number of characters between the two keywords is small (for example, fewer than 5 or 10 characters).
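As a rough illustration of the extraction steps above (all function and variable names are hypothetical; a real implementation would use a Chinese word segmenter, while here the text is assumed to be already tokenized and tagged), the keyword-group screening could be sketched as:

```python
from collections import Counter

def extract_keyword_groups(tokens, pos_of, min_freq=2, max_gap=5):
    """Count token frequencies, keep words above a preset frequency as
    keywords, split them by part of speech, pair nouns with verbs, and keep
    only pairs that co-occur within max_gap tokens of each other."""
    freq = Counter(tokens)
    keywords = [w for w, c in freq.items() if c >= min_freq]
    nouns = [w for w in keywords if pos_of.get(w) == "noun"]
    verbs = [w for w in keywords if pos_of.get(w) == "verb"]
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in keywords}

    groups = []
    for n in nouns:
        for v in verbs:
            # reverse verification: the pair must actually appear close together
            if any(0 < abs(pn - pv) <= max_gap
                   for pn in positions[n] for pv in positions[v]):
                groups.append((n, v))
    return groups
```

For instance, with `tokens = ["wallet", "stolen", "on", "subway", "wallet", "stolen"]` and `pos_of = {"wallet": "noun", "stolen": "verb"}`, the sketch yields the single keyword group `("wallet", "stolen")`.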
Step 102, when there are, among the key phrases in the text to be classified, N key phrases that match character strings of vocabulary nodes in a preset knowledge graph, determine the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1.
The M text classifications are in one-to-one correspondence with the M sub-graphs, and the character strings of the vocabulary nodes of each of the M sub-graphs match at least one of the N key phrases.
Determining the confidences of the M text classifications according to the confidences of the N key phrases may include: for each of the M sub-graphs, determining the at least one keyword group that matches the character strings of its vocabulary nodes, and taking the maximum confidence among those matched keyword groups as the confidence of the text classification corresponding to that sub-graph.
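Sketched in code (the names `matched` and `phrase_confidence` are hypothetical), the per-sub-graph maximum described above is simply:

```python
def classification_confidences(matched, phrase_confidence):
    """For each text classification (one per sub-graph), take the highest
    confidence among the keyword groups that matched its vocabulary nodes."""
    return {
        classification: max(phrase_confidence[g] for g in groups)
        for classification, groups in matched.items()
    }
```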
In the related art, a knowledge graph is constructed from three parts: keywords, high-frequency phrases (i.e., the keyword groups), and the word vectors of the keywords and high-frequency phrases. The keywords are divided into central words and associated words: a central word is generally the word or phrase that appears most often in an event, and an associated word is generally a synonym or near-synonym of a central word. A high-frequency phrase is generally composed of two related central words. Central words, associated words, and high-frequency phrases all exist as graph nodes; each takes its own word vector as a feature attribute, and the correlation between nodes (word-vector similarity) serves as the connection weight of the adjacent edge.
In the embodiments of the present application, the preset knowledge graph may be determined from historical text classification information. Specifically, the keywords and high-frequency phrases in the preset knowledge graph are extracted from historical text classification information, which includes text information and the actual classification results of that text information. In addition, some vocabulary with a low frequency of occurrence extracted from the historical text classification information can be added to the keyword groups by manual configuration. For example: as shown in fig. 2a, after word segmentation and stop-word removal are performed on the historical text classification information to obtain its segmented vocabulary, the keywords and keyword groups in the knowledge graph may be determined based on the frequency of occurrence of the segmented vocabulary in that information, and during this process keyword groups may be added to or deleted from the knowledge graph, updating the graph accordingly. Of course, as shown in fig. 2a, in an implementation, keywords and keyword groups appearing in the text to be classified may also be added to the knowledge graph to update it.
The preset knowledge graph in the embodiments of the present application comprises a plurality of sub-graphs. Different sub-graphs take different text classifications (such as event classifications) as root nodes; the keywords and high-frequency phrases are connected in turn to form the sub-graph of the current text classification, and the sub-graphs of all text classifications together form the whole preset knowledge graph. In other words, the statement in step 102 that the M text classifications correspond to M sub-graphs in the preset knowledge graph can be expressed as: the root nodes of the M sub-graphs are in one-to-one correspondence with the M text classifications.
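One minimal way to picture this layout (the data and field names below are illustrative, not taken from the patent) is a mapping from each classification root to its vocabulary nodes, with each node carrying its word vector as a feature attribute:

```python
# Each sub-graph is rooted at a text classification; keyword and keyword-group
# nodes hang off the root, each storing its word vector as a feature attribute.
preset_knowledge_graph = {
    "theft": {
        "keywords": {"wallet": [0.1, 0.7], "stolen": [0.3, 0.5]},
        "keyword_groups": {("wallet", "stolen"): [0.4, 1.2]},
    },
    "fraud": {
        "keywords": {"transfer": [0.9, 0.2]},
        "keyword_groups": {},
    },
}

def sub_graph_of(classification):
    """Return the sub-graph whose root node is the given text classification."""
    return preset_knowledge_graph[classification]
```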
It should be noted that the statement in step 102 that there are, among the keyword groups in the text to be classified, N keyword groups matching the character strings of vocabulary nodes in the preset knowledge graph can be understood as follows: each of the N keyword groups includes at least two keywords, and those keywords are respectively the same as the keywords associated with at least two connected vocabulary nodes in the preset knowledge graph.
In addition, in an implementation, one or more keyword groups may be extracted from the text to be classified. When there are multiple keyword groups, each may be matched against the vocabulary nodes of each sub-graph, so the vocabulary nodes in one sub-graph may be matched by multiple keyword groups; that is, N may be greater than M. In this case, determining the confidences of the M text classifications according to the confidences of the N keyword groups can be understood as: for each sub-graph, the highest confidence among its matched keyword groups is taken as the confidence of that sub-graph, that is, as the confidence of the text classification corresponding to it.
Of course, in practical applications the same keyword group in the text to be classified may match vocabulary nodes in several sub-graphs, so N may also be smaller than M.
Step 103, when the confidence of a target text classification among the M text classifications is greater than a first threshold, determine that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
Step 102 may be a recursive process. In an implementation, the confidence corresponding to each text classification may be determined (it is possible that none of the vocabulary nodes in some sub-graph match any keyword group of the text to be classified, in which case the confidence of that sub-graph defaults to 0). Step 103 may then determine the text classification with the highest confidence among all text classifications in the preset knowledge graph as the target text classification, compare the confidence of the target text classification with the first threshold, and determine that the text to be classified belongs to the target text classification when that confidence is greater than the first threshold.
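The step-103 decision is an argmax followed by a threshold check. A sketch (names are hypothetical; returning `None` signals that no classification cleared the first threshold, so the optional fusion-weight fallback described later applies):

```python
def decide_classification(confidences, first_threshold):
    """Pick the text classification with the highest confidence; accept it
    only if that confidence exceeds the first threshold."""
    if not confidences:
        return None
    target = max(confidences, key=confidences.get)
    return target if confidences[target] > first_threshold else None
```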
In the embodiments of the present application, key phrases are extracted from the text to be classified; when N of those key phrases match character strings of vocabulary nodes in a preset knowledge graph, the confidences corresponding to M text classifications are determined according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, the text to be classified is determined to belong to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
In this way, since keyword groups embody semantics better, string matching between the keyword groups of the text to be classified and the vocabulary nodes of the preset knowledge graph yields matching results that better reflect the semantic relevance between the text and the vocabulary nodes. After the degree of matching between each keyword group of the text to be classified and each vocabulary node of the preset knowledge graph is determined by successive recursion, the sub-graph whose vocabulary nodes best match the keyword groups of the text can be identified, and the text can be determined to belong to the text classification corresponding to that sub-graph. Staff therefore no longer need to manually compare the text to be classified against each text classification, which improves text classification efficiency.
As an optional implementation, when the confidence of the target text classification is less than or equal to the first threshold, the method further includes:
determining the fusion weight corresponding to each sub-graph in the preset knowledge graph based on a first weight, a second weight, a third weight, and the confidence of the target text classification;
determining that the text to be classified belongs to the text classification corresponding to a target fusion weight when the target fusion weight is greater than a second threshold, the target fusion weight being the maximum among the fusion weights corresponding to the sub-graphs in the preset knowledge graph;
where the first weight characterizes the similarity between the word vectors of the keywords in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, the second weight characterizes the similarity between the word vectors of the keyword groups in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight characterizes the similarity between the keywords in the text to be classified and the character strings of the vocabulary nodes in the preset knowledge graph.
In a specific implementation, the statement that the first weight characterizes the similarity between the word vectors of the keywords in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph can be understood as meaning that the first weight is positively correlated with that similarity. Likewise, the second weight is positively correlated with the similarity between the word vectors of the keyword groups in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is positively correlated with the similarity between the keywords in the text to be classified and the character strings of the vocabulary nodes in the preset knowledge graph.
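The patent does not fix a concrete similarity measure; cosine similarity is a common choice for the word-vector similarities that the first and second weights are positively correlated with, for example:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors; returns 0.0 for a zero
    vector so the derived weight degrades gracefully."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```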
In an implementation, a keyword group may be a combination of two keywords of different parts of speech with similar frequencies of occurrence in the text to be classified. For example: as shown in fig. 2b, the text to be classified may first be preprocessed, i.e., Chinese word segmentation and stop-word removal, to obtain a set of keywords. The keywords are then classified by part of speech, for example into verbs and nouns; the keywords of each part of speech are sorted by their frequency of occurrence in the text to be classified (i.e., by word frequency), and the confidence of each keyword is computed. Two keywords of different parts of speech and similar word frequencies are then combined into a keyword combination. After keyword combinations have been screened out of the text in this forward pass, a reverse verification checks whether each combination actually appears in the text to be classified; a combination is determined to be a keyword group only if it does. Specifically, the reverse verification may search the text for the positions of the keywords of the combination and confirm the combination when the order of the two is as expected and few characters separate them.
In this embodiment, the fusion weight determined based on the first weight, the second weight, the third weight, and the confidence of the target text classification can comprehensively reflect the similarity of the keywords' word vectors, the similarity of the keyword groups' word vectors, and the similarity of the character strings between each sub-map and the text to be classified, so that the text classification result determined based on the target fusion weight is more reliable.
Optionally, the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph and a first node distance weight, the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
In a specific implementation, the word vector of the keyword group may be determined according to the word vectors and the confidences of the keywords included therein, for example: as shown in fig. 2c, assuming that the confidence of keyword a is C_a, the confidence of keyword b is C_b, the word vector of keyword a is V_a, and the word vector of keyword b is V_b, then if keyword a and keyword b form the keyword group a-b, the confidence of the keyword group is: C_a × (1 + C_b), and the word vector of the keyword group is: C_a × V_a + C_b × V_b.
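The keyword-group confidence and word vector just described can be sketched as follows (a minimal sketch with illustrative names; vectors are plain Python lists):

```python
def combine_keywords(c_a, v_a, c_b, v_b):
    """Confidence and word vector of the keyword group a-b, following the
    example in fig. 2c."""
    conf = c_a * (1 + c_b)                      # C_a * (1 + C_b)
    vec = [c_a * xa + c_b * xb                  # C_a*V_a + C_b*V_b
           for xa, xb in zip(v_a, v_b)]
    return conf, vec
```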
In addition, determining the first node distance weight based on the correlation between the second vocabulary node and the root node can be understood as follows: the correlation coefficients along the path between the second vocabulary node and the root node are multiplied together to obtain the first node distance weight, for example: as shown in fig. 2d, assuming that the second vocabulary node is the related word a, the correlation coefficient between this vocabulary node and the root node A is the product of a first correlation coefficient between the related word a and the center word B and a second correlation coefficient between the center word B and the root node A.
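The step-by-step multiplication of correlation coefficients can be sketched as below; the path order and coefficient values are illustrative:

```python
import math

def node_distance_weight(path_coefficients):
    """Multiply the correlation coefficients along the path from a matched
    vocabulary node up to the root node of its sub-map, e.g. related word a
    -> center word B -> root node A."""
    return math.prod(path_coefficients)
```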
In addition, the word vector of the keyword is matched with the word vector of the second vocabulary node in the preset knowledge graph, and it can be understood that: the similarity between the word vector of the keyword and the word vector of the second vocabulary node is the largest, in other words, the similarity between the word vector of each keyword in the text to be classified and the vocabulary of each vocabulary node in the preset knowledge graph needs to be respectively determined, and the maximum value of the similarity is selected to determine the vocabulary node corresponding to the similarity with the maximum value as the second vocabulary node. For example: the first weight may be calculated by the following formula:
G1_c = sigmoid(max{i∈(1,n), j∈(1,m) | f(V_i, W_cj)}) × sigmoid(max_id)
wherein G1_c represents the first weight; sigmoid represents normalization processing, so that the normalized value lies between 0 and 1; n is the total number of keywords in the text to be classified; V_i represents the word vector of the i-th keyword in the text to be classified; W_cj represents the word vector of the j-th vocabulary node in the c-th sub-map of the preset knowledge graph; f(V_i, W_cj) is a function that obtains the similarity between V_i and W_cj; m represents the total number of vocabulary nodes in the c-th sub-map of the preset knowledge graph; max_id denotes: when the similarity between the word vector of the i-th keyword in the text to be classified and the word vector of the j-th vocabulary node in the c-th sub-map is the maximum over all keyword/vocabulary-node pairs in the preset knowledge graph, the product of the correlation coefficients between that j-th vocabulary node and the root node of the c-th sub-map, which is determined as the first node distance weight.
In practice, f(V_i, W_cj) may be expressed as f(V_i, W_cj) = cos(V_i, W_cj).
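A sketch of the first-weight computation under the formula above, using cosine similarity as f and with the node distance weights precomputed per vocabulary node (all names and inputs are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def first_weight(keyword_vecs, node_vecs, node_dist_weights):
    """G1_c: sigmoid of the best keyword/node similarity in sub-map c,
    times sigmoid of that best-matching node's distance weight (max_id)."""
    best_sim, best_j = float("-inf"), 0
    for v in keyword_vecs:
        for j, w in enumerate(node_vecs):
            s = cos_sim(v, w)
            if s > best_sim:
                best_sim, best_j = s, j
    return sigmoid(best_sim) * sigmoid(node_dist_weights[best_j])
```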
Similarly, the word vector of the keyword group is matched with the word vector of the third vocabulary node in the preset knowledge graph, and it can be understood that: and acquiring the similarity between the word vector of each keyword group in the text to be classified and the word vector of each vocabulary node in the preset knowledge graph, selecting the corresponding vocabulary node with the maximum similarity as a third vocabulary node, acquiring the product of correlation coefficients between the third vocabulary node and the corresponding root node as a second node distance weight, and multiplying the second node distance weight by the similarity with the maximum value to obtain a second weight.
In addition, the matching of the keyword and the character string of the fourth vocabulary node in the preset knowledge graph can be understood as follows: and the characters of the keywords are the same as the characters of the node vocabulary corresponding to the fourth vocabulary node, so that the similarity of the character strings between the keywords and the fourth vocabulary node is determined to be more than 0, at the moment, the product of the correlation coefficients between the fourth vocabulary node and the corresponding root node is determined as a third node distance weight, and the third weight is determined as the product of the similarity of the character strings between the keywords and the fourth vocabulary node and the third node distance weight.
For example: calculating the similarity of character string matching by the following formula:
f(x,y)={1,if x=y;0,other}
where f(x, y) indicates whether the character strings of the keyword x and the vocabulary node y are the same: if the character strings are the same, the value of f(x, y) is determined to be 1; otherwise, the value of f(x, y) is determined to be 0.
For example: the first weight, the second weight and the third weight of each text classification can be fused through the following formulas to obtain a category confidence set:
Class_max = Max(max{c∈(1,C) | G1_c×X1 + G2_c×X2 + G3_c×X3})
wherein Class_max represents the fusion weight; C represents the number of sub-maps in the preset knowledge graph, i.e., the number of text classifications; G1_c represents the first weight corresponding to the c-th sub-map in the preset knowledge graph; G2_c represents the second weight corresponding to the c-th sub-map; G3_c represents the third weight corresponding to the c-th sub-map; X1 represents the weight value of the first weight; X2 represents the weight value of the second weight; X3 represents the weight value of the third weight.
In implementation, the values of X1, X2, and X3 may be preset or adjusted according to the application scenario of the text to be classified; the only requirement is that the sum of X1, X2, and X3 equals 1, for example: X1 = 0.5; X2 = 0.3; X3 = 0.2.
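The per-class fusion can be sketched as follows, with the weight lists and X values as assumed example inputs:

```python
def category_confidence_set(G1, G2, G3, x1=0.5, x2=0.3, x3=0.2):
    """Fuse the three weight sets class by class; Class_max is then the
    maximum element of the resulting category confidence set."""
    assert abs(x1 + x2 + x3 - 1.0) < 1e-9   # X1 + X2 + X3 must equal 1
    fused = [g1 * x1 + g2 * x2 + g3 * x3
             for g1, g2, g3 in zip(G1, G2, G3)]
    return fused, max(fused)
```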
After the category confidence set is obtained, the maximum confidence value of each text classification determined in the keyword group character string matching process may be added to an element corresponding to the text classification in the category confidence set to obtain the fusion weight.
For example: through keyword-group character-string matching, when a keyword group matches the character string of a vocabulary node in a certain sub-map, the confidence set of the text classification corresponding to that sub-map is determined to include the confidence of that keyword group. When a plurality of keyword groups in the text to be classified respectively match the character strings of different vocabulary nodes in the same sub-map, the confidence set of the corresponding text classification includes the confidences corresponding to each of those keyword groups, and the confidence with the maximum value is selected as the final confidence of the text classification corresponding to that sub-map. This is repeated in turn to determine the final confidences of the text classifications respectively corresponding to each sub-map in the preset knowledge graph and to obtain the maximum of these final confidences; when the maximum of the final confidences is greater than the second threshold, the text to be classified is determined to belong to the text classification corresponding to the final confidence with the maximum value.
Of course, based on the first weight, the second weight, the third weight, and the maximum confidence of the text classification corresponding to each sub-map in the preset knowledge graph, the specific implementation of determining the fusion weight corresponding to each sub-map may take multiple forms, for example: performing a weighted summation of the first weight, the second weight, and the third weight, and so on.
As an optional implementation manner, the determining, based on the first weight, the second weight, the third weight, and the confidence of the target text classification, a fusion weight corresponding to each sub-graph spectrum in the preset knowledge graph includes:
normalizing the first weight, the second weight and the third weight;
adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
adding the product of the second weight value after normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
determining a fusion weight value associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value and the third value corresponding to the same text classification.
In a specific implementation, the process of weight fusion may include the following specific steps:
step one, for each text classification, performing step-by-step weight multiplication (continuously multiplying correlation coefficients (namely positive and negative correlation coefficients) between matched nodes and root nodes) respectively according to distances and correlation coefficients between nodes matched with keywords or keyword groups or character strings of the keywords in a sub-map corresponding to the text classification to obtain node distance weights (namely respectively solving a first node distance weight, a second node distance weight and a third node distance weight);
step two, respectively calculating sigmoid of the first weight, the second weight and the third weight (carrying out normalization processing), and then multiplying the normalized weights by corresponding node distance weights (the first weight corresponds to the first node distance weight, the second weight corresponds to the second node distance weight, and the third weight corresponds to the third node distance weight);
and step three, combining the text classification corresponding to the C _ max and the confidence coefficient of the text classification to obtain a fusion weight of the text classification.
In this embodiment, when the confidence of the target text classification is less than or equal to the first threshold, i.e., the confidences of the text classifications corresponding to all sub-maps in the preset knowledge graph are less than or equal to the first threshold, this indicates either that the keyword groups extracted from the text to be classified cannot accurately describe its semantics, or that no vocabulary in the preset knowledge graph coincides with the extracted keyword groups; it is then necessary, as in this embodiment, to match the word vectors of the keywords, the word vectors of the keyword groups, and the character strings of the keywords against the vocabulary nodes in the preset knowledge graph.
The matching results of the word vectors of the keywords, the word vectors of the keyword groups, and the character strings of the keywords against the vocabulary nodes in the preset knowledge graph are respectively fused, which avoids the problem that, when only the character-string matching results of the keywords are considered, other character strings with the same semantics are missed. For example, "bicycle" and "bike" are close in meaning but differ as character strings, so pure string matching would omit the match. In addition, word-vector matching overcomes the inability of string-matching results to express the similarity between synonyms or near-synonyms, while matching on single keywords alone is prone to mismatches.
As can be seen from the above, this embodiment combines the shallow semantic information of the text (i.e., the keywords and keyword groups) with deep semantic association information (i.e., the word vectors) and, based on the association characteristics of the sub-maps in the knowledge graph, fuses the individual weights according to the distance between each matched vocabulary node and the root node, thereby implementing classification prediction for the text to be classified. As a result, the relevance between the text to be classified and the text classification corresponding to each sub-map can be discovered more comprehensively, improving the reliability and accuracy of the text classification method.
As an optional implementation, the method further comprises:
obtaining historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
respectively extracting a first keyword and a first keyword group from the at least two historical classified texts, and acquiring word vectors of the first keyword and the first keyword group;
determining correlation coefficients between word vectors of first keywords and word vectors of first keyword groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node to generate a sub-map corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises a sub-map corresponding to the target classification result.
In implementation, the above-mentioned historical text classification information may be understood as: the text information that has been correctly classified may specifically include a history classified text according to which the text is classified, and a classification result obtained by analyzing the history classified text.
In this embodiment, the text classification method may be divided into two processes, as shown in fig. 2a, first, a knowledge graph is constructed based on historical text classification information to obtain the preset knowledge graph; and then carrying out knowledge reasoning on semantic contents of the text to be classified based on the preset knowledge graph so as to determine a classification result of the text to be classified according to a knowledge reasoning result.
As an alternative embodiment, the first keywords extracted from the target history classified text do not include common keywords, where the common keywords represent keywords having frequencies greater than or equal to a preset frequency in the history classified texts of different classification results respectively.
That the frequency of occurrence of a certain word in a historical classified text is greater than or equal to the preset frequency may mean that the keywords extracted from that historical classified text include the word; in other words, after Chinese word segmentation of a historical classified text, a resulting word whose frequency of occurrence in the text is very low is not determined to be a keyword.
In this embodiment, the common keywords represent keywords included in the historical classified texts of various classification results, and therefore, the common keywords have no reference value for text classification, so that the keywords are deleted from the first keywords, interference of the common keywords having no reference value on the text classification results is avoided, and accuracy of the text classification results can be improved.
The process of determining a sub-graph spectrum in the preset knowledge graph based on the historical text classification information can comprise the following steps:
step one, dividing the history classified text into a positive sample, a negative sample and a background sample.
In implementation, the classification result of part of the historical classified texts may differ from their initial classification result, or may not belong to the target classification; in this case, the historical classified texts may be divided into positive samples, negative samples, and background samples during construction of the sub-map corresponding to the target classification.
The positive sample represents that the classification result of the historical classification text is the same as the initial classification result of the historical classification text and is a target classification; the negative sample indicates that the classification result of the historical classified text is different from the initial classification result of the historical classified text, and the initial classification result of the historical classified text is a target classification; the background sample indicates that neither the classification result of the history classified text nor the initial classification result of the history classified text is the target classification.
For example: when classifying a large number of events of type A, the processed event texts are divided into two parts: one part in which the event information is a type-A event and a worker, after analysis and judgment, confirms it to be a type-A event, i.e., a positive sample; and one part in which the event information is labeled a type-A event but a worker, after analysis and judgment, determines it not to be a type-A event, i.e., a negative sample. The positive samples are positively correlated with the current event category, while the negative samples are negatively correlated with it. Finally, the event texts of all other categories serve as background samples.
For example: as shown in fig. 2e, a large number of correctly classified event texts are partitioned into positive samples, negative samples, and background samples, where the background samples may cover a variety of different classification results.
And step two, respectively preprocessing the positive sample, the negative sample and the background sample.
Here, preprocessing can be understood as Chinese word segmentation and stop-word removal. Specifically, during Chinese word segmentation, the frequency of occurrence and the part of speech of each segmented word in the event text may be counted and labeled, for example: if vocabulary A appears 30 times in the event text and its part of speech is a verb, it is labeled "vocabulary A, 30, v", corresponding to the word, word frequency, and part of speech respectively. The annotation data is then used to train or fine-tune the segmenter, so that a sentence is segmented into "vocabulary A", "vocabulary B", "vocabulary C", and "vocabulary D". In implementation, to improve the accuracy and completeness of Chinese word segmentation, fixed phrases may be added manually to the segmentation dictionary; for example, after "vocabulary AB" is added to the dictionary, the segmentation result of the above sentence becomes: "vocabulary AB", "vocabulary C", "vocabulary D". Stop-word removal means removing mood words, filler words, connectives, and the like from a sentence; in implementation, the stop-word list may also be extended continuously by loading dictionaries. Finally, the keywords and keyword groups are obtained after the stop words are removed.
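The bookkeeping part of the preprocessing can be sketched as below, assuming the Chinese word segmentation itself is done elsewhere and the input is already a list of (word, part-of-speech) pairs; the stop-word list here is a placeholder:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "that"}  # in practice extended by loading dictionaries

def preprocess(tagged_tokens):
    """Drop stop words, then label every surviving word with its frequency
    and part of speech, like the "vocabulary A, 30, v" annotation above."""
    kept = [(w, pos) for w, pos in tagged_tokens if w not in STOP_WORDS]
    freq = Counter(w for w, _ in kept)
    labels = {(w, freq[w], pos) for w, pos in kept}
    return sorted(labels, key=lambda t: -t[1])   # highest word frequency first
```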
Of course, in the case of no fixed phrase, the vocabulary obtained by the chinese segmentation processing and the stop word processing may not include the key phrase.
And thirdly, respectively counting the number and the parts of speech of the keywords in the positive and negative samples, classifying according to the parts of speech of the keywords, and sequencing the keywords in each part of speech according to the word frequency.
And step four, counting the keywords of the texts of each category in the background sample, determining the common keywords that appear across the texts of all categories with the highest word frequency, and deleting those common keywords from the keywords counted respectively in the positive and negative samples.
In this step, in view of the fact that high-frequency words appearing in each type of text do not have the representativeness of the keywords, the method can be used for removing the high-frequency words with strong interference in the positive and negative samples, so that the high-frequency keywords finally retained in the positive and negative samples are words which can really reflect the characteristics of the event types.
And step five, according to the word-frequency ordering of the keywords in the positive samples, cross-combining the top-y keywords of different parts of speech whose ranks are close, to obtain high-frequency word combinations with similar frequencies of occurrence; then reversely verifying each high-frequency word combination against the historical classified text, and determining a high-frequency word combination to be a keyword group when it appears in the original text.
In the step, words with different parts of speech but close frequency of occurrence in the historical classified text are combined in a cross mode to obtain a high-frequency word combination, and the high-frequency word combination is determined to be a key word group only when the sequence of the key words in the high-frequency word combination is verified to be correct and the interval distance is short in the historical classified text in a reverse mode.
For example: as shown in fig. 2f, assuming y = 5, the keywords of each part of speech are ordered by word frequency to obtain the verb sequence v1, v2, v3, v4, v5 and the noun sequence n1, n2, n3, n4, n5. The high-frequency vocabulary combinations obtained by cross-combining the verbs in the verb sequence with the nouns in the noun sequence include v1n1, v1n2, v1n3, and so on; finally, after reverse verification, the resulting keyword groups include v1n1, v3n4, v4n5, and so on.
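The cross-combination step can be sketched as below; "close word frequency" is approximated here by closeness of rank in the two frequency-sorted lists, which is an assumption:

```python
def cross_combine(verbs, nouns, y=5, max_rank_gap=1):
    """Cross-combine the top-y verbs and top-y nouns (each already sorted
    by word frequency), pairing only words whose ranks are close."""
    combos = []
    for i, v in enumerate(verbs[:y]):
        for j, n in enumerate(nouns[:y]):
            if abs(i - j) <= max_rank_gap:  # rank gap as frequency proxy
                combos.append(v + n)
    return combos
```

Each resulting combination would still have to pass reverse verification against the original text before being accepted as a keyword group.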
And step six, obtaining the keyword and the word vector of the keyword group, and adding the keyword group into the word segmentation dictionary so as to directly determine the character string as the keyword group in the subsequent application.
The process of obtaining the word vector of the keyword in this step is the same as the process of obtaining the word vector of a certain vocabulary in the prior art, and is not specifically described here, and the word vector of the keyword group is related to the word vector of the keyword, which may specifically refer to the process of determining the word vector of the keyword group in the embodiment shown in fig. 2 c.
And seventhly, respectively constructing sub-maps corresponding to each text classification.
Each sub-map takes its corresponding text classification as the root node and, according to part of speech, takes the highest-frequency keywords (i.e., the center words) and the keyword groups as first-level sub-nodes; the attributes of each node in the sub-map include: part of speech, word frequency, and word vector. The similarity between the word vector of a center word and those of the other keywords is then calculated in turn, and when the similarity is greater than a threshold of 0.9, the other keyword is determined to be a related word (near-synonym); the related word is thus directly connected to the center word in the sub-map, with the word-vector similarity as the attribute of the connection. Proceeding recursively, the connection relationships between the first-level sub-nodes and their related words are determined; the related words then serve as second-level nodes, the similarity between the word vectors of the remaining keywords and those of the second-level nodes is calculated, and the connection relationships between the second-level nodes and the other keywords are determined based on that similarity, until the word-vector similarity of every keyword has been determined, completing the construction of the sub-map corresponding to the text classification.
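The level-by-level linking just described can be sketched as a breadth-first construction; the data layout (a dict of word vectors, edges as triples) is illustrative:

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def build_submap(root, center_word, keywords, vectors, threshold=0.9):
    """Link a remaining keyword to the current node whenever their
    word-vector similarity exceeds the threshold; the similarity itself is
    stored as the edge attribute, and each linked keyword becomes a node
    of the next level."""
    edges = [(root, center_word, 1.0)]
    frontier = [center_word]
    remaining = [k for k in keywords if k != center_word]
    while frontier:
        node = frontier.pop(0)
        unlinked = []
        for kw in remaining:
            s = cos_sim(vectors[node], vectors[kw])
            if s > threshold:
                edges.append((node, kw, s))
                frontier.append(kw)
            else:
                unlinked.append(kw)
        remaining = unlinked
    return edges
```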
In implementation, the sub-maps corresponding to each text classification are respectively constructed according to the above process to jointly form the preset knowledge map.
In implementation, a positive correlation relationship exists between a keyword in the positive samples and a first-level node in the sub-map corresponding to the current text classification, and the positive correlation coefficient α is expressed by the word-vector similarity (for example, α = cos(word vector of first-level node 1, word vector of the keyword in the positive sample)). A keyword in the negative samples and a first-level node in the sub-map corresponding to the current text classification are in a negative correlation relationship, and the negative correlation coefficient β is expressed by the word-vector similarity minus 1 (β = cos(word vector of first-level node 1, word vector of the keyword in the negative sample) - 1). These positive and negative correlation coefficients serve as the relation weights of the level-by-level connections. For example: see the sub-map of the knowledge graph shown in fig. 2d.
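The two correlation coefficients can be computed as sketched below, with α as the cosine similarity and β as that similarity minus 1 (so β ≤ 0); the function and argument names are illustrative:

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def edge_coefficients(node_vec, positive_kw_vec, negative_kw_vec):
    """alpha: positive correlation with a positive-sample keyword;
    beta: negative correlation with a negative-sample keyword."""
    alpha = cos_sim(node_vec, positive_kw_vec)
    beta = cos_sim(node_vec, negative_kw_vec) - 1.0
    return alpha, beta
```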
The following exemplifies the text classification method provided in the embodiment of the present application, taking a history classification text as an event text as an example:
as shown in fig. 3, the text classification method includes the steps of:
step 301, extracting X key phrases from the event text to be classified.
The process of extracting the key phrase from the event text to be classified is the same as the process of extracting the key phrase from the historical classified text and determining the key phrase based on the key phrase, and the process is not repeated herein.
And step 302, respectively matching the vocabulary nodes of each sub-map in the preset knowledge graph with each keyword group A_i.
Wherein the keyword group A_i represents the i-th keyword group, and the initial value of i is equal to X.
Step 303, when a vocabulary node of a certain sub-map is the same as the character string of the keyword group A_i, determining that the keyword group matches that vocabulary node of the sub-map.
And step 304, determining that the character string of the keyword group A_i matches a vocabulary node of the sub-map C_i.
And step 305, determining that the confidence set of the sub-map C_i includes the confidence of the keyword group A_i.
And step 306, subtracting 1 from the value of i, and judging whether the value of i is less than or equal to 0.
If the determination result in this step is "yes", the update process of the confidence of the sub-map C_i is ended, and step 307 is executed; otherwise, step 302 is repeated, i.e., the next keyword group among the X keyword groups is matched with the vocabulary nodes of the sub-map C_i.
Step 307, judging whether the maximum confidence included in the sub-map C_i is greater than the first threshold.
The maximum confidence included in the sub-map C_i may be expressed as C_max, and the first threshold as thresh.
If the determination result in this step is "yes", the text classification of the sub-map in which the vocabulary node matching the keyword group A_i corresponding to the maximum confidence is located is determined as the classification result of the event text to be classified; otherwise, steps 308 to 311 are respectively performed.
And step 308, determining the first weight under the condition that the word vectors of the keywords extracted from the event text to be classified are matched with the word vectors of the vocabulary nodes.
The process of determining the first weight is the same as that in the embodiment shown in fig. 1 and is not described herein again.
Step 309, determining the second weight under the condition that the word vectors of the keyword groups extracted from the event text to be classified are matched with the word vectors of the vocabulary nodes.
The process of determining the second weight is the same as that in the embodiment shown in fig. 1, and is not described herein again.
And 310, determining a third weight under the condition that the keywords extracted from the event text to be classified are matched with the character strings of the vocabulary nodes.
The process of determining the third weight is the same as that in the embodiment shown in fig. 1, and is not described herein again.
And 311, fusing the first weight, the second weight and the third weight.
In this step, the process of determining the fusion weight corresponding to each sub-map in the preset knowledge graph based on the first weight, the second weight, the third weight, and the maximum confidence of the text classification corresponding to each sub-map is the same as in the embodiment shown in fig. 1 and is not repeated here.
After step 311, in the case that the target fusion weight is greater than the second threshold, it is determined that the text to be classified belongs to the text classification corresponding to the target fusion weight, where the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph.
In the text classification method shown in fig. 3, each keyword group is matched in turn against the character strings of the vocabulary nodes in the preset knowledge graph; if the vocabulary node matching a keyword group lies in the sub-graph corresponding to a certain text classification, that is, the keyword group belongs to that sub-graph, the confidence of the keyword group is taken as the confidence of that text classification. A loop counter i (initialized to X-1, where X is the number of keyword groups) traverses the keyword groups in sequence. The loop first completes the confidence calculation of the text classifications corresponding to all the keyword groups and obtains the maximum value C_max among the classification confidences; if C_max is greater than the first threshold thresh, the text classification corresponding to C_max is output directly as the classification result of the text to be classified. If C_max is not greater than the first threshold thresh, the weights for keyword word-vector similarity matching, keyword-group word-vector similarity matching and keyword character-string matching (i.e., the first weight, the second weight and the third weight) are calculated respectively, forming three weight sets, each of which contains an element for every text classification; finally, the three sets of weights are weighted and fused, and combined with C_max, to determine the final text classification.
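The flow just described can be sketched as follows; all names (`classify`, `keyword_groups`, `subgraphs`, `fallback`) are illustrative and not taken from the patent, and `fallback` stands in for the weight-fusion path of steps 308 to 311.

```python
# Illustrative sketch of the flow above; classify, keyword_groups, subgraphs
# and fallback are hypothetical names, not from the patent text.

def classify(keyword_groups, subgraphs, thresh, fallback):
    """keyword_groups: {group_string: confidence};
    subgraphs: {classification_label: set of vocabulary-node strings}."""
    conf = {}
    # Loop over keyword groups: a group whose string matches a vocabulary
    # node in a sub-graph contributes its confidence to that classification.
    for group, c in keyword_groups.items():
        for label, nodes in subgraphs.items():
            if group in nodes:
                conf[label] = max(conf.get(label, 0.0), c)
    if conf:
        best = max(conf, key=conf.get)
        if conf[best] > thresh:      # C_max greater than thresh: output directly
            return best
    return fallback(conf)            # otherwise fall back to weight fusion

label = classify(
    {"earthquake": 0.9, "rescue": 0.6},
    {"disaster": {"earthquake", "flood"}, "sports": {"match"}},
    thresh=0.8,
    fallback=lambda conf: None,
)
# label == "disaster", since the best confidence 0.9 exceeds thresh
```
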
The weight fusion process may include:
step one: for each text classification, obtain the node distance weight by step-by-step weight multiplication, i.e., continuously multiply the correlation coefficients (positive or negative) between the root node and each node matched by the keyword word vectors, the keyword-group word vectors or the keyword character strings, according to the distance of those nodes from the root node in the sub-graph corresponding to the text classification;
step two: apply a sigmoid function to the first weight, the second weight and the third weight respectively (i.e., normalize them), and then multiply each normalized weight by the corresponding node distance weight;
step three: merge in the text classification corresponding to C_max together with its confidence;
step four: sort the confidences of all the text classifications by magnitude; the classification with the highest confidence that is also greater than the threshold thresh is the text classification to which the text to be classified belongs.
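The four fusion steps above can be sketched as follows, under the assumption that each match is modeled as a classification label, a raw weight, and the list of correlation coefficients along the path from its matched node to the root node; every name here is hypothetical.

```python
import math

# Hypothetical sketch of the four fusion steps; each match is modeled as
# (classification_label, raw_weight, path_coeffs), where path_coeffs are the
# correlation coefficients between the matched node and the root node.

def node_distance_weight(path_coeffs):
    # Step one: continuously multiply the correlation coefficients along
    # the path from the matched node to the root node.
    w = 1.0
    for c in path_coeffs:
        w *= c
    return w

def sigmoid(x):
    # Step two: normalize a raw weight into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def fuse(matches, c_max_label, c_max, thresh):
    # Step three: start from the classification carrying C_max.
    conf = {c_max_label: c_max}
    for label, raw, path in matches:
        score = sigmoid(raw) * node_distance_weight(path)
        conf[label] = max(conf.get(label, 0.0), score)
    # Step four: the classification with the highest confidence wins,
    # provided it is greater than the threshold.
    best = max(conf, key=conf.get)
    return best if conf[best] > thresh else None
```
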
In this embodiment, to address the defect that event classification based only on character-string matching is prone to missed matches or mismatches, the embodiment of the present application uses keyword-group matching as enhanced-information text matching on the basis of keyword matching, and takes the matching of keywords and keyword groups against their corresponding word vectors as a leading factor, so that the finally obtained text classification result is more accurate.
It should be noted that, in the text classification method provided in the embodiments of the present application, the execution subject may be a text classification device, or a control module in the text classification device for executing the text classification method. In the embodiments of the present application, a text classification device executing the text classification method is taken as an example to describe the text classification device provided in the embodiments of the present application.
Referring to fig. 4, which is a structural diagram of a text classification device according to an embodiment of the present application, as shown in fig. 4, the text classification device 400 includes:
a first extraction module 401, configured to extract a keyword group in a text to be classified;
a first determining module 402, configured to determine confidence levels corresponding to M text classifications according to confidence levels corresponding to N key phrases in the to-be-classified texts when N key phrases matched with character strings of vocabulary nodes in a preset knowledge graph exist, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one key phrase in the N key phrases, and M and N are integers greater than or equal to 1;
a second determining module 403, configured to determine that the text to be classified belongs to the target text classification when a confidence of a target text classification in the M text classifications is greater than a first threshold, where the target text classification is a text classification with a highest confidence in the M text classifications.
Optionally, the first determining module 402 is specifically configured to:
in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in the preset knowledge graph, determine, for each sub-graph in the M sub-graphs, the maximum confidence corresponding to the at least one keyword group matched with the character strings of the vocabulary nodes in the sub-graph as the confidence of the text classification corresponding to the sub-graph.
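A minimal sketch of this rule, with illustrative data structures (a mapping from keyword group to confidence, and a mapping from classification label to the set of vocabulary-node strings in its sub-graph):

```python
# Minimal illustration of the rule above; data structures are hypothetical.

def subgraph_confidences(group_conf, subgraphs):
    """group_conf: {keyword_group: confidence};
    subgraphs: {classification_label: set of vocabulary-node strings}."""
    result = {}
    for label, nodes in subgraphs.items():
        # Confidences of the keyword groups matching this sub-graph's nodes.
        matched = [c for g, c in group_conf.items() if g in nodes]
        if matched:
            # The maximum becomes the confidence of this classification.
            result[label] = max(matched)
    return result
```
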
Optionally, in a case that the confidence of the target text classification is less than or equal to the first threshold, the text classification apparatus 400 further includes:
a third determining module, configured to determine, based on the first weight, the second weight, the third weight and the confidence of the target text classification, fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
a fourth determining module, configured to determine, in the case that a target fusion weight is greater than a second threshold, that the text to be classified belongs to the text classification corresponding to the target fusion weight, where the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
the first weight is used for representing similarity between word vectors of keywords in the text to be classified and word vectors of vocabulary nodes in the preset knowledge graph, the second weight is used for representing similarity between word vectors of the keyword groups in the text to be classified and word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is used for representing similarity between the keywords in the text to be classified and character strings of the vocabulary nodes in the preset knowledge graph.
Optionally, the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph and a first node distance weight, the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
Optionally, the third determining module includes:
the normalization processing unit is used for performing normalization processing on the first weight, the second weight and the third weight;
the first data processing unit is used for adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
the second data processing unit is used for adding the product of the second weight value after the normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
the third data processing unit is used for adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
a determining unit, configured to determine a fusion weight associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value, and the third value corresponding to a same text classification.
Optionally, the text classification apparatus 400 further includes:
the acquisition module is used for acquiring historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
the second extraction module is used for respectively extracting a first keyword and a first keyword group from the at least two historical classified texts and acquiring word vectors of the first keyword and the first keyword group;
the generation module is used for determining correlation coefficients between word vectors of first key words and word vectors of first key word groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node so as to generate sub-maps corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises sub-maps corresponding to the target classification result.
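A sketch of what the generation module might compute, assuming cosine similarity between word vectors stands in for the patent's unspecified correlation coefficient; `build_subgraph` and its arguments are hypothetical names:

```python
import math

# Sketch of the generation module; cosine similarity is an assumed stand-in
# for the unspecified correlation coefficient, and build_subgraph and its
# arguments are hypothetical names.

def correlation(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_subgraph(root_label, root_vec, term_vectors):
    """root_label: the target classification result (root node);
    term_vectors: {keyword_or_group: word_vector} extracted from the
    historical texts classified under root_label."""
    return {
        "root": root_label,
        # One edge per extracted keyword / keyword group, weighted by its
        # correlation coefficient with the root node's vector.
        "edges": {term: correlation(vec, root_vec)
                  for term, vec in term_vectors.items()},
    }
```
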
Optionally, the first keywords extracted from the target history classified text do not include common keywords, where the common keywords represent keywords whose occurrence frequencies in the history classified texts with different classification results are greater than or equal to a preset frequency.
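The common-keyword filter described here can be sketched as follows, under the assumption that "frequency of occurrence" is counted per classification and that a keyword is common when it is frequent under more than one classification result; all names are illustrative:

```python
from collections import Counter

# Sketch of filtering out "common keywords": terms whose occurrence frequency
# meets or exceeds a preset frequency in historical texts of more than one
# classification. The per-class counting scheme is an assumption.

def filter_common(keywords_by_class, min_freq):
    """keywords_by_class: {classification_label: list of keywords (with repeats)}."""
    per_class = {label: Counter(kws) for label, kws in keywords_by_class.items()}
    common = set()
    for kw in {k for c in per_class.values() for k in c}:
        # Classifications in which this keyword reaches the preset frequency.
        hot = [label for label, c in per_class.items() if c[kw] >= min_freq]
        if len(hot) > 1:  # frequent under different classification results
            common.add(kw)
    return {label: [k for k in kws if k not in common]
            for label, kws in keywords_by_class.items()}
```
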
The text classification device provided in the embodiment of the present application can perform each process in the method embodiments shown in fig. 1 or fig. 3, and can obtain the same beneficial effects, and is not described herein again to avoid repetition.
The text classification device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
Optionally, as shown in fig. 5, an electronic device 500 is further provided in this embodiment of the present application, and includes a processor 501, a memory 502, and a program or an instruction stored in the memory 502 and executable on the processor 501, where the program or the instruction is executed by the processor 501 to implement each process of the foregoing text classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the foregoing text classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the text classification method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order depending on the functions involved; e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of text classification, comprising:
extracting key phrases in the text to be classified;
in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in a preset knowledge graph, determining confidences respectively corresponding to M text classifications according to the confidences corresponding to the N keyword groups, wherein the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one keyword group in the N keyword groups, and M and N are integers greater than or equal to 1;
and under the condition that the confidence of a target text classification in the M text classifications is greater than a first threshold value, determining that the text to be classified belongs to the target text classification, wherein the target text classification is the text classification with the highest confidence in the M text classifications.
2. The method according to claim 1, wherein determining confidence levels corresponding to M text classifications according to the confidence levels corresponding to the N keyword groups comprises:
and aiming at each sub-map in the M sub-maps, determining the maximum confidence corresponding to at least one keyword group matched with the character strings of the vocabulary nodes in the sub-map as the confidence of the text classification corresponding to the sub-map.
3. The method of claim 1, wherein in the case that the confidence level of the target text classification is less than or equal to the first threshold, the method further comprises:
determining fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph based on the first weight, the second weight, the third weight and the confidence of the target text classification;
in the case that a target fusion weight is greater than a second threshold, determining that the text to be classified belongs to the text classification corresponding to the target fusion weight, wherein the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
the first weight is used for representing similarity between word vectors of keywords in the text to be classified and word vectors of vocabulary nodes in the preset knowledge graph, the second weight is used for representing similarity between word vectors of the keyword groups in the text to be classified and word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is used for representing similarity between the keywords in the text to be classified and character strings of the vocabulary nodes in the preset knowledge graph.
4. The method according to claim 3, wherein the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph, and a first node distance weight, and the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
5. The method according to claim 4, wherein the determining a fusion weight corresponding to each sub-graph spectrum in the preset knowledge graph based on the first weight, the second weight, the third weight, and the confidence of the target text classification comprises:
normalizing the first weight, the second weight and the third weight;
adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
adding the product of the second weight value after normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
determining a fusion weight value associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value and the third value corresponding to the same text classification.
6. The method of claim 1, further comprising:
obtaining historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
respectively extracting a first keyword and a first keyword group from the at least two historical classified texts, and acquiring word vectors of the first keyword and the first keyword group;
determining correlation coefficients between word vectors of first keywords and word vectors of first keyword groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node to generate a sub-map corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises a sub-map corresponding to the target classification result.
7. The method according to claim 6, wherein the first keywords extracted from the target historical classified text do not include common keywords, wherein the common keywords represent keywords respectively having frequencies of occurrence greater than or equal to a preset frequency in the historical classified texts of different classification results.
8. A text classification apparatus, comprising:
the first extraction module is used for extracting key phrases in the text to be classified;
a first determining module, configured to determine, in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in a preset knowledge graph, confidences respectively corresponding to M text classifications according to the confidences corresponding to the N keyword groups, wherein the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one keyword group in the N keyword groups, and M and N are integers greater than or equal to 1;
a second determining module, configured to determine that the text to be classified belongs to the target text classification when a confidence of a target text classification in the M text classifications is greater than a first threshold, where the target text classification is a text classification with a highest confidence among the M text classifications.
9. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the text classification method according to any one of claims 1-7.
10. A readable storage medium, on which a program or instructions are stored which, when executed by a processor, carry out the steps of the text classification method according to any one of claims 1 to 7.
CN202110591719.7A 2021-05-28 2021-05-28 Text classification method and device, electronic equipment and text classification program Active CN113254643B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591719.7A CN113254643B (en) 2021-05-28 2021-05-28 Text classification method and device, electronic equipment and text classification program


Publications (2)

Publication Number Publication Date
CN113254643A true CN113254643A (en) 2021-08-13
CN113254643B CN113254643B (en) 2023-10-27

Family

ID=77185173


Country Status (1)

Country Link
CN (1) CN113254643B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779259A (en) * 2021-11-15 2021-12-10 太平金融科技服务(上海)有限公司 Text classification method and device, computer equipment and storage medium
CN114038542A (en) * 2021-10-12 2022-02-11 吉林医药学院 Medical information sharing method and system based on medical big data
CN114049505A (en) * 2021-10-11 2022-02-15 数采小博科技发展有限公司 Method, device, equipment and medium for matching and identifying commodities
CN117150046A (en) * 2023-09-12 2023-12-01 广东省华南技术转移中心有限公司 Automatic task decomposition method and system based on context semantics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063093A1 (en) * 2014-08-27 2016-03-03 Facebook, Inc. Keyword Search Queries on Online Social Networks
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
WO2021042503A1 (en) * 2019-09-06 2021-03-11 平安科技(深圳)有限公司 Information classification extraction method, apparatus, computer device and storage medium
CN112765357A (en) * 2021-02-05 2021-05-07 北京灵汐科技有限公司 Text classification method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Yi; FENG Zi'en; WAN Xiaoxian: "An Emergency Question-Answering System Based on Knowledge Graph", Computer & Telecommunication, no. 04 *
SUO Hongguang; LIU Yushu; CAO Shuying: "A Keyword Extraction Method Based on Lexical Chains", Journal of Chinese Information Processing, no. 06 *

Also Published As

Publication number Publication date
CN113254643B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Zhang Qikun; Wu Zhenzhi
Inventor before: Zhang Qikun; Wu Zhenzhi