CN113254643A - Text classification method and device, electronic equipment and - Google Patents

Text classification method and device, electronic equipment and

Info

Publication number
CN113254643A
CN113254643A
Authority
CN
China
Prior art keywords
text
weight
classified
classification
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591719.7A
Other languages
Chinese (zh)
Other versions
CN113254643B (en)
Inventor
张启坤
吴瑧志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202110591719.7A priority Critical patent/CN113254643B/en
Publication of CN113254643A publication Critical patent/CN113254643A/en
Application granted granted Critical
Publication of CN113254643B publication Critical patent/CN113254643B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 Clustering; Classification (G06F16/00 Information retrieval; database structures therefor; file system structures therefor; G06F16/30 of unstructured textual data)
    • G06F16/367 Ontology (G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
    • G06F40/194 Calculation of difference between files (G06F40/00 Handling natural language data; G06F40/10 Text processing)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
    • G06F40/30 Semantic analysis (G06F40/00 Handling natural language data)

Abstract

The application discloses a text classification method and apparatus, and an electronic device. The text classification method includes the following steps: extracting key phrases from the text to be classified; when there are, among the key phrases in the text to be classified, N key phrases that match character strings of vocabulary nodes in a preset knowledge graph, determining the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are positive integers; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, determining that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications. The method and apparatus can improve the efficiency of text classification.

Description

Text classification method and device, electronic equipment and
Technical Field
The application belongs to the technical field of computers, and particularly relates to a text classification method and device and electronic equipment.
Background
With the proliferation of mobile communication devices, more and more unstructured or semi-structured data resources (e.g., text information) are emerging. How to accurately and efficiently determine the category of such unstructured or semi-structured text information has become an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present application aim to provide a text classification method, a text classification apparatus, and an electronic device that can accurately and efficiently determine the category of text information.
In order to solve the technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a text classification method, including:
extracting key phrases from the text to be classified;
when N of the key phrases in the text to be classified match character strings of vocabulary nodes in a preset knowledge graph, determining the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1;
and when the confidence of a target text classification among the M text classifications is greater than a first threshold, determining that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
a first extraction module, configured to extract key phrases from the text to be classified;
a first determining module, configured to, when N of the key phrases in the text to be classified match character strings of vocabulary nodes in a preset knowledge graph, determine the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1;
a second determining module, configured to determine that the text to be classified belongs to a target text classification when the confidence of the target text classification among the M text classifications is greater than a first threshold, the target text classification being the text classification with the highest confidence among the M text classifications.
In a third aspect, an embodiment of the present application provides an electronic device including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions that, when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip including a processor and a communication interface coupled to the processor, where the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the present application, key phrases are extracted from the text to be classified; when N of those key phrases match character strings of vocabulary nodes in a preset knowledge graph, the confidences corresponding to M text classifications are determined according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, the text to be classified is determined to belong to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications. In this way, the efficiency of text classification can be improved.
Drawings
Fig. 1 is a flowchart of a text classification method provided in an embodiment of the present application;
fig. 2a is an architecture diagram of a text classification method provided in an embodiment of the present application;
fig. 2b is a flowchart of extracting a keyword group from a text to be classified in a text classification method according to an embodiment of the present application;
fig. 2c is a schematic diagram illustrating a relationship between word vectors of keywords and word vectors of keyword groups in a text classification method according to an embodiment of the present application;
fig. 2d is a structural diagram of a preset knowledge graph in a text classification method according to an embodiment of the present application;
fig. 2e is a flowchart of extracting a keyword group from historical classified text in a text classification method according to an embodiment of the present application;
fig. 2f is a schematic diagram illustrating a relationship between a keyword and a keyword group in a text classification method according to an embodiment of the present application;
FIG. 3 is a flowchart of another text classification method provided by an embodiment of the present application;
fig. 4 is a structural diagram of a text classification apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements, and do not necessarily describe a particular sequence or chronological order. It should be appreciated that data so termed may be interchanged under appropriate circumstances, so that embodiments of the application can be practiced in sequences other than those illustrated or described herein. Moreover, the terms "first", "second", and the like do not limit the number of objects; for example, the first object may be one or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The embodiments of the present application are applied to text classification, where the text to be classified may be unstructured or semi-structured text information, for example: short messages, speech recognition results from telephone calls, WeChat messages, e-mails, and the like. After the text is classified, staff or electronic equipment can conveniently perform processing targeted to the classification result of the text, so that useful information can be acquired from the text promptly and accurately.
In the related art, unstructured or semi-structured text information may be classified so that the classified text can be processed in a targeted manner. For example, the classification of events mostly depends on the empirical analysis of staff. The general classification process is as follows: 1) summarize, from experience, the high-frequency vocabularies of different classifications; 2) search for the high-frequency words in the text to be classified to roughly screen it; 3) finely classify the text to be classified by comparing it with the keywords of already-classified texts, dividing different texts into specific classification results; 4) route the texts to be classified according to the fine classification results, so that the texts corresponding to each classification result can be analyzed for key problems in a targeted manner.
It can be seen that the processing flow of the text classification method in the related art is cumbersome and inefficient.
For convenience of description, the embodiments of the present application are described below by taking event text as an example of the text to be classified:
For example: with the popularization of mobile communication devices, more and more events reach information processing departments in unstructured or semi-structured form, via telephone calls, short messages, WeChat messages, e-mails, and the like. Information processing departments therefore receive a large amount of text data every day, and these colloquial text data resources may contain much valuable event information, such as: person names, license plate numbers, addresses, and so on. However, it is often difficult for a worker to obtain accurate and effective event information from such a large amount of unstructured data, so analyzing and extracting key information from large volumes of text data is a topic of great research significance and application value.
In current event classification, the high-frequency words of various events are summarized based on the working experience of staff, so that when a certain high-frequency word is found in a text to be classified, the event classification of that text is determined to be the event classification corresponding to the high-frequency word.
However, many empirically summarized high-frequency words are colloquial; it is difficult to exhaust the various similar expressions of a high-frequency word, and such words cannot represent the event information well, resulting in incomplete event screening. For example: "electric tricycle stolen" and "electric bicycle stolen" describe similar event content, but their colloquial expressions differ, so a worker cannot exhaust the many possible expressions of an event from experience, and the event content in a colloquial event text is easily missed.
As can be seen from the above, in the related art it is difficult for staff to accurately and timely judge the nature (category) of the large number of specific events generated every day, and therefore difficult to take measures corresponding to the event category. The text classification method provided by the embodiments of the present application can be applied to the classification of such events, and can improve the efficiency and reliability of event classification.
The text classification method, the text classification device, the electronic device, and the readable storage medium according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text classification method according to an embodiment of the present application. As shown in fig. 1, the method may include the following steps:
Step 101, extract key phrases from the text to be classified.
In an implementation, the keyword group refers to a combination of at least two keywords whose positions in the text to be classified are close to each other. For example: the keyword group includes a noun and a verb, and the position of the noun in the text to be classified lies within the Z characters following the position of the verb.
Compared with a single keyword, a keyword group can express more accurate and complete semantics. For example: assume that only keyword strings are matched. Event A, "found wallet stolen on the subway", and event B, "apples stolen from the bicycle basket", both match the string of the keyword "stolen", yet event A concerns a stolen wallet while event B concerns stolen fruit, and the two are completely different in nature. In the embodiments of the present application, the keyword group wallet/handbag/money + stolen can characterize event A, and the keyword group apple/fruit basket/food + stolen can characterize event B, so the matching result obtained by string matching based on keyword groups is more accurate.
In practice, the above-mentioned key phrases may be determined by the following process:
performing word segmentation and stop-word removal on the text to be classified to obtain a plurality of segmented words, and determining the frequency of occurrence of each segmented word in the text through word-frequency statistics;
determining the segmented words whose frequency of occurrence is greater than a preset frequency as keywords;
grouping the keywords by part of speech, for example, dividing the keywords into verbs and nouns;
combining pairs of keywords that have different parts of speech and similar frequencies of occurrence to obtain a plurality of keyword combinations;
searching the text to be classified for each keyword combination, and determining a keyword combination to be a keyword group when the position information of its two keywords satisfies a preset condition. The position information of the two keywords satisfies the preset condition when: the order of the two keywords is as expected (for example, noun before verb, or verb before noun), and the number of characters between the two keywords is small (for example, fewer than 5 or 10 characters).
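As a rough illustration of the extraction steps above (all function and variable names are hypothetical; a real implementation would use a Chinese word segmenter, while here the text is assumed to be already tokenized and tagged), the keyword-group screening could be sketched as:

```python
from collections import Counter

def extract_keyword_groups(tokens, pos_of, min_freq=2, max_gap=5):
    """Count token frequencies, keep words above a preset frequency as
    keywords, split them by part of speech, pair nouns with verbs, and keep
    only pairs that co-occur within max_gap tokens of each other."""
    freq = Counter(tokens)
    keywords = [w for w, c in freq.items() if c >= min_freq]
    nouns = [w for w in keywords if pos_of.get(w) == "noun"]
    verbs = [w for w in keywords if pos_of.get(w) == "verb"]
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in keywords}

    groups = []
    for n in nouns:
        for v in verbs:
            # reverse verification: the pair must actually appear close together
            if any(0 < abs(pn - pv) <= max_gap
                   for pn in positions[n] for pv in positions[v]):
                groups.append((n, v))
    return groups
```

For instance, with `tokens = ["wallet", "stolen", "on", "subway", "wallet", "stolen"]` and `pos_of = {"wallet": "noun", "stolen": "verb"}`, the sketch yields the single keyword group `("wallet", "stolen")`.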
Step 102, when there are, among the key phrases in the text to be classified, N key phrases that match character strings of vocabulary nodes in a preset knowledge graph, determine the confidences corresponding to M text classifications according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1.
The M text classifications are in one-to-one correspondence with the M sub-graphs, and the character strings of the vocabulary nodes of each of the M sub-graphs match at least one of the N key phrases.
Determining the confidences of the M text classifications according to the confidences of the N key phrases may include: for each of the M sub-graphs, determining the at least one keyword group that matches the character strings of its vocabulary nodes, and taking the maximum confidence among those matched keyword groups as the confidence of the text classification corresponding to that sub-graph.
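Sketched in code (the names `matched` and `phrase_confidence` are hypothetical), the per-sub-graph maximum described above is simply:

```python
def classification_confidences(matched, phrase_confidence):
    """For each text classification (one per sub-graph), take the highest
    confidence among the keyword groups that matched its vocabulary nodes."""
    return {
        classification: max(phrase_confidence[g] for g in groups)
        for classification, groups in matched.items()
    }
```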
In the related art, a knowledge graph is constructed from three parts: keywords, high-frequency phrases (i.e., the keyword groups), and the word vectors of the keywords and high-frequency phrases. The keywords are divided into central words and associated words: a central word is generally the word or phrase that appears most often in an event, and an associated word is generally a synonym or near-synonym of a central word. A high-frequency phrase is generally composed of two related central words. Central words, associated words, and high-frequency phrases all exist as graph nodes; each takes its own word vector as a feature attribute, and the correlation between nodes (word-vector similarity) serves as the connection weight of the adjacent edge.
In the embodiments of the present application, the preset knowledge graph may be determined from historical text classification information. Specifically, the keywords and high-frequency phrases in the preset knowledge graph are extracted from historical text classification information, which includes text information and the actual classification results of that text information. In addition, some vocabulary with a low frequency of occurrence extracted from the historical text classification information can be added to the keyword groups by manual configuration. For example: as shown in fig. 2a, after word segmentation and stop-word removal are performed on the historical text classification information to obtain its segmented vocabulary, the keywords and keyword groups in the knowledge graph may be determined based on the frequency of occurrence of the segmented vocabulary in that information, and during this process keyword groups may be added to or deleted from the knowledge graph, updating the graph accordingly. Of course, as shown in fig. 2a, in an implementation, keywords and keyword groups appearing in the text to be classified may also be added to the knowledge graph to update it.
The preset knowledge graph in the embodiments of the present application comprises a plurality of sub-graphs. Different sub-graphs take different text classifications (such as event classifications) as root nodes; the keywords and high-frequency phrases are connected in turn to form the sub-graph of the current text classification, and the sub-graphs of all text classifications together form the whole preset knowledge graph. In other words, the statement in step 102 that the M text classifications correspond to M sub-graphs in the preset knowledge graph can be expressed as: the root nodes of the M sub-graphs are in one-to-one correspondence with the M text classifications.
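One minimal way to picture this layout (the data and field names below are illustrative, not taken from the patent) is a mapping from each classification root to its vocabulary nodes, with each node carrying its word vector as a feature attribute:

```python
# Each sub-graph is rooted at a text classification; keyword and keyword-group
# nodes hang off the root, each storing its word vector as a feature attribute.
preset_knowledge_graph = {
    "theft": {
        "keywords": {"wallet": [0.1, 0.7], "stolen": [0.3, 0.5]},
        "keyword_groups": {("wallet", "stolen"): [0.4, 1.2]},
    },
    "fraud": {
        "keywords": {"transfer": [0.9, 0.2]},
        "keyword_groups": {},
    },
}

def sub_graph_of(classification):
    """Return the sub-graph whose root node is the given text classification."""
    return preset_knowledge_graph[classification]
```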
It should be noted that the statement in step 102 that there are, among the keyword groups in the text to be classified, N keyword groups matching the character strings of vocabulary nodes in the preset knowledge graph can be understood as follows: each of the N keyword groups includes at least two keywords, and those keywords are respectively the same as the keywords associated with at least two connected vocabulary nodes in the preset knowledge graph.
In addition, in an implementation, one or more keyword groups may be extracted from the text to be classified. When there are multiple keyword groups, each may be matched against the vocabulary nodes of each sub-graph, so the vocabulary nodes in one sub-graph may be matched by multiple keyword groups; that is, N may be greater than M. In this case, determining the confidences of the M text classifications according to the confidences of the N keyword groups can be understood as: for each sub-graph, the highest confidence among its matched keyword groups is taken as the confidence of that sub-graph, that is, as the confidence of the text classification corresponding to it.
Of course, in practical applications the same keyword group in the text to be classified may match vocabulary nodes in several sub-graphs, so N may also be smaller than M.
Step 103, when the confidence of a target text classification among the M text classifications is greater than a first threshold, determine that the text to be classified belongs to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
Step 102 may be a recursive process. In an implementation, the confidence corresponding to each text classification may be determined (it is possible that none of the vocabulary nodes in some sub-graph match any keyword group of the text to be classified, in which case the confidence of that sub-graph defaults to 0). Step 103 may then determine the text classification with the highest confidence among all text classifications in the preset knowledge graph as the target text classification, compare the confidence of the target text classification with the first threshold, and determine that the text to be classified belongs to the target text classification when that confidence is greater than the first threshold.
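The step-103 decision is an argmax followed by a threshold check. A sketch (names are hypothetical; returning `None` signals that no classification cleared the first threshold, so the optional fusion-weight fallback described later applies):

```python
def decide_classification(confidences, first_threshold):
    """Pick the text classification with the highest confidence; accept it
    only if that confidence exceeds the first threshold."""
    if not confidences:
        return None
    target = max(confidences, key=confidences.get)
    return target if confidences[target] > first_threshold else None
```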
In the embodiments of the present application, key phrases are extracted from the text to be classified; when N of those key phrases match character strings of vocabulary nodes in a preset knowledge graph, the confidences corresponding to M text classifications are determined according to the confidences corresponding to the N key phrases, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in each of the M sub-graphs match at least one of the N key phrases, and M and N are integers greater than or equal to 1; and when the confidence of a target text classification among the M text classifications is greater than a first threshold, the text to be classified is determined to belong to the target text classification, the target text classification being the text classification with the highest confidence among the M text classifications.
In this way, since keyword groups embody semantics better, string matching between the keyword groups of the text to be classified and the vocabulary nodes of the preset knowledge graph yields matching results that better reflect the semantic relevance between the text and the vocabulary nodes. After the degree of matching between each keyword group of the text to be classified and each vocabulary node of the preset knowledge graph is determined by successive recursion, the sub-graph whose vocabulary nodes best match the keyword groups of the text can be identified, and the text can be determined to belong to the text classification corresponding to that sub-graph. Staff therefore no longer need to manually compare the text to be classified against each text classification, which improves text classification efficiency.
As an optional implementation, when the confidence of the target text classification is less than or equal to the first threshold, the method further includes:
determining the fusion weight corresponding to each sub-graph in the preset knowledge graph based on a first weight, a second weight, a third weight, and the confidence of the target text classification;
determining that the text to be classified belongs to the text classification corresponding to a target fusion weight when the target fusion weight is greater than a second threshold, the target fusion weight being the maximum among the fusion weights corresponding to the sub-graphs in the preset knowledge graph;
where the first weight characterizes the similarity between the word vectors of the keywords in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, the second weight characterizes the similarity between the word vectors of the keyword groups in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight characterizes the similarity between the keywords in the text to be classified and the character strings of the vocabulary nodes in the preset knowledge graph.
In a specific implementation, the statement that the first weight characterizes the similarity between the word vectors of the keywords in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph can be understood as meaning that the first weight is positively correlated with that similarity. Likewise, the second weight is positively correlated with the similarity between the word vectors of the keyword groups in the text to be classified and the word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is positively correlated with the similarity between the keywords in the text to be classified and the character strings of the vocabulary nodes in the preset knowledge graph.
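The patent does not fix a concrete similarity measure; cosine similarity is a common choice for the word-vector similarities that the first and second weights are positively correlated with, for example:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors; returns 0.0 for a zero
    vector so the derived weight degrades gracefully."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```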
In an implementation, a keyword group may be a combination of two keywords of different parts of speech with similar frequencies of occurrence in the text to be classified. For example: as shown in fig. 2b, the text to be classified may first be preprocessed, i.e., Chinese word segmentation and stop-word removal, to obtain a set of keywords. The keywords are then classified by part of speech, for example into verbs and nouns; the keywords of each part of speech are sorted by their frequency of occurrence in the text to be classified (i.e., by word frequency), and the confidence of each keyword is computed. Two keywords of different parts of speech and similar word frequencies are then combined into a keyword combination. After keyword combinations have been screened out of the text in this forward pass, a reverse verification checks whether each combination actually appears in the text to be classified; a combination is determined to be a keyword group only if it does. Specifically, the reverse verification may search the text for the positions of the keywords of the combination and confirm the combination when the order of the two is as expected and few characters separate them.
In this embodiment, the fusion weight determined based on the first weight, the second weight, the third weight, and the confidence of the target text classification can comprehensively reflect the similarity of the keywords' word vectors, the similarity of the keyword groups' word vectors, and the similarity of the character strings between each sub-map and the text to be classified, so that the text classification result determined based on the target fusion weight is more reliable.
Optionally, the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph and a first node distance weight, the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
In a specific implementation, the word vector of the keyword group may be determined according to the word vectors and the confidences of the keywords included therein, for example: as shown in fig. 2c, assuming that the confidence of keyword a is C_a, the confidence of keyword b is C_b, the word vector of keyword a is V_a, and the word vector of keyword b is V_b, then if keyword a and keyword b form the keyword group a-b, the confidence of the keyword group is: C_a × (1 + C_b), and the word vector of the keyword group is: C_a × V_a + C_b × V_b.
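The keyword-group confidence and word vector just described can be sketched as follows (a minimal sketch with illustrative names; vectors are plain Python lists):

```python
def combine_keywords(c_a, v_a, c_b, v_b):
    """Confidence and word vector of the keyword group a-b, following the
    example in fig. 2c."""
    conf = c_a * (1 + c_b)                      # C_a * (1 + C_b)
    vec = [c_a * xa + c_b * xb                  # C_a*V_a + C_b*V_b
           for xa, xb in zip(v_a, v_b)]
    return conf, vec
```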
In addition, determining the first node distance weight based on the correlation between the second vocabulary node and the root node can be understood as follows: the correlation coefficients along the path between the second vocabulary node and the root node are multiplied together to obtain the first node distance weight, for example: as shown in fig. 2d, assuming that the second vocabulary node is the related word a, the correlation coefficient between this vocabulary node and the root node A is the product of a first correlation coefficient between the related word a and the center word B and a second correlation coefficient between the center word B and the root node A.
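The step-by-step multiplication of correlation coefficients can be sketched as below; the path order and coefficient values are illustrative:

```python
import math

def node_distance_weight(path_coefficients):
    """Multiply the correlation coefficients along the path from a matched
    vocabulary node up to the root node of its sub-map, e.g. related word a
    -> center word B -> root node A."""
    return math.prod(path_coefficients)
```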
In addition, the word vector of the keyword is matched with the word vector of the second vocabulary node in the preset knowledge graph, and it can be understood that: the similarity between the word vector of the keyword and the word vector of the second vocabulary node is the largest, in other words, the similarity between the word vector of each keyword in the text to be classified and the vocabulary of each vocabulary node in the preset knowledge graph needs to be respectively determined, and the maximum value of the similarity is selected to determine the vocabulary node corresponding to the similarity with the maximum value as the second vocabulary node. For example: the first weight may be calculated by the following formula:
G1_c = sigmoid(max{i∈(1,n), j∈(1,m) | f(V_i, W_cj)}) × sigmoid(max_id)
wherein G1_c represents the first weight; sigmoid represents normalization processing, so that the normalized value lies between 0 and 1; n is the total number of keywords in the text to be classified; V_i represents the word vector of the i-th keyword in the text to be classified; W_cj represents the word vector of the j-th vocabulary node in the c-th sub-map of the preset knowledge graph; f(V_i, W_cj) is a function that obtains the similarity between V_i and W_cj; m represents the total number of vocabulary nodes in the c-th sub-map of the preset knowledge graph; max_id denotes: when the similarity between the word vector of the i-th keyword in the text to be classified and the word vector of the j-th vocabulary node in the c-th sub-map is the maximum over all keyword/vocabulary-node pairs in the preset knowledge graph, the product of the correlation coefficients between that j-th vocabulary node and the root node of the c-th sub-map, which is determined as the first node distance weight.
In practice, f(V_i, W_cj) may be expressed as f(V_i, W_cj) = cos(V_i, W_cj).
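A sketch of the first-weight computation under the formula above, using cosine similarity as f and with the node distance weights precomputed per vocabulary node (all names and inputs are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def first_weight(keyword_vecs, node_vecs, node_dist_weights):
    """G1_c: sigmoid of the best keyword/node similarity in sub-map c,
    times sigmoid of that best-matching node's distance weight (max_id)."""
    best_sim, best_j = float("-inf"), 0
    for v in keyword_vecs:
        for j, w in enumerate(node_vecs):
            s = cos_sim(v, w)
            if s > best_sim:
                best_sim, best_j = s, j
    return sigmoid(best_sim) * sigmoid(node_dist_weights[best_j])
```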
Similarly, the word vector of the keyword group is matched with the word vector of the third vocabulary node in the preset knowledge graph, and it can be understood that: and acquiring the similarity between the word vector of each keyword group in the text to be classified and the word vector of each vocabulary node in the preset knowledge graph, selecting the corresponding vocabulary node with the maximum similarity as a third vocabulary node, acquiring the product of correlation coefficients between the third vocabulary node and the corresponding root node as a second node distance weight, and multiplying the second node distance weight by the similarity with the maximum value to obtain a second weight.
In addition, the matching of the keyword and the character string of the fourth vocabulary node in the preset knowledge graph can be understood as follows: and the characters of the keywords are the same as the characters of the node vocabulary corresponding to the fourth vocabulary node, so that the similarity of the character strings between the keywords and the fourth vocabulary node is determined to be more than 0, at the moment, the product of the correlation coefficients between the fourth vocabulary node and the corresponding root node is determined as a third node distance weight, and the third weight is determined as the product of the similarity of the character strings between the keywords and the fourth vocabulary node and the third node distance weight.
For example: calculating the similarity of character string matching by the following formula:
f(x,y)={1,if x=y;0,other}
where f(x, y) indicates whether the character strings of the keyword x and the vocabulary node y are the same: if the character strings are the same, the value of f(x, y) is determined to be 1; otherwise, the value of f(x, y) is determined to be 0.
For example: the first weight, the second weight and the third weight of each text classification can be fused through the following formulas to obtain a category confidence set:
Class_max = Max(max{c∈(1,C) | G1_c×X1 + G2_c×X2 + G3_c×X3})
wherein Class_max represents the fusion weight; C represents the number of sub-maps in the preset knowledge graph, i.e., the number of text classifications; G1_c represents the first weight corresponding to the c-th sub-map in the preset knowledge graph; G2_c represents the second weight corresponding to the c-th sub-map; G3_c represents the third weight corresponding to the c-th sub-map; X1 represents the weight value of the first weight; X2 represents the weight value of the second weight; X3 represents the weight value of the third weight.
In implementation, the values of X1, X2, and X3 may be preset or adjusted according to the application scenario of the text to be classified; the only requirement is that the sum of X1, X2, and X3 equals 1, for example: X1 = 0.5; X2 = 0.3; X3 = 0.2.
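The per-class fusion can be sketched as follows, with the weight lists and X values as assumed example inputs:

```python
def category_confidence_set(G1, G2, G3, x1=0.5, x2=0.3, x3=0.2):
    """Fuse the three weight sets class by class; Class_max is then the
    maximum element of the resulting category confidence set."""
    assert abs(x1 + x2 + x3 - 1.0) < 1e-9   # X1 + X2 + X3 must equal 1
    fused = [g1 * x1 + g2 * x2 + g3 * x3
             for g1, g2, g3 in zip(G1, G2, G3)]
    return fused, max(fused)
```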
After the category confidence set is obtained, the maximum confidence value of each text classification determined in the keyword group character string matching process may be added to an element corresponding to the text classification in the category confidence set to obtain the fusion weight.
For example: through keyword-group character-string matching, when a keyword group matches the character string of a vocabulary node in a certain sub-map, the confidence set of the text classification corresponding to that sub-map is determined to include the confidence of that keyword group. When a plurality of keyword groups in the text to be classified respectively match the character strings of different vocabulary nodes in the same sub-map, the confidence set of the corresponding text classification includes the confidences corresponding to each of those keyword groups, and the confidence with the maximum value is selected as the final confidence of the text classification corresponding to that sub-map. This is repeated in turn to determine the final confidences of the text classifications respectively corresponding to each sub-map in the preset knowledge graph and to obtain the maximum of these final confidences; when the maximum of the final confidences is greater than the second threshold, the text to be classified is determined to belong to the text classification corresponding to the final confidence with the maximum value.
Of course, based on the first weight, the second weight, the third weight, and the maximum confidence of the text classification corresponding to each sub-map in the preset knowledge graph, the specific implementation of determining the fusion weight corresponding to each sub-map may take multiple forms, for example: performing a weighted summation of the first weight, the second weight, and the third weight, and so on.
As an optional implementation manner, the determining, based on the first weight, the second weight, the third weight, and the confidence of the target text classification, a fusion weight corresponding to each sub-graph spectrum in the preset knowledge graph includes:
normalizing the first weight, the second weight and the third weight;
adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
adding the product of the second weight value after normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
determining a fusion weight value associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value and the third value corresponding to the same text classification.
In a specific implementation, the process of weight fusion may include the following specific steps:
step one, for each text classification, performing step-by-step weight multiplication (continuously multiplying correlation coefficients (namely positive and negative correlation coefficients) between matched nodes and root nodes) respectively according to distances and correlation coefficients between nodes matched with keywords or keyword groups or character strings of the keywords in a sub-map corresponding to the text classification to obtain node distance weights (namely respectively solving a first node distance weight, a second node distance weight and a third node distance weight);
step two, respectively calculating sigmoid of the first weight, the second weight and the third weight (carrying out normalization processing), and then multiplying the normalized weights by corresponding node distance weights (the first weight corresponds to the first node distance weight, the second weight corresponds to the second node distance weight, and the third weight corresponds to the third node distance weight);
and step three, combining the text classification corresponding to the C _ max and the confidence coefficient of the text classification to obtain a fusion weight of the text classification.
In this embodiment, when the confidence of the target text classification is less than or equal to the first threshold, i.e., the confidences of the text classifications corresponding to all sub-maps in the preset knowledge graph are less than or equal to the first threshold, this indicates either that the keyword groups extracted from the text to be classified cannot accurately describe its semantics, or that no vocabulary in the preset knowledge graph coincides with the extracted keyword groups; it is then necessary, as in this embodiment, to match the word vectors of the keywords, the word vectors of the keyword groups, and the character strings of the keywords against the vocabulary nodes in the preset knowledge graph.
The matching results of the word vectors of the keywords, the word vectors of the keyword groups, and the character strings of the keywords against the vocabulary nodes in the preset knowledge graph are respectively fused, which avoids the problem that, when only the character-string matching results of the keywords are considered, other character strings with the same semantics are missed. For example, "bicycle" and "bike" are close in meaning but differ as character strings, so pure string matching would omit the match. In addition, word-vector matching overcomes the inability of string-matching results to express the similarity between synonyms or near-synonyms, while matching on single keywords alone is prone to mismatches.
As can be seen from the above, this embodiment combines the shallow semantic information of the text (i.e., the keywords and keyword groups) with deep semantic association information (i.e., the word vectors) and, based on the association characteristics of the sub-maps in the knowledge graph, fuses the individual weights according to the distance between each matched vocabulary node and the root node, thereby implementing classification prediction for the text to be classified. As a result, the relevance between the text to be classified and the text classification corresponding to each sub-map can be discovered more comprehensively, improving the reliability and accuracy of the text classification method.
As an optional implementation, the method further comprises:
obtaining historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
respectively extracting a first keyword and a first keyword group from the at least two historical classified texts, and acquiring word vectors of the first keyword and the first keyword group;
determining correlation coefficients between word vectors of first keywords and word vectors of first keyword groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node to generate a sub-map corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises a sub-map corresponding to the target classification result.
In implementation, the above-mentioned historical text classification information may be understood as: the text information that has been correctly classified may specifically include a history classified text according to which the text is classified, and a classification result obtained by analyzing the history classified text.
In this embodiment, the text classification method may be divided into two processes, as shown in fig. 2a, first, a knowledge graph is constructed based on historical text classification information to obtain the preset knowledge graph; and then carrying out knowledge reasoning on semantic contents of the text to be classified based on the preset knowledge graph so as to determine a classification result of the text to be classified according to a knowledge reasoning result.
As an alternative embodiment, the first keywords extracted from the target history classified text do not include common keywords, where the common keywords represent keywords having frequencies greater than or equal to a preset frequency in the history classified texts of different classification results respectively.
That the frequency of occurrence of a certain word in a historical classified text is greater than or equal to the preset frequency may mean that the keywords extracted from that historical classified text include the word; in other words, after Chinese word segmentation of a historical classified text, a resulting word whose frequency of occurrence in the text is very low is not determined to be a keyword.
In this embodiment, the common keywords represent keywords included in the historical classified texts of various classification results, and therefore, the common keywords have no reference value for text classification, so that the keywords are deleted from the first keywords, interference of the common keywords having no reference value on the text classification results is avoided, and accuracy of the text classification results can be improved.
The process of determining a sub-graph spectrum in the preset knowledge graph based on the historical text classification information can comprise the following steps:
step one, dividing the history classified text into a positive sample, a negative sample and a background sample.
In implementation, the classification result of part of the historical classified texts may differ from their initial classification result, or may not belong to the target classification; in this case, the historical classified texts may be divided into positive samples, negative samples, and background samples during construction of the sub-map corresponding to the target classification.
The positive sample represents that the classification result of the historical classification text is the same as the initial classification result of the historical classification text and is a target classification; the negative sample indicates that the classification result of the historical classified text is different from the initial classification result of the historical classified text, and the initial classification result of the historical classified text is a target classification; the background sample indicates that neither the classification result of the history classified text nor the initial classification result of the history classified text is the target classification.
For example: when classifying a large number of events of type A, the processed event texts are divided into two parts: one part in which the event information is a type-A event and a worker, after analysis and judgment, confirms it to be a type-A event, i.e., a positive sample; and one part in which the event information is labeled a type-A event but a worker, after analysis and judgment, determines it not to be a type-A event, i.e., a negative sample. The positive samples are positively correlated with the current event category, while the negative samples are negatively correlated with it. Finally, the event texts of all other categories serve as background samples.
For example: as shown in fig. 2e, a large number of correctly classified event texts are partitioned into positive samples, negative samples, and background samples, where the background samples may cover a variety of different classification results.
And step two, respectively preprocessing the positive sample, the negative sample and the background sample.
Here, preprocessing can be understood as Chinese word segmentation and stop-word removal. Specifically, during Chinese word segmentation, the frequency of occurrence and the part of speech of each segmented word in the event text may be counted and labeled, for example: if vocabulary A appears 30 times in the event text and its part of speech is a verb, it is labeled "vocabulary A, 30, v", corresponding to the word, word frequency, and part of speech respectively. The annotation data is then used to train or fine-tune the segmenter, so that a sentence is segmented into "vocabulary A", "vocabulary B", "vocabulary C", and "vocabulary D". In implementation, to improve the accuracy and completeness of Chinese word segmentation, fixed phrases may be added manually to the segmentation dictionary; for example, after "vocabulary AB" is added to the dictionary, the segmentation result of the above sentence becomes: "vocabulary AB", "vocabulary C", "vocabulary D". Stop-word removal means removing mood words, filler words, connectives, and the like from a sentence; in implementation, the stop-word list may also be extended continuously by loading dictionaries. Finally, the keywords and keyword groups are obtained after the stop words are removed.
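The bookkeeping part of the preprocessing can be sketched as below, assuming the Chinese word segmentation itself is done elsewhere and the input is already a list of (word, part-of-speech) pairs; the stop-word list here is a placeholder:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "that"}  # in practice extended by loading dictionaries

def preprocess(tagged_tokens):
    """Drop stop words, then label every surviving word with its frequency
    and part of speech, like the "vocabulary A, 30, v" annotation above."""
    kept = [(w, pos) for w, pos in tagged_tokens if w not in STOP_WORDS]
    freq = Counter(w for w, _ in kept)
    labels = {(w, freq[w], pos) for w, pos in kept}
    return sorted(labels, key=lambda t: -t[1])   # highest word frequency first
```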
Of course, in the case of no fixed phrase, the vocabulary obtained by the chinese segmentation processing and the stop word processing may not include the key phrase.
And thirdly, respectively counting the number and the parts of speech of the keywords in the positive and negative samples, classifying according to the parts of speech of the keywords, and sequencing the keywords in each part of speech according to the word frequency.
And step four, counting the keywords of the texts of each category in the background sample, determining the common keywords that appear across the texts of all categories with the highest word frequency, and deleting those common keywords from the keywords counted respectively in the positive and negative samples.
In this step, in view of the fact that high-frequency words appearing in each type of text do not have the representativeness of the keywords, the method can be used for removing the high-frequency words with strong interference in the positive and negative samples, so that the high-frequency keywords finally retained in the positive and negative samples are words which can really reflect the characteristics of the event types.
And step five, according to the word-frequency ordering of the keywords in the positive samples, cross-combining the top-y keywords of different parts of speech whose ranks are close, to obtain high-frequency word combinations with similar frequencies of occurrence; then reversely verifying each high-frequency word combination against the historical classified text, and determining a high-frequency word combination to be a keyword group when it appears in the original text.
In the step, words with different parts of speech but close frequency of occurrence in the historical classified text are combined in a cross mode to obtain a high-frequency word combination, and the high-frequency word combination is determined to be a key word group only when the sequence of the key words in the high-frequency word combination is verified to be correct and the interval distance is short in the historical classified text in a reverse mode.
For example: as shown in fig. 2f, assuming y = 5, the keywords of each part of speech are ordered by word frequency to obtain the verb sequence v1, v2, v3, v4, v5 and the noun sequence n1, n2, n3, n4, n5. The high-frequency vocabulary combinations obtained by cross-combining the verbs in the verb sequence with the nouns in the noun sequence include v1n1, v1n2, v1n3, and so on; finally, after reverse verification, the resulting keyword groups include v1n1, v3n4, v4n5, and so on.
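The cross-combination step can be sketched as below; "close word frequency" is approximated here by closeness of rank in the two frequency-sorted lists, which is an assumption:

```python
def cross_combine(verbs, nouns, y=5, max_rank_gap=1):
    """Cross-combine the top-y verbs and top-y nouns (each already sorted
    by word frequency), pairing only words whose ranks are close."""
    combos = []
    for i, v in enumerate(verbs[:y]):
        for j, n in enumerate(nouns[:y]):
            if abs(i - j) <= max_rank_gap:  # rank gap as frequency proxy
                combos.append(v + n)
    return combos
```

Each resulting combination would still have to pass reverse verification against the original text before being accepted as a keyword group.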
And step six, obtaining the keyword and the word vector of the keyword group, and adding the keyword group into the word segmentation dictionary so as to directly determine the character string as the keyword group in the subsequent application.
The process of obtaining the word vector of the keyword in this step is the same as the process of obtaining the word vector of a certain vocabulary in the prior art, and is not specifically described here, and the word vector of the keyword group is related to the word vector of the keyword, which may specifically refer to the process of determining the word vector of the keyword group in the embodiment shown in fig. 2 c.
And seventhly, respectively constructing sub-maps corresponding to each text classification.
Each sub-map takes its corresponding text classification as the root node and, according to part of speech, takes the highest-frequency keywords (i.e., the center words) and the keyword groups as first-level sub-nodes; the attributes of each node in the sub-map include: part of speech, word frequency, and word vector. The similarity between the word vector of a center word and those of the other keywords is then calculated in turn, and when the similarity is greater than a threshold of 0.9, the other keyword is determined to be a related word (near-synonym); the related word is thus directly connected to the center word in the sub-map, with the word-vector similarity as the attribute of the connection. Proceeding recursively, the connection relationships between the first-level sub-nodes and their related words are determined; the related words then serve as second-level nodes, the similarity between the word vectors of the remaining keywords and those of the second-level nodes is calculated, and the connection relationships between the second-level nodes and the other keywords are determined based on that similarity, until the word-vector similarity of every keyword has been determined, completing the construction of the sub-map corresponding to the text classification.
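The level-by-level linking just described can be sketched as a breadth-first construction; the data layout (a dict of word vectors, edges as triples) is illustrative:

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def build_submap(root, center_word, keywords, vectors, threshold=0.9):
    """Link a remaining keyword to the current node whenever their
    word-vector similarity exceeds the threshold; the similarity itself is
    stored as the edge attribute, and each linked keyword becomes a node
    of the next level."""
    edges = [(root, center_word, 1.0)]
    frontier = [center_word]
    remaining = [k for k in keywords if k != center_word]
    while frontier:
        node = frontier.pop(0)
        unlinked = []
        for kw in remaining:
            s = cos_sim(vectors[node], vectors[kw])
            if s > threshold:
                edges.append((node, kw, s))
                frontier.append(kw)
            else:
                unlinked.append(kw)
        remaining = unlinked
    return edges
```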
In implementation, the sub-maps corresponding to each text classification are respectively constructed according to the above process to jointly form the preset knowledge map.
In implementation, a positive correlation relationship exists between a keyword in the positive samples and a first-level node in the sub-map corresponding to the current text classification, and the positive correlation coefficient α is expressed by the word-vector similarity (for example, α = cos(word vector of first-level node 1, word vector of the keyword in the positive sample)). A keyword in the negative samples and a first-level node in the sub-map corresponding to the current text classification are in a negative correlation relationship, and the negative correlation coefficient β is expressed by the word-vector similarity minus 1 (β = cos(word vector of first-level node 1, word vector of the keyword in the negative sample) - 1). These positive and negative correlation coefficients serve as the relation weights of the level-by-level connections. For example: see the sub-map of the knowledge graph shown in fig. 2d.
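The two correlation coefficients can be computed as sketched below, with α as the cosine similarity and β as that similarity minus 1 (so β ≤ 0); the function and argument names are illustrative:

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def edge_coefficients(node_vec, positive_kw_vec, negative_kw_vec):
    """alpha: positive correlation with a positive-sample keyword;
    beta: negative correlation with a negative-sample keyword."""
    alpha = cos_sim(node_vec, positive_kw_vec)
    beta = cos_sim(node_vec, negative_kw_vec) - 1.0
    return alpha, beta
```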
The following exemplifies the text classification method provided in the embodiment of the present application, taking a history classification text as an event text as an example:
as shown in fig. 3, the text classification method includes the steps of:
step 301, extracting X key phrases from the event text to be classified.
The process of extracting the key phrase from the event text to be classified is the same as the process of extracting the key phrase from the historical classified text and determining the key phrase based on the key phrase, and the process is not repeated herein.
And step 302, respectively matching the vocabulary nodes of each sub-map in the preset knowledge graph with each keyword group A_i.
Wherein the keyword group A_i represents the i-th keyword group, and the initial value of i is equal to X.
Step 303, when a vocabulary node of a certain sub-map is the same as the character string of the keyword group A_i, determining that the keyword group matches that vocabulary node of the sub-map.
And step 304, determining that the character string of the keyword group A_i matches a vocabulary node of the sub-map C_i.
And step 305, determining that the confidence set of the sub-map C_i includes the confidence of the keyword group A_i.
And step 306, subtracting 1 from the value of i, and judging whether the value of i is less than or equal to 0.
If the determination result in this step is "yes", the update process of the confidence of the sub-map C_i is ended, and step 307 is executed; otherwise, step 302 is repeated, i.e., the next keyword group among the X keyword groups is matched with the vocabulary nodes of the sub-map C_i.
Step 307, judging whether the maximum confidence included in the sub-map C_i is greater than the first threshold.
The maximum confidence included in the sub-map C_i may be expressed as C_max, and the first threshold as thresh.
If the determination result in this step is "yes", the text classification of the sub-map in which the vocabulary node matching the keyword group A_i corresponding to the maximum confidence is located is determined as the classification result of the event text to be classified; otherwise, steps 308 to 311 are respectively performed.
And step 308, determining the first weight under the condition that the word vectors of the keywords extracted from the event text to be classified are matched with the word vectors of the vocabulary nodes.
The process of determining the first weight is the same as that in the embodiment shown in fig. 1 and is not described herein again.
Step 309, determining the second weight under the condition that the word vectors of the keyword groups extracted from the event text to be classified are matched with the word vectors of the vocabulary nodes.
The process of determining the second weight is the same as that in the embodiment shown in fig. 1, and is not described herein again.
And 310, determining a third weight under the condition that the keywords extracted from the event text to be classified are matched with the character strings of the vocabulary nodes.
The process of determining the third weight is the same as that in the embodiment shown in fig. 1, and is not described herein again.
And 311, fusing the first weight, the second weight and the third weight.
In this step, the process of determining the fusion weight corresponding to each sub-map in the preset knowledge graph based on the first weight, the second weight, the third weight, and the maximum confidence of the text classification corresponding to each sub-map is the same as in the embodiment shown in fig. 1 and is not repeated here.
After step 311, in the case that the target fusion weight is greater than the second threshold, it is determined that the text to be classified belongs to the text classification corresponding to the target fusion weight, where the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph.
In the text classification method shown in fig. 3, each keyword group is matched in turn against the character strings of the vocabulary nodes in the preset knowledge graph; if the vocabulary node matching a keyword group lies in the sub-graph corresponding to a certain text classification, that is, the keyword group belongs to that sub-graph, the confidence of the keyword group is taken as the confidence of that text classification. A loop counter i (initialized to X-1, where X is the number of keyword groups) traverses the keyword groups in sequence. The loop first completes the confidence calculation of the text classifications corresponding to all the keyword groups and obtains the maximum value C_max among the classification confidences; if C_max is greater than the first threshold thresh, the text classification corresponding to C_max is output directly as the classification result of the text to be classified. If C_max is not greater than the first threshold thresh, the weights for keyword word-vector similarity matching, keyword-group word-vector similarity matching and keyword character-string matching (i.e., the first weight, the second weight and the third weight) are calculated respectively, forming three weight sets, each of which contains an element for every text classification; finally, the three sets of weights are weighted and fused, and combined with C_max, to determine the final text classification.
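The flow just described can be sketched as follows; all names (`classify`, `keyword_groups`, `subgraphs`, `fallback`) are illustrative and not taken from the patent, and `fallback` stands in for the weight-fusion path of steps 308 to 311.

```python
# Illustrative sketch of the flow above; classify, keyword_groups, subgraphs
# and fallback are hypothetical names, not from the patent text.

def classify(keyword_groups, subgraphs, thresh, fallback):
    """keyword_groups: {group_string: confidence};
    subgraphs: {classification_label: set of vocabulary-node strings}."""
    conf = {}
    # Loop over keyword groups: a group whose string matches a vocabulary
    # node in a sub-graph contributes its confidence to that classification.
    for group, c in keyword_groups.items():
        for label, nodes in subgraphs.items():
            if group in nodes:
                conf[label] = max(conf.get(label, 0.0), c)
    if conf:
        best = max(conf, key=conf.get)
        if conf[best] > thresh:      # C_max greater than thresh: output directly
            return best
    return fallback(conf)            # otherwise fall back to weight fusion

label = classify(
    {"earthquake": 0.9, "rescue": 0.6},
    {"disaster": {"earthquake", "flood"}, "sports": {"match"}},
    thresh=0.8,
    fallback=lambda conf: None,
)
# label == "disaster", since the best confidence 0.9 exceeds thresh
```
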
The weight fusion process may include:
step one: for each text classification, obtain the node distance weight by step-by-step weight multiplication, i.e., continuously multiply the correlation coefficients (positive or negative) between the root node and each node matched by the keyword word vectors, the keyword-group word vectors or the keyword character strings, according to the distance of those nodes from the root node in the sub-graph corresponding to the text classification;
step two: apply a sigmoid function to the first weight, the second weight and the third weight respectively (i.e., normalize them), and then multiply each normalized weight by the corresponding node distance weight;
step three: merge in the text classification corresponding to C_max together with its confidence;
step four: sort the confidences of all the text classifications by magnitude; the classification with the highest confidence that is also greater than the threshold thresh is the text classification to which the text to be classified belongs.
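The four fusion steps above can be sketched as follows, under the assumption that each match is modeled as a classification label, a raw weight, and the list of correlation coefficients along the path from its matched node to the root node; every name here is hypothetical.

```python
import math

# Hypothetical sketch of the four fusion steps; each match is modeled as
# (classification_label, raw_weight, path_coeffs), where path_coeffs are the
# correlation coefficients between the matched node and the root node.

def node_distance_weight(path_coeffs):
    # Step one: continuously multiply the correlation coefficients along
    # the path from the matched node to the root node.
    w = 1.0
    for c in path_coeffs:
        w *= c
    return w

def sigmoid(x):
    # Step two: normalize a raw weight into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def fuse(matches, c_max_label, c_max, thresh):
    # Step three: start from the classification carrying C_max.
    conf = {c_max_label: c_max}
    for label, raw, path in matches:
        score = sigmoid(raw) * node_distance_weight(path)
        conf[label] = max(conf.get(label, 0.0), score)
    # Step four: the classification with the highest confidence wins,
    # provided it is greater than the threshold.
    best = max(conf, key=conf.get)
    return best if conf[best] > thresh else None
```
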
In this embodiment, to address the defect that event classification based only on character-string matching is prone to missed matches or mismatches, the embodiment of the present application uses keyword-group matching as enhanced-information text matching on the basis of keyword matching, and takes the matching of keywords and keyword groups against their corresponding word vectors as a leading factor, so that the finally obtained text classification result is more accurate.
It should be noted that, in the text classification method provided in the embodiments of the present application, the execution subject may be a text classification device, or a control module in the text classification device for executing the text classification method. In the embodiments of the present application, a text classification device executing the text classification method is taken as an example to describe the text classification device provided in the embodiments of the present application.
Referring to fig. 4, which is a structural diagram of a text classification device according to an embodiment of the present application, as shown in fig. 4, the text classification device 400 includes:
a first extraction module 401, configured to extract a keyword group in a text to be classified;
a first determining module 402, configured to determine confidence levels corresponding to M text classifications according to confidence levels corresponding to N key phrases in the to-be-classified texts when N key phrases matched with character strings of vocabulary nodes in a preset knowledge graph exist, where the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one key phrase in the N key phrases, and M and N are integers greater than or equal to 1;
a second determining module 403, configured to determine that the text to be classified belongs to the target text classification when a confidence of a target text classification in the M text classifications is greater than a first threshold, where the target text classification is a text classification with a highest confidence in the M text classifications.
Optionally, the first determining module 402 is specifically configured to:
in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in the preset knowledge graph, determine, for each sub-graph in the M sub-graphs, the maximum confidence corresponding to the at least one keyword group matched with the character strings of the vocabulary nodes in the sub-graph as the confidence of the text classification corresponding to the sub-graph.
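A minimal sketch of this rule, with illustrative data structures (a mapping from keyword group to confidence, and a mapping from classification label to the set of vocabulary-node strings in its sub-graph):

```python
# Minimal illustration of the rule above; data structures are hypothetical.

def subgraph_confidences(group_conf, subgraphs):
    """group_conf: {keyword_group: confidence};
    subgraphs: {classification_label: set of vocabulary-node strings}."""
    result = {}
    for label, nodes in subgraphs.items():
        # Confidences of the keyword groups matching this sub-graph's nodes.
        matched = [c for g, c in group_conf.items() if g in nodes]
        if matched:
            # The maximum becomes the confidence of this classification.
            result[label] = max(matched)
    return result
```
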
Optionally, in a case that the confidence of the target text classification is less than or equal to the first threshold, the text classification apparatus 400 further includes:
a third determining module, configured to determine, based on the first weight, the second weight, the third weight and the confidence of the target text classification, fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
a fourth determining module, configured to determine, in the case that a target fusion weight is greater than a second threshold, that the text to be classified belongs to the text classification corresponding to the target fusion weight, where the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
the first weight is used for representing similarity between word vectors of keywords in the text to be classified and word vectors of vocabulary nodes in the preset knowledge graph, the second weight is used for representing similarity between word vectors of the keyword groups in the text to be classified and word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is used for representing similarity between the keywords in the text to be classified and character strings of the vocabulary nodes in the preset knowledge graph.
Optionally, the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph and a first node distance weight, the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
Optionally, the third determining module includes:
the normalization processing unit is used for performing normalization processing on the first weight, the second weight and the third weight;
the first data processing unit is used for adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
the second data processing unit is used for adding the product of the second weight value after the normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
the third data processing unit is used for adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
a determining unit, configured to determine a fusion weight associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value, and the third value corresponding to a same text classification.
Optionally, the text classification apparatus 400 further includes:
the acquisition module is used for acquiring historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
the second extraction module is used for respectively extracting a first keyword and a first keyword group from the at least two historical classified texts and acquiring word vectors of the first keyword and the first keyword group;
the generation module is used for determining correlation coefficients between word vectors of first key words and word vectors of first key word groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node so as to generate sub-maps corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises sub-maps corresponding to the target classification result.
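A sketch of what the generation module might compute, assuming cosine similarity between word vectors stands in for the patent's unspecified correlation coefficient; `build_subgraph` and its arguments are hypothetical names:

```python
import math

# Sketch of the generation module; cosine similarity is an assumed stand-in
# for the unspecified correlation coefficient, and build_subgraph and its
# arguments are hypothetical names.

def correlation(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_subgraph(root_label, root_vec, term_vectors):
    """root_label: the target classification result (root node);
    term_vectors: {keyword_or_group: word_vector} extracted from the
    historical texts classified under root_label."""
    return {
        "root": root_label,
        # One edge per extracted keyword / keyword group, weighted by its
        # correlation coefficient with the root node's vector.
        "edges": {term: correlation(vec, root_vec)
                  for term, vec in term_vectors.items()},
    }
```
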
Optionally, the first keywords extracted from the target history classified text do not include common keywords, where the common keywords represent keywords whose occurrence frequencies in the history classified texts with different classification results are greater than or equal to a preset frequency.
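The common-keyword filter described here can be sketched as follows, under the assumption that "frequency of occurrence" is counted per classification and that a keyword is common when it is frequent under more than one classification result; all names are illustrative:

```python
from collections import Counter

# Sketch of filtering out "common keywords": terms whose occurrence frequency
# meets or exceeds a preset frequency in historical texts of more than one
# classification. The per-class counting scheme is an assumption.

def filter_common(keywords_by_class, min_freq):
    """keywords_by_class: {classification_label: list of keywords (with repeats)}."""
    per_class = {label: Counter(kws) for label, kws in keywords_by_class.items()}
    common = set()
    for kw in {k for c in per_class.values() for k in c}:
        # Classifications in which this keyword reaches the preset frequency.
        hot = [label for label, c in per_class.items() if c[kw] >= min_freq]
        if len(hot) > 1:  # frequent under different classification results
            common.add(kw)
    return {label: [k for k in kws if k not in common]
            for label, kws in keywords_by_class.items()}
```
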
The text classification device provided in the embodiment of the present application can perform each process in the method embodiments shown in fig. 1 or fig. 3, and can obtain the same beneficial effects, and is not described herein again to avoid repetition.
The text classification device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
Optionally, as shown in fig. 5, an electronic device 500 is further provided in this embodiment of the present application, and includes a processor 501, a memory 502, and a program or an instruction stored in the memory 502 and executable on the processor 501, where the program or the instruction is executed by the processor 501 to implement each process of the foregoing text classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the foregoing text classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the text classification method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order depending on the functions involved; e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of text classification, comprising:
extracting key phrases in the text to be classified;
in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in a preset knowledge graph, determining confidences respectively corresponding to M text classifications according to the confidences corresponding to the N keyword groups, wherein the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one keyword group in the N keyword groups, and M and N are integers greater than or equal to 1;
and under the condition that the confidence of a target text classification in the M text classifications is greater than a first threshold value, determining that the text to be classified belongs to the target text classification, wherein the target text classification is the text classification with the highest confidence in the M text classifications.
2. The method according to claim 1, wherein determining confidence levels corresponding to M text classifications according to the confidence levels corresponding to the N keyword groups comprises:
and aiming at each sub-map in the M sub-maps, determining the maximum confidence corresponding to at least one keyword group matched with the character strings of the vocabulary nodes in the sub-map as the confidence of the text classification corresponding to the sub-map.
3. The method of claim 1, wherein in the case that the confidence level of the target text classification is less than or equal to the first threshold, the method further comprises:
determining fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph based on the first weight, the second weight, the third weight and the confidence of the target text classification;
in the case that a target fusion weight is greater than a second threshold, determining that the text to be classified belongs to the text classification corresponding to the target fusion weight, wherein the target fusion weight is the maximum value among the fusion weights respectively corresponding to the sub-graphs in the preset knowledge graph;
the first weight is used for representing similarity between word vectors of keywords in the text to be classified and word vectors of vocabulary nodes in the preset knowledge graph, the second weight is used for representing similarity between word vectors of the keyword groups in the text to be classified and word vectors of the vocabulary nodes in the preset knowledge graph, and the third weight is used for representing similarity between the keywords in the text to be classified and character strings of the vocabulary nodes in the preset knowledge graph.
4. The method according to claim 3, wherein the first weight is determined according to similarity between word vectors of the keywords in the text to be classified and word vectors of second vocabulary nodes in the preset knowledge graph, and a first node distance weight, and the first node distance weight is determined according to a correlation between the second vocabulary nodes and a root node, and the word vectors of the second vocabulary nodes are matched with the word vectors of the keywords in the text to be classified;
the second weight is determined according to the similarity between the word vector of the keyword group and the word vector of a third word node in the preset knowledge graph and a second node distance weight, the second node distance weight is determined according to the correlation between the third word node and a root node, and the word vector of the third word node is matched with the word vector of the keyword group;
the third weight is determined according to the similarity between the keyword and a character string of a fourth vocabulary node in the preset knowledge graph and a third node distance weight, the third node distance weight is determined according to the correlation between the fourth vocabulary node and a root node, and the fourth vocabulary node character string is matched with the character string of the keyword.
5. The method according to claim 4, wherein the determining a fusion weight corresponding to each sub-graph spectrum in the preset knowledge graph based on the first weight, the second weight, the third weight, and the confidence of the target text classification comprises:
normalizing the first weight, the second weight and the third weight;
adding the product of the first weight value after normalization processing and the first node distance weight value with the confidence coefficient of the target text classification to obtain a first value;
adding the product of the second weight value after normalization processing and the second node distance weight value with the confidence coefficient of the target text classification to obtain a second value;
adding the product of the third weight value after the normalization processing and the third node distance weight value with the confidence coefficient of the target text classification to obtain a third value;
determining a fusion weight value associated with each sub-graph spectrum in the preset knowledge graph based on a maximum value of the first value, the second value and the third value corresponding to the same text classification.
6. The method of claim 1, further comprising:
obtaining historical text classification information, wherein the historical text classification information comprises at least two historical classification texts and classification results corresponding to the at least two historical classification texts;
respectively extracting a first keyword and a first keyword group from the at least two historical classified texts, and acquiring word vectors of the first keyword and the first keyword group;
determining correlation coefficients between word vectors of first keywords and word vectors of first keyword groups extracted from target historical classified texts and the word vectors of the root nodes respectively by taking a target classification result as a root node to generate a sub-map corresponding to the target classification result, wherein the at least two historical classified texts comprise the target historical classified text, the target classification result is a classification result of the target historical classified text, and the preset knowledge map comprises a sub-map corresponding to the target classification result.
7. The method according to claim 6, wherein the first keywords extracted from the target historical classified text do not include common keywords, wherein the common keywords represent keywords respectively having frequencies of occurrence greater than or equal to a preset frequency in the historical classified texts of different classification results.
8. A text classification apparatus, comprising:
the first extraction module is used for extracting key phrases in the text to be classified;
a first determining module, configured to determine, in the case that there are, among the keyword groups in the text to be classified, N keyword groups matched with character strings of vocabulary nodes in a preset knowledge graph, confidences respectively corresponding to M text classifications according to the confidences corresponding to the N keyword groups, wherein the M text classifications correspond to M sub-graphs in the preset knowledge graph, the character strings of the vocabulary nodes in the M sub-graphs are respectively matched with at least one keyword group in the N keyword groups, and M and N are integers greater than or equal to 1;
a second determining module, configured to determine that the text to be classified belongs to the target text classification when a confidence of a target text classification in the M text classifications is greater than a first threshold, where the target text classification is a text classification with a highest confidence among the M text classifications.
9. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the text classification method according to any one of claims 1-7.
10. A readable storage medium, on which a program or instructions are stored which, when executed by a processor, carry out the steps of the text classification method according to any one of claims 1 to 7.
CN202110591719.7A 2021-05-28 2021-05-28 Text classification method and device, electronic equipment and text classification program Active CN113254643B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591719.7A CN113254643B (en) 2021-05-28 2021-05-28 Text classification method and device, electronic equipment and text classification program


Publications (2)

Publication Number Publication Date
CN113254643A true CN113254643A (en) 2021-08-13
CN113254643B CN113254643B (en) 2023-10-27

Family

ID=77185173


Country Status (1)

Country Link
CN (1) CN113254643B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779259A (en) * 2021-11-15 2021-12-10 太平金融科技服务(上海)有限公司 Text classification method and device, computer equipment and storage medium
CN114038542A (en) * 2021-10-12 2022-02-11 吉林医药学院 Medical information sharing method and system based on medical big data
CN114049505A (en) * 2021-10-11 2022-02-15 数采小博科技发展有限公司 Method, device, equipment and medium for matching and identifying commodities
CN117150046A (en) * 2023-09-12 2023-12-01 广东省华南技术转移中心有限公司 Automatic task decomposition method and system based on context semantics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063093A1 (en) * 2014-08-27 2016-03-03 Facebook, Inc. Keyword Search Queries on Online Social Networks
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
WO2021042503A1 (en) * 2019-09-06 2021-03-11 平安科技(深圳)有限公司 Information classification extraction method, apparatus, computer device and storage medium
CN112765357A (en) * 2021-02-05 2021-05-07 北京灵汐科技有限公司 Text classification method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Yi; FENG Zi'en; WAN Xiaoxian: "An Emergency Question-Answering System Based on Knowledge Graph", Computer & Telecommunication, no. 04 *
SUO Hongguang; LIU Yushu; CAO Shuying: "A Keyword Extraction Method Based on Lexical Chains", Journal of Chinese Information Processing, no. 06 *

Also Published As

Publication number Publication date
CN113254643B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Zhang Qikun; Wu Zhenzhi
Inventor before: Zhang Qikun; Wu Zhenzhi