CN114996448A - Text classification method and device based on artificial intelligence, terminal equipment and medium - Google Patents

Text classification method and device based on artificial intelligence, terminal equipment and medium

Info

Publication number
CN114996448A
CN114996448A
Authority
CN
China
Prior art keywords
text
sub
classified
texts
semantic expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210573891.4A
Other languages
Chinese (zh)
Inventor
蒋宏达
陈家豪
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen One Ledger Science And Technology Service Co ltd
Original Assignee
Shenzhen One Ledger Science And Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen One Ledger Science And Technology Service Co ltd filed Critical Shenzhen One Ledger Science And Technology Service Co ltd
Priority to CN202210573891.4A priority Critical patent/CN114996448A/en
Publication of CN114996448A publication Critical patent/CN114996448A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the field of artificial intelligence, and particularly relates to a text classification method, device, terminal device and medium based on artificial intelligence. The method splits a text to be classified into N sub-texts, obtains the token feature vector of each sub-text from a trained semantic model, determines a weight for each sub-text from the similarities between the token feature vectors, determines a target semantic expression of the text to be classified, matches the target semantic expression with the known semantic expressions of classified texts in a database to obtain the known semantic expression matched with the target semantic expression, and determines the classification result of the text to be classified accordingly. By using the semantic model to determine the token feature vectors, deriving the corresponding sub-text weights from them, combining the two into an accurate semantic expression, and matching that expression against existing classification results, the classification result can be determined quickly, reducing computation time while improving the accuracy of text classification.

Description

Text classification method, device, terminal equipment and medium based on artificial intelligence
Technical Field
The invention is applicable to the field of artificial intelligence, and particularly relates to a text classification method, device, terminal device and medium based on artificial intelligence.
Background
Text matching is an important basic problem in natural language processing; various natural language processing tasks can, to a great extent, be abstracted into text matching problems, such as information retrieval, question answering systems, dialogue systems, machine translation and the like.
Current deep text matching models fall into two main types: representation-based models and interaction-based models.
The representation-based model converts each text into a single overall representation vector at the presentation layer and then performs matching; it can greatly reduce the time consumed by online computation, but it lacks interaction information between the texts and easily loses the semantic focus, which reduces the accuracy of text matching. The interaction-based model performs word-level matching first at the input layer and carries out subsequent modeling with the matching result treated as a grayscale image, so it grasps the semantic focus better and models the importance of context better.
Therefore, in a deep text matching scenario, how to improve the accuracy of text matching while reducing the time consumption of calculation becomes an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text classification method, apparatus, terminal device and medium based on artificial intelligence, so as to solve the problems of high computation time consumption and low text matching accuracy.
In a first aspect, an embodiment of the present invention provides a text classification method based on artificial intelligence, where the text classification method includes:
splitting a text to be classified into N sections of sub-texts, and coding each section of sub-text according to a trained semantic model to obtain token feature vectors corresponding to the sub-texts, wherein N is a positive integer;
calculating the similarity of the token feature vectors of any sub-text and all other sub-texts, and determining the weight of the corresponding sub-text according to the sum of the similarity of the token feature vectors of each sub-text and all other sub-texts;
determining the target semantic expression of the text to be classified according to the token feature vector of each section of the sub-text and the weight of the corresponding sub-text and by combining the number of all the sub-texts;
and matching the target semantic expression with the known semantic expression of the classified text in the database to obtain the known semantic expression matched with the target semantic expression, and determining the classification of the corresponding classified text as the classification result of the text to be classified.
In a second aspect, an embodiment of the present invention provides a text classification device based on artificial intelligence, where the text classification device includes:
the token feature acquisition module is used for splitting the text to be classified into N sections of sub-texts, coding each section of sub-text according to the trained semantic model, and acquiring a token feature vector corresponding to the sub-text, wherein N is a positive integer;
the weight calculation module is used for calculating the similarity of the token feature vectors of any sub-text and all other sub-texts, and determining the weight of the corresponding sub-text according to the sum of the similarities of the token feature vectors of each sub-text and all other sub-texts;
the semantic expression module is used for determining the target semantic expression of the text to be classified according to the token feature vector of each section of the sub-text and the weight of the corresponding sub-text and by combining the number of all the sub-texts;
and the text classification module is used for matching the target semantic expression with the known semantic expression of the classified text in the database to obtain the known semantic expression matched with the target semantic expression, and determining the classification of the corresponding classified text as the classification result of the text to be classified.
In a third aspect, an embodiment of the present invention provides a terminal device, where the terminal device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, and the processor implements the text classification method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the text classification method according to the first aspect.
Compared with the prior art, the embodiment of the invention has the following beneficial effects: the text to be classified is split into N sub-texts; the token feature vector of each sub-text is obtained from the trained semantic model; the weight corresponding to each sub-text is determined from the sum of the similarities between its token feature vector and those of all other sub-texts; the token feature vector of each sub-text is multiplied by the corresponding weight; the mean of the token feature vectors is calculated from the multiplication results and the number of sub-texts and determined as the target semantic expression of the text to be classified; the target semantic expression is matched with the known semantic expressions of the classified texts in the database to obtain the known semantic expression matched with the target semantic expression; and the category of the corresponding classified text is determined as the classification result of the text to be classified. By using the semantic model to determine the token feature vectors, deriving the corresponding sub-text weights from them, combining the two into a relatively accurate semantic expression, and matching that expression against existing classification results, the classification result can be determined relatively quickly, reducing computation time while improving the accuracy of text classification.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of an application environment of a text classification method based on artificial intelligence according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text classification method based on artificial intelligence according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text classification device based on artificial intelligence according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present invention and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The embodiment of the invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) refers to theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It should be understood that, the sequence numbers of the steps in the following embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
The text classification method based on artificial intelligence provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, in which a client communicates with a server. The client includes, but is not limited to, a palmtop computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud terminal device, a personal digital assistant (PDA), and other terminal devices. The server side can be implemented by an independent server or by a server cluster composed of a plurality of servers.
Referring to fig. 2, which is a schematic flowchart of a text classification method based on artificial intelligence according to an embodiment of the present invention, where the text classification method may be applied to the client in fig. 1, and the text classification method may include the following steps:
step S201, the text to be classified is divided into N sections of sub-texts, each section of sub-text is coded according to the trained semantic model, and token feature vectors corresponding to the sub-texts are obtained.
The text to be classified may consist of sentences of varying lengths composed of different words and phrases, and may be written in any language. The semantic model may be a semantic extractor, a feature encoder and the like; the trained semantic model can perform semantic extraction and analysis on words, phrases and the like to obtain the coding features corresponding to the analysis objects.
The split sub-texts have a smaller granularity, which makes semantic extraction and analysis with the trained semantic model easier, so that the corresponding coding features can be obtained accurately and efficiently.
After the text to be classified is split into N sub-texts, each sub-text is input into the trained semantic model for semantic feature extraction, and the token feature vector corresponding to the sub-text is obtained. The trained semantic model adopts a self-coding network: the originally input text is reproduced at the output through an encoder-decoder network structure, so that the encoder learns to extract the semantic features of the text. The encoder structure of the self-coding network is used as the trained semantic model in this embodiment to extract the semantic features of the input sub-texts.
Specifically, the self-coding network comprises an encoder and a decoder, wherein the encoder converts input texts into internal feature representations, and the decoder converts the internal feature representations into output texts and ensures that the output texts are consistent with the input texts.
In the training process of the self-encoder, the data set is a large number of texts. For any text, the text is input into the encoder with randomly selected characters masked (set to 0); the encoder performs feature extraction on the masked text, and the decoder decodes the extracted features to output the original input text. A loss function is constructed according to the difference between the input text and the output text, and the parameters in the encoder and the decoder are updated by making the loss function converge to the minimum, improving the accuracy with which the self-coding network extracts text features.
When the loss function is constructed according to the difference between the input text and the output text, the input text and the output text are represented as X-dimensional word vectors through a word vector technique, and the loss function is constructed according to the cosine similarity between the input text word vector and the output text word vector: the larger the cosine similarity, the smaller the difference between the two word vectors, and the smaller the loss function. The dimension X of the word vectors can be set according to actual conditions.
For example, when the input text and the output text are represented as X-dimensional word vectors by the word2vec technique (a conventional word vector technique), with the word vector of the input text denoted R and the word vector of the output text denoted C, the loss function Loss1 of the self-coding network is:

$$ \mathrm{Loss1} = 1 - \frac{R \cdot C}{\lVert R \rVert \, \lVert C \rVert} $$

where R is the word vector of the input text and C is the word vector of the output text.
Since the smaller the difference between the input text and the output text, the smaller the loss function Loss1, the training of the self-coding network is accomplished by updating the parameters in the encoder and the decoder so that Loss1 converges to the minimum.
Since the encoder in the self-coding network extracts the features of the input text, the encoder of the self-coding network is used as the trained semantic model in this embodiment to extract the semantic features of the input sub-texts, and the extracted semantic features are used as the token feature vectors of the corresponding sub-texts.
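For illustration only (not the patented implementation), the following Python/PyTorch sketch shows one way such a self-coding network and its cosine-similarity reconstruction loss could be realized; the GRU architecture, the dimensions and the 15% masking rate are assumptions introduced here for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAutoEncoder(nn.Module):
    """Self-coding network: the encoder maps a (partially masked) sub-text to an
    internal feature vector; the decoder reconstructs word vectors from it."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, embed_dim)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The final hidden state serves as the token feature vector of the sub-text.
        _, h = self.encoder(self.embed(token_ids))
        return h[-1]

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encode(token_ids)
        # Feed the internal feature representation to the decoder at every step.
        dec_in = h.unsqueeze(1).expand(-1, token_ids.size(1), -1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)  # reconstructed word vectors, shape (B, T, embed_dim)

def loss1(r: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Loss1 = 1 - cos(R, C): shrinks as the reconstruction approaches the input."""
    return (1.0 - F.cosine_similarity(r, c, dim=-1)).mean()

# Training step (hypothetical): mask ~15% of the characters to id 0, then
# minimize Loss1 between the input word vectors and the reconstruction.
model = TextAutoEncoder(vocab_size=8000)
ids = torch.randint(1, 8000, (4, 32))                       # a batch of sub-texts
masked = ids.masked_fill(torch.rand(ids.shape) < 0.15, 0)   # set characters to 0
loss = loss1(model.embed(ids), model(masked))
loss.backward()
```

After training, only the `encode` branch would be kept as the trained semantic model.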
Optionally, splitting the text to be classified into N sections of sub-texts includes:
and matching the text to be classified with entries in a known machine dictionary according to a word segmentation algorithm, and determining that the text matched with the entries is a segment of sub-text to obtain N segments of sub-texts.
The word segmentation algorithm comprises a word segmentation method based on character string matching, a word segmentation method based on understanding and a word segmentation method based on statistics. The word segmentation method based on character string matching comprises forward maximum matching, reverse maximum matching and bidirectional maximum matching.
According to different types of texts to be classified, different splitting methods can be adopted to split the texts, and parameters in the splitting methods can be set according to requirements on the lengths and the like of the split texts during use.
In one embodiment, based on the character-string-matching word segmentation method and the forward maximum matching mode, a string of a fixed number of Chinese characters is first selected as the maximum symbol string and matched against the entries in a "sufficiently large" machine dictionary. If the maximum symbol string cannot be matched with any entry in the machine dictionary, one Chinese character is removed from the maximum symbol string and matching against the dictionary entries continues, until a word corresponding to the current symbol string is found in the machine dictionary; the text to be classified is thereby split into N sub-texts. In forward maximum matching, the direction of matching the maximum symbol string against the dictionary entries is from left to right, and characters are removed from the maximum symbol string from right to left.
In another embodiment, based on the character-string-matching word segmentation method and the reverse maximum matching mode, the text to be classified may be matched against the entries in a "sufficiently large" machine dictionary and split into N sub-texts. In reverse maximum matching, the direction of matching the maximum symbol string against the dictionary entries is from right to left, and characters are removed from the maximum symbol string from left to right.
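A minimal sketch of the forward maximum matching described above follows; the dictionary contents, the maximum string length of 6, and the example text are hypothetical.

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 6) -> list:
    """Split text into sub-texts by forward maximum matching: take the longest
    window found in the machine dictionary, removing one trailing character at
    a time when matching fails; unmatched single characters pass through."""
    subtexts, i = [], 0
    while i < len(text):
        window = min(max_len, len(text) - i)
        while window > 1 and text[i:i + window] not in dictionary:
            window -= 1  # remove one character from the right and retry
        subtexts.append(text[i:i + window])
        i += window
    return subtexts

# Hypothetical dictionary; the example text splits into N = 3 sub-texts.
vocab = {"文本", "分类", "方法"}
print(forward_max_match("文本分类方法", vocab))  # ['文本', '分类', '方法']
```

Reverse maximum matching would instead grow and shrink the window from the right-hand end of the text.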
Step S202, calculating the similarity of the token feature vectors of any sub-text and all other sub-texts, and determining the weight of the corresponding sub-text according to the sum of the similarity of the token feature vectors of each sub-text and all other sub-texts.
For any sub-text, the similarity between its token feature vector and those of the other sub-texts is calculated according to a similarity algorithm. The similarity algorithms include the Pearson correlation coefficient, Euclidean distance, Manhattan distance and cosine similarity; for token feature vectors of different dimensions, different similarity algorithms may be used to calculate the similarity between the token feature vectors of the sub-texts.
The similarities between the token feature vectors of each sub-text and all other sub-texts are then summed, the similarity sum corresponding to each sub-text is calculated, and the summation results of all sub-texts are normalized. The greater the similarity between the token feature vectors of a sub-text and the other sub-texts, the deeper the connection between that sub-text and the other sub-texts in the text to be classified, and hence between that sub-text and the text to be classified as a whole; therefore, when the semantic features of the text to be classified are obtained, the greater the similarity sum corresponding to a sub-text, the greater the weight of that sub-text.
Therefore, the normalization value of the sum of the similarity of each sub-text is determined as the weight of the sub-text, and is used for combining with the token feature vector of each sub-text to obtain the semantic features of the text to be classified.
Optionally, determining a weight of the corresponding sub-text according to a sum of similarity of token feature vectors of each segment of sub-text and all other sub-texts, includes:
calculating the similarity of the token feature vectors of the sub-texts and other sub-texts aiming at any sub-text;
and summing the similarity of the token feature vectors of the sub-texts and all other sub-texts, and determining the summation result as the weight of the corresponding sub-text.
Denote the number of sub-texts as N and, for the i-th sub-text (i = 1, 2, …, N), its token feature vector as X_i. The cosine similarity Y_ij between the token feature vector X_i of the i-th sub-text and the token feature vector X_j of the j-th sub-text is calculated as:

$$ Y_{ij} = \frac{X_i \cdot X_j}{\lVert X_i \rVert \, \lVert X_j \rVert} $$

where X_i is the token feature vector of the i-th sub-text and X_j is the token feature vector of the j-th sub-text.
The larger the value of the cosine similarity Y_ij, the greater the semantic similarity between the i-th sub-text and the j-th sub-text.
The cosine similarities Y_i1, Y_i2, …, Y_i(i-1), Y_i(i+1), …, Y_iN between the token feature vector X_i of the i-th sub-text and those of the 1st, 2nd, …, (i-1)-th, (i+1)-th, …, N-th sub-texts are calculated in turn, giving the cosine similarity sum Y_i of the token feature vectors of the i-th sub-text and all other sub-texts:

$$ Y_i = \sum_{j=1,\, j \neq i}^{N} Y_{ij} $$

where N is the number of sub-texts and Y_ij is the cosine similarity between the token feature vector X_i of the i-th sub-text and the token feature vector X_j of the j-th sub-text.
The cosine similarity sums Y_1, Y_2, …, Y_N of the token feature vectors of the 1st, 2nd, …, N-th sub-texts and all other sub-texts are calculated in turn, and the summation results are then used to determine the weights of the corresponding sub-texts.
Optionally, determining the summation result as a weight of the corresponding sub-text includes:
and carrying out normalization processing on the summation result of all the sub texts, and determining the normalization value of each sub text as the weight of the corresponding sub text.
Denote the sum of the cosine similarity sums as Y. The cosine similarity sum corresponding to each sub-text is normalized by calculating its ratio to the total Y, and each normalized value is taken as the weight of the corresponding sub-text.
Specifically, the sum Y of the cosine similarity sums is:

$$ Y = \sum_{i=1}^{N} Y_i $$

where N is the number of sub-texts and Y_i is the cosine similarity sum of the i-th sub-text.
The weight of the i-th sub-text is Q_i:

$$ Q_i = \frac{Y_i}{Y} $$

where Y_i is the cosine similarity sum of the token feature vectors of the i-th sub-text and all other sub-texts, and Y is the sum of the N cosine similarity sums.
The weights Q_1, Q_2, …, Q_N of the 1st, 2nd, …, N-th sub-texts can then be calculated in turn.
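The weight computation of step S202 can be sketched in a few lines of NumPy; the function name and the use of a dense similarity matrix are illustrative choices, not taken from the patent.

```python
import numpy as np

def subtext_weights(X: np.ndarray) -> np.ndarray:
    """X: (N, d) matrix whose rows are the token feature vectors X_1..X_N.
    Returns the weights Q_i = Y_i / Y from normalized cosine-similarity sums."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T) / (norms * norms.T)  # Y_ij for every pair of sub-texts
    np.fill_diagonal(cos, 0.0)           # exclude the j = i terms
    Y_i = cos.sum(axis=1)                # cosine similarity sum per sub-text
    return Y_i / Y_i.sum()               # normalize: Q_i = Y_i / Y
```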
Step S203, determining the target semantic expression of the text to be classified according to the token feature vector of each section of the sub-text and the weight of the corresponding sub-text and by combining the number of all the sub-texts.
The target semantic expression is a vector with the same dimension as the token feature vector of the sub-text, the token feature vector of each segment of the sub-text is used for representing the semantic features of each segment of the sub-text, and the weight of each segment of the sub-text is used for representing the contribution of the semantic features of the segment of the sub-text to the overall semantic features of the text to be classified.
Optionally, determining the target semantic expression of the text to be classified according to the token feature vector of each section of the sub-text and the weight of the corresponding sub-text and by combining the number of all the sub-texts, including:
for any sub-text, multiplying the token feature vector of the sub-text by the corresponding weight to obtain a feature expression result of the corresponding sub-text;
and adding the feature expression results of all the sub texts, dividing the addition result by the number of all the sub texts, and determining the division result as the target semantic expression of the text to be classified.
That is, for each sub-text in the text to be classified, its token feature vector is weighted to obtain the feature expression result of that sub-text, and the feature expression results of the sub-texts are averaged to obtain the target semantic expression of the text to be classified, which represents the overall semantic features of the text to be classified.
For example, for the i-th sub-text with token feature vector X_i and corresponding weight Q_i, multiplying the token feature vector X_i of the sub-text by the corresponding weight Q_i gives the feature expression result G_i of the i-th sub-text:

$$ G_i = Q_i X_i $$

where X_i is the token feature vector of the i-th sub-text and Q_i is the weight corresponding to the i-th sub-text.
The feature expression results G_1, G_2, …, G_N of the 1st, 2nd, …, N-th sub-texts are calculated in turn; adding the feature expression results of all sub-texts and dividing by the number of sub-texts determines the target semantic expression M of the text to be classified:

$$ M = \frac{1}{N} \sum_{i=1}^{N} G_i $$

where N is the number of sub-texts and G_i is the feature expression result of the i-th sub-text.
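Continuing the sketch, the target semantic expression of step S203 is a weighted mean of the token feature vectors; this is again an illustrative rendering of the formulas above, not the patented code.

```python
import numpy as np

def target_semantic_expression(X: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """M = (1/N) * sum_i Q_i * X_i: a vector with the same dimension d as
    each sub-text's token feature vector."""
    G = Q[:, None] * X             # feature expression results G_i = Q_i * X_i
    return G.sum(axis=0) / len(X)  # average over the N sub-texts
```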
Step S204, the target semantic expression is matched with the known semantic expressions of the classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and the category of the corresponding classified text is determined as the classification result of the text to be classified.
The classified texts are texts whose categories are already known. For any classified text, the classified text is treated as a text to be classified, the operations corresponding to steps S201-S203 are performed to determine its known semantic expression, and the category and known semantic expression of the classified text are stored in the database.
By determining the known semantic expression of each classified text in this way, the categories and known semantic expressions of all classified texts are stored in the database. When the text to be classified is classified, the known semantic expressions of the classified texts stored in the database are read.
The target semantic expression of the text to be classified is then matched with the known semantic expressions of the classified texts in the database. The more similar the semantic features of a classified text and the text to be classified, the greater the corresponding degree of matching; the classified text most similar to the text to be classified can therefore be determined through the known semantic expression that best matches the target semantic expression, and the category of that classified text is determined as the classification result of the text to be classified.
Optionally, matching the target semantic expression with the known semantic expressions of the classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and determining the category of the corresponding classified text as the classification result of the text to be classified, includes:
similarity calculation is carried out on the target semantic expression and the known semantic expressions of the classified texts in the database, and the similarity degree of the target semantic expression and each known semantic expression is obtained;
taking the known semantic expression corresponding to the maximum value in the similarity degree as the known semantic expression matched with the target semantic expression;
and determining the classification of the classified text corresponding to the known semantic expression matched with the target semantic expression as the classification result of the text to be classified.
The known semantic expression of each classified text is determined according to the operations corresponding to steps S201-S203, and the category and known semantic expression of each classified text are stored in the database. When the text to be classified is classified, the known semantic expressions of the classified texts stored in the database are read.
The degree of similarity between each known semantic expression and the target semantic expression is then calculated respectively, and the maximum value of these degrees of similarity is determined. The known semantic expression corresponding to the maximum degree of similarity is the known semantic expression matched with the target semantic expression; finally, the classification result of the text to be classified is determined to be the category of the classified text corresponding to this known semantic expression.
For example, denote the target semantic expression of the text to be classified as D_0, the number of known semantic expressions in the database as M, and the k-th (k = 1, 2, …, M) known semantic expression as D_k. Since the closer a known semantic expression is to the target semantic expression, the higher the degree of similarity between them, this embodiment calculates the cosine similarity between the target semantic expression D_0 and the k-th known semantic expression D_k as their degree of similarity T_k:

$$ T_k = \frac{D_0 \cdot D_k}{\lVert D_0 \rVert \, \lVert D_k \rVert} $$

where D_0 is the target semantic expression of the text to be classified and D_k is the k-th known semantic expression in the database.
The degrees of similarity T_1, T_2, …, T_M between the target semantic expression and the 1st, 2nd, …, M-th known semantic expressions D_1, D_2, …, D_M in the database are calculated in turn; the known semantic expression corresponding to the maximum degree of similarity is determined, and the classified text it belongs to is identified, so that the category of that classified text can be taken as the classification result of the text to be classified, completing the text classification of the text to be classified.
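A corresponding sketch of this matching step is given below; the cosine-similarity choice follows the formula above, while the function name and the category list are hypothetical.

```python
import numpy as np

def classify(D0: np.ndarray, known: np.ndarray, categories: list) -> str:
    """known: (M, d) matrix of known semantic expressions D_1..D_M from the
    database; categories[k] is the category of the k-th classified text."""
    sims = known @ D0 / (np.linalg.norm(known, axis=1) * np.linalg.norm(D0))
    k = int(np.argmax(sims))  # index of the best-matching known expression
    return categories[k]      # its category is the classification result
```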
Optionally, matching the target semantic expression with the known semantic expressions of the classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and determining the category of the corresponding classified text as the classification result of the text to be classified, includes:
inputting the target semantic expression of the text to be classified and the known semantic expression of each classified text in the database into the trained twin network model for matching, and outputting the similarity degree of the target semantic expression and each known semantic expression;
taking the known semantic expression corresponding to the maximum value in the similarity degree as the known semantic expression matched with the target semantic expression;
and determining the classification of the classified text corresponding to the known semantic expression matched with the target semantic expression as the classification result of the text to be classified.
Because the twin network is designed for assessing how similar two inputs are, it is a coupled framework built on two artificial neural networks: two samples serve as the inputs, each sub-network receives one of them and outputs its representation (an embedding) in a high-dimensional space, and the degree of similarity of the two samples is compared by calculating the distance between the two representations, for example the Euclidean distance.
Therefore, the embodiment uses the twin network to perform similarity calculation between the target semantic expression and the known semantic expression of the classified text in the database, so as to obtain the similarity between the text to be classified and each known semantic expression.
Specifically, in the training process of the twin network, the data set is a large number of text semantic expressions, labeled according to whether each pair of texts matches: if two texts match, the corresponding label is 1; if the two samples do not match, the corresponding label is 0. A contrastive loss function (Contrastive Loss) is adopted, and the loss function of the twin network is constructed from the matching relation between samples and the similarity between samples. In the contrastive loss function, when two input samples match, a larger Euclidean distance between the two sample representations yields a larger contrastive loss, indicating a poorer twin network model. When the two input samples do not match, the similarity between them is small and, normally, the Euclidean distance between the two sample representations is large; therefore, if the Euclidean distance between the two sample representations obtained in the twin network is instead small, the contrastive loss is large, again indicating a poorer model. Hence, during training, the shared parameters in the twin network are updated according to the contrastive loss function until it converges to the minimum, so as to improve the matching accuracy of the twin network model.
For example, in the training process of the twin network, the text semantic expressions in the data set are divided into a plurality of batches, and each batch contains P text semantic expressions.
For the text semantic expressions in each batch, the a-th (a = 1, 2, …, P) text semantic expression and the b-th (b = 1, 2, …, P, b ≠ a) text semantic expression are selected as the inputs of the two sub-networks of the twin network. Through shared parameters, the two sub-networks characterize the two text semantic expressions in a high-dimensional space, obtaining two corresponding text semantic representations L_a and L_b. Both L_a and L_b are vectors; denoting their dimension as w, the text semantic representation L_a is written L_a = [l_a1, l_a2, …, l_aw] and L_b is written L_b = [l_b1, l_b2, …, l_bw].
The Euclidean distance d_ab between the two text semantic representations L_a and L_b is then calculated:

$$ d_{ab} = \sqrt{\sum_{r=1}^{w} (l_{ar} - l_{br})^2} $$

where w is the vector dimension of the text semantic representations L_a and L_b, l_ar is the r-th dimension of L_a, and l_br is the r-th dimension of L_b.
Denoting the maximum value of the Euclidean distance in the twin network model as maxd, the Euclidean distance between the text semantic representations L_a and L_b is normalized by this maximum to obtain the corresponding normalized value d_ab′:

$$ d_{ab}' = \frac{d_{ab}}{\mathrm{maxd}} $$

where d_ab is the Euclidean distance between the text semantic representations L_a and L_b, and maxd is the maximum value of the Euclidean distance in the twin network model.
The contrastive loss function of the twin network is then Loss2, summed over the sample pairs drawn from each batch:

$$ \mathrm{Loss2} = \sum_{(a,b)} \left[ y \,(d_{ab}')^2 + (1 - y)\, \max(d_0 - d_{ab}',\, 0)^2 \right] $$

where P is the number of text semantic expressions in each batch, d_ab′ is the normalized Euclidean distance between the text semantic representations L_a and L_b corresponding to the a-th and b-th text semantic expressions, y is the label corresponding to the a-th and b-th text semantic expressions, and d_0 is a preset Euclidean distance threshold that can be set according to the actual situation.
When the a-th and b-th text semantic expressions match, y = 1, and the contrastive loss term reduces to

$$ (d_{ab}')^2 $$

Correspondingly, the more similar the two text semantic expressions, the smaller the normalized Euclidean distance between the representations L_a and L_b, and the smaller the corresponding loss Loss2. When the a-th and b-th text semantic expressions do not match, y = 0, and the contrastive loss term reduces to

$$ \max(d_0 - d_{ab}',\, 0)^2 $$

Accordingly, when two text semantic expressions are not similar, the smaller the normalized Euclidean distance between L_a and L_b, the lower the accuracy of the twin network, and the larger the corresponding loss Loss2.
And training the text semantic expression in the data set, and continuously updating the shared parameters in the twin network until convergence to obtain a trained twin network model.
Then, when classifying the text to be classified, the target semantic expression D_0 of the text to be classified and the known semantic expression D_k of the k-th classified text stored in the database are used as the inputs of the two sub-networks of the trained twin network, which outputs the degree of similarity between the target semantic expression and the k-th known semantic expression stored in the database, denoted T_k. The degrees of similarity T_1, T_2, …, T_M between the 1st, 2nd, …, M-th known semantic expressions D_1, D_2, …, D_M in the database and the target semantic expression D_0 can thus be obtained in turn; the known semantic expression corresponding to the maximum degree of similarity is determined, and the classified text it belongs to is identified, so that the category of that classified text can be taken as the classification result of the text to be classified, completing the text classification of the text to be classified.
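For illustration, a minimal PyTorch sketch of such a twin (Siamese) network and its contrastive loss follows; the branch architecture, dimensions, margin value and distance-to-similarity conversion are assumptions made here, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinNetwork(nn.Module):
    """Both branches share one set of parameters; the output is the Euclidean
    distance d_ab between the two high-dimensional characterizations."""
    def __init__(self, in_dim: int, rep_dim: int = 64):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, rep_dim))

    def forward(self, d_a: torch.Tensor, d_b: torch.Tensor) -> torch.Tensor:
        l_a, l_b = self.branch(d_a), self.branch(d_b)  # L_a and L_b
        return F.pairwise_distance(l_a, l_b)           # d_ab per pair

def contrastive_loss(d_ab: torch.Tensor, y: torch.Tensor,
                     maxd: float, d0: float = 0.5) -> torch.Tensor:
    """Loss2 with d_ab' = d_ab / maxd: matched pairs (y = 1) are penalized for
    large distances, unmatched pairs (y = 0) for distances below the margin d0."""
    d_norm = d_ab / maxd
    return (y * d_norm.pow(2)
            + (1 - y) * torch.clamp(d0 - d_norm, min=0).pow(2)).mean()

# Hypothetical training step: d_a, d_b are batches of semantic expressions such
# as D_0 and D_k; at classification time, 1 - d_norm could serve as T_k.
net = TwinNetwork(in_dim=256)
d_a, d_b = torch.randn(8, 256), torch.randn(8, 256)
y = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(net(d_a, d_b), y, maxd=10.0)
loss.backward()
```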
In this embodiment, the text to be classified is split into N sub-texts, the token feature vector of each sub-text is obtained from the trained semantic model, the weight corresponding to each sub-text is determined from the sum of the similarities between its token feature vector and those of all other sub-texts, the token feature vector of each sub-text is multiplied by the corresponding weight, the mean of the token feature vectors is calculated from the multiplication results and the number of sub-texts and determined as the target semantic expression of the text to be classified, the target semantic expression is matched with the known semantic expressions of the classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and the category of the corresponding classified text is determined as the classification result of the text to be classified. By using the semantic model to determine the token feature vectors, deriving the corresponding sub-text weights from them, combining the two into a more accurate semantic expression, and matching that expression against existing classification results, the classification result can be determined more quickly, reducing computation time while improving the accuracy of text classification.
Fig. 3 is a block diagram of a structure of a text classification apparatus based on artificial intelligence according to a second embodiment of the present invention, and only shows relevant parts according to the second embodiment of the present invention for convenience of description.
Referring to fig. 3, the text classification apparatus includes:
the token feature obtaining module 31 is configured to split a text to be classified into N sections of sub-texts, encode each section of sub-text according to a trained semantic model, and obtain a token feature vector corresponding to the sub-text, where N is a positive integer;
the weight calculation module 32 is configured to calculate similarity between any sub-text and the token feature vectors of all other sub-texts, and determine a weight of the corresponding sub-text according to a sum of the similarities between each sub-text and the token feature vectors of all other sub-texts;
the semantic expression module 33 is configured to determine a target semantic expression of the text to be classified according to the token feature vector of each segment of the sub-text and the weight of the corresponding sub-text, and by combining the number of all the sub-texts;
and the text classification module 34 is configured to match the target semantic expression with the known semantic expressions of the classified texts in the database, obtain the known semantic expression matched with the target semantic expression, and determine the category of the corresponding classified text as the classification result of the text to be classified.
Optionally, the token feature obtaining module 31 includes:
the text splitting unit is used for matching the text to be classified with entries in a known machine dictionary according to a word segmentation algorithm and splitting the text to be classified into N sections of sub-texts;
optionally, the weight calculating module 32 includes:
the similarity calculation unit is used for calculating the similarity of the token feature vectors of the sub-texts and other sub-texts aiming at any sub-text;
and the weight determining unit is used for summing the similarity of the token feature vectors of the sub-texts and all other sub-texts, and determining the summation result as the weight of the corresponding sub-text.
Optionally, the weight determining unit includes:
and the weight determining subunit is used for normalizing the summation result of all the sub texts and determining the normalization value of each sub text as the weight of the corresponding sub text.
Optionally, the semantic expression module 33 includes:
the feature expression unit is used for multiplying the token feature vector of the sub-text by the corresponding weight to obtain a feature expression result corresponding to the sub-text;
and the semantic determining unit is used for adding the feature expression results of all the sub texts, dividing the addition result by the number of all the sub texts, and determining the division result as the target semantic expression of the text to be classified.
Optionally, the text classification module 34 includes:
the text matching unit is used for obtaining the maximum value of the similarity degree and determining the known semantic expression corresponding to the maximum value of the similarity degree as the known semantic expression matched with the target semantic expression;
and the text classification unit is used for determining the classification of the classified text corresponding to the known semantic expression matched with the target semantic expression as the classification result of the text to be classified.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules are based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be referred to specifically in the method embodiment section, and are not described herein again.
Fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present invention. As shown in fig. 4, the terminal device of this embodiment includes: at least one processor (only one shown in fig. 4), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor when executing the computer program implementing the steps in any of the various text classification method embodiments described above.
The terminal device may include, but is not limited to, a processor, a memory. It will be understood by those skilled in the art that fig. 4 is only an example of a terminal device and does not constitute a limitation of the terminal device, and the terminal device may include more or less components than those shown, or some components may be combined, or different components may be included, such as a network interface, a display screen, an input device, etc.
The Processor may be a CPU, or other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory includes a readable storage medium, an internal memory and the like, wherein the internal memory may be a memory of the terminal device, and the internal memory provides an environment for an operating system and execution of computer readable instructions in the readable storage medium. The readable storage medium may be a hard disk of the terminal device, and in other embodiments, may also be an external storage device of the terminal device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing an operating system, application programs, a BootLoader (BootLoader), data, and other programs, such as program codes of a computer program, and the like. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated; in practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention. For the specific working processes of the units and modules in the above apparatus, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described again here.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunications signal.
All or part of the processes in the methods of the above embodiments of the present invention may also be implemented by a computer program product: when the computer program product runs on a terminal device, the terminal device, in executing it, implements the steps in the above method embodiments.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, a module or a unit may be divided into only one logical function, and may be implemented in other ways, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart in substance from the spirit and scope of the embodiments of the present invention, and are intended to be included within the protection scope of the present invention.

Claims (10)

1. A text classification method based on artificial intelligence is characterized by comprising the following steps:
splitting a text to be classified into N segments of sub-texts, and encoding each segment of sub-text according to a trained semantic model to obtain token feature vectors corresponding to the sub-texts, wherein N is a positive integer;
calculating the similarity between the token feature vectors of any sub-text and those of all other sub-texts, and determining the weight of the corresponding sub-text according to the sum of the similarities between the token feature vectors of each sub-text and those of all other sub-texts;
determining the target semantic expression of the text to be classified according to the token feature vector of each segment of sub-text and the weight of the corresponding sub-text, in combination with the number of all the sub-texts;
and matching the target semantic expression with the known semantic expressions of classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and determining the classification of the corresponding classified text as the classification result of the text to be classified.
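
By way of illustration only (this sketch is not part of the claims), the method of claim 1, as refined by dependent claims 3 to 6, can be rendered in Python roughly as follows. The encoder, the cosine similarity measure, and the (label, vector) database layout are all assumptions; the claim fixes none of them:

import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(sub_texts, encoder, database):
    # sub_texts: the N segments of the text to be classified (assumes N >= 2
    #            so that the weights are well defined).
    # encoder:   any trained semantic model mapping a string to a 1-D numpy
    #            array (a stand-in; the claim names no particular model).
    # database:  list of (class_label, known_semantic_expression) pairs.
    vecs = [encoder(s) for s in sub_texts]  # token feature vectors

    # Claim 3: raw weight of a sub-text = sum of its similarities to
    # all other sub-texts.
    raw = [sum(cosine(v, u) for u in vecs if u is not v) for v in vecs]

    # Claim 4: normalise the summation results into weights.
    total = sum(raw)
    weights = [r / total for r in raw]

    # Claim 5: weighted token feature vectors, summed and divided by N.
    target = sum(w * v for w, v in zip(weights, vecs)) / len(vecs)

    # Claim 6: the class of the best-matching known semantic expression
    # becomes the classification result of the text to be classified.
    label, _ = max(database, key=lambda kv: cosine(target, kv[1]))
    return label
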
2. The text classification method according to claim 1, wherein the splitting the text to be classified into N segments of sub-texts comprises:
and matching the text to be classified with entries in a known machine dictionary according to a word segmentation algorithm, and determining each text span that matches an entry as one segment of sub-text, to obtain the N segments of sub-texts.
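
Claim 2 does not prescribe a particular word segmentation algorithm; forward maximum matching is one common dictionary-matching instantiation, sketched below under that assumption:

def forward_max_match(text, dictionary, max_entry_len=8):
    # Greedy dictionary matching: at each position take the longest entry
    # of `dictionary` that matches; characters matching no entry fall back
    # to single-character sub-texts. max_entry_len is an assumed upper
    # bound on dictionary entry length.
    sub_texts, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_entry_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                sub_texts.append(text[i:j])
                i = j
                break
    return sub_texts

For example, forward_max_match("深圳市民广场", {"深圳", "市民", "广场"}) yields the three sub-texts ["深圳", "市民", "广场"].
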
3. The text classification method according to claim 1, wherein the determining the weight of the corresponding sub-text according to the sum of the similarities between the token feature vectors of each sub-text and those of all other sub-texts comprises:
calculating, for any sub-text, the similarity between the token feature vector of the sub-text and that of each other sub-text;
and summing the similarities between the token feature vector of the sub-text and those of all other sub-texts, and determining the summation result as the weight of the corresponding sub-text.
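
In isolation, claim 3 amounts to one similarity sum per sub-text. Reusing vecs and the assumed cosine() helper from the sketch after claim 1:

# Raw weight of the i-th sub-text: sum of the similarities between its
# token feature vector and the token feature vector of every other one.
raw_weight = [sum(cosine(v_i, v_j) for j, v_j in enumerate(vecs) if j != i)
              for i, v_i in enumerate(vecs)]
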
4. The text classification method according to claim 3, wherein the determining the summation result as the weight of the corresponding sub-text comprises:
and carrying out normalization processing on the summation results of all the sub-texts, and determining the normalized value of each sub-text as the weight of the corresponding sub-text.
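
One natural reading of the normalization in claim 4 is dividing each summation result by the total over all sub-texts so that the weights sum to one; the claim wording does not rule out other normalizations (e.g. softmax). Continuing the previous snippet:

# Normalise the summation results of all sub-texts (claim 4).
total = sum(raw_weight)
weights = [w / total for w in raw_weight]
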
5. The text classification method according to claim 1, wherein the determining the target semantic expression of the text to be classified according to the token feature vector of each segment of sub-text and the weight of the corresponding sub-text, in combination with the number of all the sub-texts, comprises:
for any sub-text, multiplying the token feature vector of the sub-text by the corresponding weight to obtain a feature expression result of the corresponding sub-text;
and adding the feature expression results of all the sub-texts, dividing the addition result by the number of all the sub-texts, and determining the division result as the target semantic expression of the text to be classified.
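
Continuing the same sketch, the pooling of claim 5 is a weighted sum of the token feature vectors divided by the number of sub-texts:

# Target semantic expression of the text to be classified (claim 5).
N = len(vecs)
target = sum(w * v for w, v in zip(weights, vecs)) / N
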
6. The text classification method according to any one of claims 1 to 5, wherein the matching the target semantic expression with the known semantic expressions of classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and determining the classification of the corresponding classified text as the classification result of the text to be classified, comprises:
carrying out similarity calculation on the target semantic expression and the known semantic expressions of classified texts in the database, to obtain the similarity between the target semantic expression and each known semantic expression;
taking the known semantic expression corresponding to the maximum of these similarities as the known semantic expression matched with the target semantic expression;
and determining the classification of the classified text corresponding to the known semantic expression matched with the target semantic expression as the classification result of the text to be classified.
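
The matching step of claim 6, again assuming cosine similarity and the (label, vector) database layout from the earlier sketch:

# Score the target expression against every known semantic expression
# and take the class of the maximum-similarity match (claim 6).
scores = [(label, cosine(target, known)) for label, known in database]
best_label, best_score = max(scores, key=lambda pair: pair[1])
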
7. An artificial intelligence based text classification apparatus, characterized in that the text classification apparatus comprises:
the token feature acquisition module is used for splitting the text to be classified into N segments of sub-texts, encoding each segment of sub-text according to the trained semantic model, and obtaining a token feature vector corresponding to the sub-text, wherein N is a positive integer;
the weight calculation module is used for calculating the similarity between the token feature vectors of any sub-text and those of all other sub-texts, and determining the weight of the corresponding sub-text according to the sum of the similarities between the token feature vectors of each sub-text and those of all other sub-texts;
the semantic expression module is used for determining the target semantic expression of the text to be classified according to the token feature vector of each segment of sub-text and the weight of the corresponding sub-text, in combination with the number of all the sub-texts;
and the text classification module is used for matching the target semantic expression with the known semantic expressions of classified texts in the database to obtain the known semantic expression matched with the target semantic expression, and determining the classification of the corresponding classified text as the classification result of the text to be classified.
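
The apparatus of claim 7 mirrors the method claims module for module. A purely illustrative class layout (the names and the cosine() helper from the earlier sketch are inventions of this sketch, not the patented implementation):

class TextClassifierApparatus:
    def __init__(self, encoder, database):
        self.encoder = encoder      # trained semantic model
        self.database = database    # (class_label, known_expression) pairs

    def acquire_token_features(self, sub_texts):
        # Token feature acquisition module.
        return [self.encoder(s) for s in sub_texts]

    def calculate_weights(self, vecs):
        # Weight calculation module: similarity sums, then normalisation
        # (assumes two or more sub-texts).
        raw = [sum(cosine(v, u) for u in vecs if u is not v) for v in vecs]
        total = sum(raw)
        return [r / total for r in raw]

    def semantic_expression(self, vecs, weights):
        # Semantic expression module: weighted sum divided by N.
        return sum(w * v for w, v in zip(weights, vecs)) / len(vecs)

    def classify_text(self, target):
        # Text classification module: best match in the database.
        label, _ = max(self.database, key=lambda kv: cosine(target, kv[1]))
        return label
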
8. The text classification apparatus according to claim 7, wherein the weight calculation module comprises:
the similarity calculation unit is used for calculating, for any sub-text, the similarity between the token feature vector of the sub-text and that of each other sub-text;
and the weight determining unit is used for summing the similarities between the token feature vector of the sub-text and those of all other sub-texts, and determining the summation result as the weight of the corresponding sub-text.
9. A terminal device, characterized in that the terminal device comprises a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the text classification method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the text classification method according to any one of claims 1 to 6.
Application: CN202210573891.4A · Priority date: 2022-05-25 · Filing date: 2022-05-25 · Title: Text classification method and device based on artificial intelligence, terminal equipment and medium · Status: Pending · Publication: CN114996448A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210573891.4A | 2022-05-25 | 2022-05-25 | Text classification method and device based on artificial intelligence, terminal equipment and medium


Publications (1)

Publication Number | Publication Date
CN114996448A (en) | 2022-09-02

Family

ID=83029902

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202210573891.4A | Pending | CN114996448A (en) | 2022-05-25 | 2022-05-25 | Text classification method and device based on artificial intelligence, terminal equipment and medium

Country Status (1)

Country | Link
CN (1) | CN114996448A (en)

Similar Documents

Publication | Title
CN111444340B (en) Text classification method, device, equipment and storage medium
CN107085581B (en) Short text classification method and device
CN107273503B (en) Method and device for generating parallel text in same language
CN111695352A (en) Grading method and device based on semantic analysis, terminal equipment and storage medium
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN114090823A (en) Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium
CN107862058B (en) Method and apparatus for generating information
CN110956038B (en) Method and device for repeatedly judging image-text content
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN110968697A (en) Text classification method, device and equipment and readable storage medium
CN111858878A (en) Method, system and storage medium for automatically extracting answer from natural language text
CN116483979A (en) Dialog model training method, device, equipment and medium based on artificial intelligence
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN115408488A (en) Segmentation method and system for novel scene text
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN112597299A (en) Text entity classification method and device, terminal equipment and storage medium
CN116680401A (en) Document processing method, document processing device, apparatus and storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN112749554B (en) Method, device, equipment and storage medium for determining text matching degree
CN114996448A (en) Text classification method and device based on artificial intelligence, terminal equipment and medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination