CN112052331A - Method and terminal for processing text information

Info

Publication number
CN112052331A
Authority
CN
China
Prior art keywords
text information
classification
target
word
word vector
Prior art date
Legal status
Pending
Application number
CN201910489950.8A
Other languages
Chinese (zh)
Inventor
彭团民 (Peng Tuanmin)
Current Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN201910489950.8A
Publication of CN112052331A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3335 - Syntactic pre-processing, e.g. stopword elimination, stemming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of computers and provides a method and a terminal for processing text information. The method comprises: acquiring text information to be classified; preprocessing the text information to obtain target text information; inputting the target text information into a trained language representation model for processing, to obtain a target word vector set of the target text information; and inputting the target word vector set into a trained classification model for classification, the classification model outputting the classification information corresponding to the target word vector set. In this scheme, the trained language representation model converts the preprocessed text information into a word vector set whose vectors carry rich semantic information, so the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; and because an already trained language representation model and classification model process the text information, the processing speed is improved.

Description

Method and terminal for processing text information
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method and a terminal for processing text information.
Background
With the rapid development of the internet in this era of big data, the network carries a huge amount of information and data, including text, sound, images, and video. Text here covers media news, science and technology articles, reports, e-mails, technical patents, books, and the like. Compared with image and sound data, text occupies fewer network resources and is easier to upload and download, so most of the information in network resources appears in the form of text. How to organize and manage this information effectively, and how to find the information a user needs quickly, accurately, and comprehensively, is therefore important.
However, existing text classification methods are based on the word2vec ("word to vector") word vector model and the naive Bayes algorithm. When such a method processes text information, the extracted semantic information is limited (each word receives a single, context-independent vector), the processing speed is low, and the classification result is inaccurate. Semantic information is one of the forms in which information is expressed: information with a definite meaning that can eliminate uncertainty about an object.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a terminal for processing text information, to solve the prior-art problems that, when text information is processed based on the word2vec word vector model ("word to vector") and the naive Bayes algorithm, the extracted semantic information is limited, the processing speed is low, and the classification result is inaccurate.
A first aspect of an embodiment of the present invention provides a method for processing text information, including:
acquiring text information to be classified;
removing redundant information in the text information to obtain target text information;
inputting the target text information into a language representation model for processing to obtain a target word vector set of the target text information;
inputting the target word vector set into a trained classification model for classification processing to obtain classification information corresponding to the target word vector set; the input of the classification model is a word vector set in a word vector sample set, and the output of the classification model is classification information corresponding to the word vector set; the classification information is used for representing the classification type of the text information.
A second aspect of an embodiment of the present invention provides a terminal for processing text information, including:
the acquiring unit is used for acquiring text information to be classified;
the removing unit is used for removing redundant information in the text information to obtain target text information;
the processing unit is used for inputting the target text information into a language representation model for processing to obtain a target word vector set of the target text information;
the classification unit is used for inputting the target word vector set into a trained classification model for classification processing to obtain classification information corresponding to the target word vector set; the input of the classification model is a word vector set in a word vector sample set, and the output of the classification model is classification information corresponding to the word vector set; the classification information is used for representing the classification type of the text information.
A third aspect of an embodiment of the present invention provides another terminal for processing text information, comprising a processor, an input device, an output device, and a memory that are connected to one another, where the memory is used to store a computer program supporting the terminal in executing the above method, the computer program comprises program instructions, and the processor is configured to call the program instructions to execute the following steps:
acquiring text information to be classified;
removing redundant information in the text information to obtain target text information;
inputting the target text information into a language representation model for processing to obtain a target word vector set of the target text information;
inputting the target word vector set into a trained classification model for classification processing to obtain classification information corresponding to the target word vector set; the input of the classification model is a word vector set in a word vector sample set, and the output of the classification model is classification information corresponding to the word vector set; the classification information is used for representing the classification type of the text information.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
acquiring text information to be classified;
removing redundant information in the text information to obtain target text information;
inputting the target text information into a language representation model for processing to obtain a target word vector set of the target text information;
inputting the target word vector set into a trained classification model for classification processing to obtain classification information corresponding to the target word vector set; the input of the classification model is a word vector set in a word vector sample set, and the output of the classification model is classification information corresponding to the word vector set; the classification information is used for representing the classification type of the text information.
In the embodiment of the invention, text information to be classified is acquired; the text information is preprocessed to obtain target text information; the target text information is input into a trained language representation model for processing, to obtain a target word vector set of the target text information; and the target word vector set is input into a trained classification model for classification, the classification model outputting the classification information corresponding to the target word vector set. In this scheme, the trained language representation model converts the preprocessed text information into a word vector set whose vectors carry rich semantic information, so the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; and because an already trained language representation model and classification model process the text information, the processing speed is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart illustrating an implementation of a method for processing text information according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of three models for classifying text information;
FIG. 3 is a schematic diagram of the input representation of target text information in the language representation model;
FIG. 4 is a schematic diagram of a classification model;
FIG. 5 is a flowchart illustrating an implementation of a method for processing text information according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating an implementation of a method for processing text information according to yet another embodiment of the present invention;
FIG. 7 is a flowchart illustrating an implementation of a method for processing text information according to yet another embodiment of the present invention;
fig. 8 is a schematic diagram of a terminal for processing text information according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a terminal for processing text information according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for processing text information according to an embodiment of the present invention. The method in this embodiment is executed by a terminal for processing text information; such a terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, and personal digital assistants (PDAs), and may also be a desktop computer or the like. The method of processing text information as shown in fig. 1 may include:
S101: Acquiring text information to be classified.
The text information includes text data, which may be words, phrases, a sentence, or a combination of sentences. The text information may be user personal information, geographic locations, microblog posts, product reviews, news headlines, resumes, client opinions, contract documents, news content, item names, and so on. It should be noted that the above is only an exemplary illustration, and the format and content of the text information are not limited.
When the terminal for processing text information detects a text information classification instruction, it acquires the text information to be classified. The text information classification instruction is an instruction directing the terminal to classify text information; it may be triggered by a user, for example by clicking a text classification option on the terminal. The text information to be classified may be uploaded to the terminal by the user, or the terminal may obtain it according to a file identifier contained in the classification instruction and extract the text information to be classified from the identified text file.
S102: Preprocessing the text information to obtain target text information.
The preprocessing refers to extracting the valid characters from the text information; alternatively, the preprocessing may be removing the redundant information from the text information.
The valid characters are information that has practical meaning and influences the classification of the text information. When the preprocessing extracts the valid characters, the target text information is generated by combining the valid characters in the order in which they were extracted.
The redundant information is information that has no practical meaning and does not influence the classification of the text information. The target text information is generated by combining, in order, the information remaining after the redundant information is removed; this remaining information is exactly the valid characters. For example, if the text information is "all over the world% say# Chinese talk?", the redundant information is "%", "#", and "?"; once these are removed, the remaining words "all over the world say Chinese talk" are the valid characters.
The redundant information may be stop words, punctuation marks, and the like in the text information. Stop words are words without practical meaning, typically determiners, modal particles, adverbs, prepositions, conjunctions, English characters, numbers, and mathematical symbols.
An English character here means a letter that stands alone and has no practical meaning. If a combination of letters carries meaning, it is treated as a valid character and is not removed. For example, English character strings such as CPU, MAC, and HR are kept as valid characters.
The terminal for processing text information can obtain a stop-word list from a local database or from a server. The stop-word list contains determiners, modal particles, prepositions, conjunctions, English characters, numbers, mathematical symbols, punctuation marks, and the like, and the user can adjust it according to the actual situation. The terminal compares the vocabulary in the stop-word list with the vocabulary in the text information and discards every word of the text information that also appears in the stop-word list; the remaining information is then combined to generate the target text information.
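The following is a minimal sketch of this stop-word filtering, assuming a tiny illustrative stop-word list and simple regular-expression tokenization; the patent specifies neither, so both are assumptions for illustration only:
```python
# Minimal sketch of removing redundant information with a stop-word list.
# The list contents and the regex tokenization are illustrative assumptions.
import re

STOP_WORDS = {"%", "#", "?", "!", ","}  # example entries only

def remove_redundant(text, stop_words=STOP_WORDS):
    # Split the text into word-like runs and individual punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Discard every token that also appears in the stop-word list, then
    # recombine the remaining (valid) tokens in their original order.
    return " ".join(t for t in tokens if t not in stop_words)

# remove_redundant("all over the world% say# Chinese talk?")
# -> "all over the world say Chinese talk"
```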
Further, S102 may include S1021-S1022, as follows:
s1021: and extracting effective characters in the text information.
The effective characters can comprise a plurality of words, short sentences and single characters with practical meanings; the terminal processing the text message extracts valid characters in the text message.
S1022: and combining the effective characters to generate target text information.
And combining the effective characters according to the sequence of extracting the effective characters to generate target text information.
For example, the text information is: "he says: all over the world say% chinese talk #, the terminal extracts "other say: the effective characters in the Chinese dialect # o are extracted in sequence all over the world, namely the effective characters are extracted in sequence all over the world, in the Chinese dialect # o, the effective characters are combined to obtain the Chinese dialect all over the world.
S103: Inputting the target text information into a trained language representation model for processing, to obtain a target word vector set of the target text information.
The language representation model is obtained by training on the correspondence between the text information in a sample set and the classification types corresponding to that text information.
The language representation model BERT (Bidirectional Encoder Representations from Transformers) is built from a bidirectional Transformer encoder. It pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers, so the vectors it produces when processing the target text information carry rich semantic information.
As shown in FIG. 2, from left to right are the language representation model (BERT), the OpenAI GPT model, and the ELMo model. All three models can be used to classify text information; E1, E2, ..., En are the input text information and T1, T2, ..., Tn are the corresponding classification results. Trm abbreviates the Transformer model, and LSTM is a recurrent neural network structure. As FIG. 2 shows, OpenAI GPT uses a left-to-right Transformer, ELMo uses independently trained left-to-right and right-to-left network structures, and only BERT uses a bidirectional Transformer in which every processing layer depends on left and right context together; this is why BERT yields rich vector semantic information when processing the target text information.
The target word vector set is obtained by converting the target text information into vectors through the language representation model.
The trained language representation model in this embodiment is obtained by training on a sample set with a machine learning algorithm. The sample set comprises several groups of text information and the classification type corresponding to each group. During training, the input of the language representation model is the text information in the sample set and its output is the corresponding classification type. A cross entropy is calculated from the output classification type and the true classification type (that is, the classification type recorded for the text information in the sample set); the error is propagated back through the layers of the language representation model according to this cross entropy, and the parameters are adjusted, yielding the trained model.
Although the trained language representation model could itself classify the target text information, this embodiment trains a separate classification model in order to improve the accuracy of the classification result. The language representation model is therefore used only to process the target text information and obtain the corresponding target word vector set.
Specifically, the language representation model may include a word segmentation algorithm. The target text information is input into the language representation model, which performs word segmentation on it through the word segmentation algorithm, converts the segmentation results into word vectors, and combines the word vectors in the order in which the segments were produced, generating the target word vector set.
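As an illustration of S103, the sketch below obtains one contextual vector per token with a pretrained BERT encoder. The Hugging Face transformers package and the bert-base-chinese checkpoint are assumptions; the patent only specifies a trained language representation model (BERT):
```python
# Sketch of converting target text information into a target word vector
# set with a pretrained BERT encoder (library and checkpoint are assumed).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def text_to_word_vectors(target_text):
    inputs = tokenizer(target_text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # One contextual vector per token, in input order: the target word
    # vector set of step S103 (shape: tokens x hidden size).
    return outputs.last_hidden_state.squeeze(0)
```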
Fig. 3 is an example of how the language representation model represents the input target text information. [SEP] is a separator: when the input target text information consists of two sentences, they can be separated by the special token [SEP]; [CLS] marks the beginning of the sequence.
The Input row is the input target text information; the position-embedding row is the position representation, whose length corresponds to the maximum sentence length; the segment-embedding row is the segment representation; and the token-embedding row is the representation of each token. The input representation of each token is generated by adding its token embedding, segment embedding, and position embedding together.
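A toy illustration of this sum of embeddings follows; the table sizes and random values are stand-ins for the learned embedding matrices:
```python
# The representation fed to the first Transformer layer is the element-wise
# sum of the token, segment, and position embeddings. Dimensions are toy values.
import numpy as np

vocab_size, max_len, hidden = 100, 16, 8
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))       # sentence A / sentence B
position_table = rng.normal(size=(max_len, hidden))

def input_representation(token_ids, segment_ids):
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])
```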
S104: Inputting the target word vector set into a trained classification model for classification processing; the classification model outputs the classification information corresponding to the target word vector set.
The classification model is obtained by training on the correspondence between the word vector sets in a word vector sample set and the classification information corresponding to those word vector sets; the classification information corresponding to a word vector set in the word vector sample set represents the classification type of the text information.
The trained classification model is obtained by training on a word vector sample set with a machine learning algorithm. The word vector sample set comprises a plurality of word vector sets, each of which can be obtained by removing the redundant information from text information used for training and then converting the remainder into vectors; for the details, refer to the processes above for obtaining target text information from text information and a target word vector set from target text information.
During training, the input of the classification model is a word vector set from the word vector sample set, and its output is the classification information corresponding to that word vector set. Because each word vector set is obtained from text information to be classified through word segmentation, conversion, and related processing, the output of the classification model can be understood as the classification type to which that text information belongs. The classification information represents the classification type of the text information.
The terminal for processing the text information inputs the target word vector set into the trained classification model, and the classification model classifies the target word vector set, obtaining the corresponding classification information. The trained classification model comprises a trained function:
$$P(Y=k\mid x)=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\qquad k=1,2,\ldots,K-1;\qquad P(Y=K\mid x)=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}$$
where x ∈ R^{n+1} and w_k ∈ R^{n+1}; w_k is a known parameter obtained by training; Y represents the classification information; x represents the input target word vector set; exp represents the exponential function with the natural constant e as its base; and K represents the number of classes (levels) of the multi-level classification.
The terminal for processing the text information inputs the target word vector set into the trained classification model, that is, an x value is input; the classification model then outputs a Y value, namely the classification information corresponding to the target word vector set.
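A minimal sketch of evaluating this trained function follows, assuming the parameters w_1, ..., w_{K-1} are stacked into a matrix W; pooling the word vector set into a single feature vector x is left to the caller:
```python
# Sketch of the trained classification function: multinomial logistic
# regression over K classes. W stacks the trained parameters w_k (one row
# per class 1..K-1); x is the feature vector derived from the target word
# vector set, with an extra component absorbing the bias.
import numpy as np

def classify(x, W):
    scores = np.exp(W @ x)                # exp(w_k . x) for k = 1..K-1
    denom = 1.0 + scores.sum()
    probs = np.append(scores / denom, 1.0 / denom)  # class K takes the rest
    return int(np.argmax(probs)) + 1      # 1-based class label Y
```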
Fig. 4 is a schematic diagram of a classification model, which may include a plurality of base classifiers that perform multi-level classification on the target word vector set. As shown in fig. 4, the classification model may include base classifier 0, base classifier 01, base classifier 02, and base classifier 03, which classify the target word vector set layer by layer according to the hierarchy. For example, base classifier 0 is science, and base classifiers 01, 02, and 03 are mathematics, physics, and chemistry respectively; the classification can continue, with base classifier 01 further comprising base classifiers 04, 05, and 06, which are statistics, topology, and functional analysis respectively. The above is merely an exemplary illustration; the number of base classifiers and the number of classification levels in the classification model are not limited.
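A sketch of this layer-by-layer scheme follows; the Node structure and the example labels mirror Fig. 4 and are illustrative assumptions:
```python
# Each node of the class hierarchy holds a base classifier whose prediction
# selects which child classifier runs next, as in Fig. 4.
class Node:
    def __init__(self, label, classifier=None, children=None):
        self.label = label
        self.classifier = classifier    # e.g. the classify() sketch above
        self.children = children or {}  # maps a prediction to a child Node

def classify_hierarchically(x, root):
    node, path = root, []
    while node.children:                # descend one level per prediction
        node = node.children[node.classifier(x)]
        path.append(node.label)
    return path                         # e.g. ["mathematics", "topology"]
```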
In the embodiment of the invention, text information to be classified is acquired; the text information is preprocessed to obtain target text information; the target text information is input into a trained language representation model for processing, to obtain a target word vector set of the target text information; and the target word vector set is input into a trained classification model for classification, the classification model outputting the classification information corresponding to the target word vector set. In this scheme, the trained language representation model converts the preprocessed text information into a word vector set whose vectors carry rich semantic information, so the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; and because an already trained language representation model and classification model process the text information, the processing speed is improved.
Example two
Referring to fig. 5, fig. 5 is a schematic flowchart of a method for processing text information according to another embodiment of the present invention. The method in this embodiment is executed by a terminal for processing text information; such a terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, and personal digital assistants (PDAs), and may also be a desktop computer or the like. The method of processing text information as shown in fig. 5 may include:
S201: Acquiring text information to be classified.
In this embodiment, S201 is identical to S101 in the previous embodiment; refer to the description of S101 there, which is not repeated here.
S202: Preprocessing the text information to obtain target text information.
In this embodiment, S202 is identical to S102 in the previous embodiment; refer to the description of S102 there, which is not repeated here.
S203: Inputting the target text information into a trained language representation model for processing, to obtain a target word vector set of the target text information.
The language representation model is obtained by training on the correspondence between the text information in a sample set and the classification types corresponding to that text information.
In this embodiment, S203 is identical to S103 in the previous embodiment; refer to the description of S103 there, which is not repeated here.
Further, in order to improve the accuracy of text classification, before the target word vector set is input into the trained classification model for classification and the classification model outputs the corresponding classification information, the method may further include S204-S208, as follows:
S204: Obtaining a training sample set and a test sample set.
The training sample set and the test sample set may be uploaded to the terminal by a user, or may be obtained automatically by the terminal when it receives a model training instruction.
It should be noted that the classification model generates an adversarial model during training; the adversarial model consists of a generative model and a discriminative model. The discriminative model processes and classifies the word vector sets in the training sample set; the generative model creates classification information similar to the real classification results. The two models are trained adversarially together, with the discriminative model judging the accuracy of the classification information produced by the generative model, and the capabilities of both models are gradually strengthened during training.
The training sample set comprises a plurality of word vector sets, and the test sample set comprises a plurality of word vector sets together with the classification information corresponding to each word vector set. In other words, besides the word vector sets, the test sample set also records the classification type to which the text information represented by each word vector set belongs. The test sample set can be used to test the classification accuracy of the trained classification model.
S205: Inputting the training sample set into the classification model to be trained for training.
The training sample set comprises a plurality of word vector sets; these are input into the classification model to be trained, and the model outputs the classification result corresponding to each word vector set in the training sample set.
Further, the terminal for processing the text information may train the classification model to be trained based on a logistic regression model, which includes a preset function:
$$P(Y=k\mid x)=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\qquad k=1,2,\ldots,K-1;\qquad P(Y=K\mid x)=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}$$
where x ∈ R^{n+1} and w_k ∈ R^{n+1}; Y represents the classification information; x represents the input target word vector set; w_k is the preset parameter; and K represents the number of classes (levels) of the multi-level classification.
During training, the word vector sets in the training sample set are input into the classification model to be trained in turn, that is, x values are input, and the value of P is calculated from the function above for each value of k. The preset parameters w_k are estimated by the maximum likelihood method or by gradient descent. The maximum likelihood method (MLE, maximum likelihood estimation) is a method of estimating the parameters of a model; gradient descent is an iterative method which, when minimizing a loss function, solves step by step for the minimum of the loss function and the corresponding model parameter values.
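A sketch of fitting the preset parameters by gradient descent on the cross-entropy loss follows. It uses an equivalent full-softmax parameterization (one weight row per class), and the learning rate and epoch count are illustrative assumptions:
```python
# Gradient descent for the logistic regression classifier of S205. X holds
# one feature vector per training word vector set; y holds integer labels.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, num_classes, lr=0.1, epochs=200):
    W = np.zeros((num_classes, X.shape[1]))
    onehot = np.eye(num_classes)[y]
    for _ in range(epochs):
        probs = softmax(X @ W.T)
        grad = (probs - onehot).T @ X / len(X)   # cross-entropy gradient
        W -= lr * grad                           # one gradient descent step
    return W
```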
S206: and when the training times reach a preset threshold value, classifying the classification model of the test sample set input training, and outputting a classification result by the classification model in the training.
The preset threshold is the training frequency of the classification model to be trained set by the user, and the user can set the threshold according to the actual situation without limitation. And when the training times of the classification model to be trained reach a preset threshold value, inputting a plurality of word vector sets in the test sample set into the classification model which is being trained at the moment for classification, and obtaining classification results corresponding to the word vector sets in the test sample set respectively.
It should be noted that the training process of the classification model is repeated until the optimal classification model (the classification model with the accuracy reaching the standard of the test result) is trained. Namely, when the training times reach a preset threshold value, a test is performed (a test sample set is input into a classification model to be trained for classification to obtain a classification result), and if the test result does not reach the standard, the training is continued until the classification model with the test result accuracy reaching the standard is trained.
S207: and determining the accuracy of classification of the classification model in the training according to the classification result.
The test sample set is input into the classification model in training, and the accuracy of the classification model in training can be determined by comparing the results of the classification of the word vector set in the test sample set by the classification model at the moment and whether the classification information corresponding to the word vector set in the test sample set is the same. For example, 10000 word vector sets in the test sample set are calculated, and when the result of classifying the word vector set in the test sample set by the classification model in training is 9023 groups which are the same as the classification information corresponding to the word vector set in the test sample set, the accuracy of the classification model in training is about 90%.
S208: and when the loss function and the accuracy of the classification model in training are both converged, obtaining the trained classification model.
The loss function is a function that maps the value of a random event or its associated random variable to a non-negative real number to represent the "risk" or "loss" of the random event. Accuracy refers to the accuracy of classification model classification in training.
The loss function used in this embodiment is a cross-entropy loss function,
Figure BDA0002086668450000101
a loss value may be calculated based on the cross entropy loss function. And when the loss function and the accuracy are detected to be converged, the model is proved to be trained, and the trained classification model is obtained.
In this embodiment, when both the loss function and the accuracy rate converge, it is determined that the classification model has been trained, so that overfitting can be prevented, and the classification of the trained classification model is more accurate.
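A sketch of this stopping rule follows; the window size and tolerance are assumptions, since the patent does not define convergence numerically:
```python
# Training stops once both the loss and the test accuracy have converged,
# i.e. their recent change stays below a small tolerance.
def has_converged(history, window=5, tol=1e-4):
    """history: list of (loss, accuracy) pairs, one per test round."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    pairs = list(zip(recent, recent[1:]))
    loss_change = max(abs(b[0] - a[0]) for a, b in pairs)
    acc_change = max(abs(b[1] - a[1]) for a, b in pairs)
    return loss_change < tol and acc_change < tol
```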
Further, when the classification accuracy of the trained classification model is at its highest, the preset parameters in the preset function take their optimal values.
S209: Inputting the target word vector set into the trained classification model for classification processing; the classification model outputs the classification information corresponding to the target word vector set.
The classification model is obtained by training on the correspondence between the word vector sets in a word vector sample set and the classification information corresponding to those word vector sets; the classification information corresponding to a word vector set in the word vector sample set represents the classification type of the text information.
The trained classification model is obtained by training on a word vector sample set with a machine learning algorithm. The word vector sample set comprises a plurality of word vector sets, each of which can be obtained by removing the redundant information from text information used for training and then converting the remainder into vectors; for the details, refer to the processes above for obtaining target text information from text information and a target word vector set from target text information.
During training, the input of the classification model is a word vector set from the word vector sample set, and its output is the classification information corresponding to that word vector set. Because each word vector set is obtained from text information to be classified through word segmentation, conversion, and related processing, the output of the classification model can be understood as the classification type to which that text information belongs. The classification information represents the classification type of the text information.
The terminal for processing the text information inputs the target word vector set into the trained classification model, and the classification model classifies the target word vector set, obtaining the corresponding classification information. The trained classification model comprises a trained function:
$$P(Y=k\mid x)=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\qquad k=1,2,\ldots,K-1;\qquad P(Y=K\mid x)=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}$$
where x ∈ R^{n+1} and w_k ∈ R^{n+1}; w_k is a known parameter obtained by training; Y represents the classification information; x represents the input target word vector set; exp represents the exponential function with the natural constant e as its base; and K represents the number of classes (levels) of the multi-level classification.
The terminal for processing the text information inputs the target word vector set into the trained classification model, that is, an x value is input; the classification model then outputs a Y value, namely the classification information corresponding to the target word vector set.
In the embodiment of the invention, text information to be classified is acquired; the text information is preprocessed to obtain target text information; the target text information is input into a trained language representation model for processing, to obtain a target word vector set of the target text information; and the target word vector set is input into a trained classification model for classification, the classification model outputting the classification information corresponding to the target word vector set. In this scheme, the trained language representation model converts the preprocessed text information into a word vector set whose vectors carry rich semantic information, so the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; and because an already trained language representation model and classification model process the text information, the processing speed is improved.
EXAMPLE III
Referring to fig. 6, fig. 6 is a schematic flowchart of a method for processing text information according to still another embodiment of the present invention. The method in this embodiment is executed by a terminal for processing text information; such a terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, and personal digital assistants (PDAs), and may also be a desktop computer or the like. The method of processing text information as shown in fig. 6 may include:
S301: Acquiring text information to be classified.
In this embodiment, S301 is identical to S101 in the embodiment corresponding to fig. 1; refer to the description of S101 there, which is not repeated here.
S302: Preprocessing the text information to obtain target text information.
In this embodiment, S302 is identical to S102 in the embodiment corresponding to fig. 1; refer to the description of S102 there, which is not repeated here.
S303: and extracting the key words in the target text information through the language representation model to obtain a document word set.
The keywords refer to words, phrases and the like generated after the target text information is divided. And carrying out keyword division on the target text information through a language representation model, and combining the divided keywords according to a sequence to generate a document word set.
And performing word segmentation processing on the target text information through a language representation model to obtain a document word set. The word segmentation processing means that a word sequence in the target text information is divided into a plurality of word sequences through a word segmentation algorithm; the document word set is formed by combining all the word sequence sequences after being divided, namely, the document word set is generated by combining all the word sequences according to the arrangement sequence of each word sequence after being divided.
The language representation model can comprise a word segmentation algorithm, and word segmentation processing is carried out on target text information through the word segmentation algorithm to obtain a document word set. The content in the target text information is divided into a plurality of word sequences through a word segmentation algorithm, and the word sequences are combined in sequence to generate a document word set. The word sequence can be words or single words.
Specifically, a dictionary tree can be generated from the dict.txt dictionary carried by the word segmentation algorithm; a directed acyclic graph is generated from the target text information to be segmented and the dictionary tree; the maximum-probability path is searched for in the directed acyclic graph, which determines the way of segmentation; the target text information is segmented accordingly, and the segmentation results are combined in order to generate the document word set.
dict.txt is a dictionary file format consisting of a number of word entries, each of which begins with a word followed by its annotation information. A directed acyclic graph is a directed graph without cycles; that is, if a directed graph cannot return from any vertex to that same vertex by following its edges, the graph is a directed acyclic graph.
In this embodiment, the directed acyclic graph generated from the target text information to be segmented and the dictionary tree contains a number of probability paths. Each probability path contains a number of segments and the frequency of each segment, and the probability of a path can be determined by computing the product of the frequencies of all the segments it contains. The target text information is segmented according to the segments on the maximum-probability path.
Word segmentation can also be performed on the target text information through the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm that finds the Viterbi path most likely to have produced the observed event sequence. For example, Chinese text can be labeled with the four BEMS states: B denotes the start of a word, E its end, M a middle position, and S a single character standing alone. When the text to be segmented is "Chinese words are studied all over the world" (全世界都在学中国话), the BEMS sequence [S, B, E, S, S, S, B, E, S] is obtained; consecutive B...E characters are combined into one word, and each S becomes a word of its own. The B and E positions correspond one-to-one to the characters of the sentence, giving 全/S 世界/BE 都/S 在/S 学/S 中国/BE 话/S, and the sentence is thereby segmented into words.
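A sketch of decoding such a BEMS sequence back into words follows; the example sentence is the one used above:
```python
# Merge consecutive B..E characters into one word; each S stands alone.
def bems_to_words(chars, tags):
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag in ("B", "M"):
            buf += ch
        elif tag == "E":
            words.append(buf + ch)
            buf = ""
    return words

# bems_to_words(list("全世界都在学中国话"),
#               ["S", "B", "E", "S", "S", "S", "B", "E", "S"])
# -> ["全", "世界", "都", "在", "学", "中国", "话"]
```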
Further, S303 may include S3031-S3035, as follows:
S3031: Performing word segmentation on the target text information to obtain a plurality of target word segmentation sets.
Word segmentation means cutting the target text information into a number of segments; in this embodiment, the target text information is segmented in multiple different ways. A target word segmentation set is the combination of the segments produced by one way of cutting. After the target text information has been segmented in several ways, the segments produced by each way are combined in their arrangement order to generate a target word segmentation set; different ways of segmentation thus yield different target word segmentation sets.
This can be understood as dividing the target text information in every possible way, listing the possible segmentations, and generating a corresponding target word segmentation set for each one.
For example, when the target text information is "Chinese words are studied all over the world", several segmentations are possible, and the resulting target word segmentation sets may be: all/world/all/at/school/china/talk; worldwide/metropolitan/school/chinese; all over the world/school/chinese; and so on.
S3032: and generating a dictionary tree through the language representation model, and determining the occurrence frequency of each participle in each target participle set.
The language characterization model may include a segmentation algorithm, through which a dictionary tree is generated, and a frequency of occurrence of each segment in each target segment set is determined. Specifically, the word segmentation algorithm is provided with a dit.txt dictionary, and a dictionary tree can be generated based on the dit.txt dictionary.
And simultaneously acquiring the occurrence frequency of each word in each target word segmentation set, and converting the occurrence frequency of each word into the occurrence frequency of the word.
S3033: generating a directed acyclic graph according to the dictionary tree, each target participle set and the occurrence frequency of each participle; the directed acyclic graph comprises a plurality of probability paths, and each probability path comprises a target word and the occurrence frequency of the target word.
Target participles refer to participles in the probability path. Each probability path may include a plurality of target participles, and a frequency corresponding to each target participle in the probability path.
And generating a directed acyclic graph according to the segmentation mode represented by the target word segmentation set and the occurrence frequency of each word in the target word segmentation set. The directed acyclic graph comprises a plurality of probability paths, each probability path comprises a plurality of target participles and the frequency of each target participle, and the probability of the probability path can be determined by calculating the product of the frequencies of all the target participles in each path.
S3034: determining a word segmentation result based on the language characterization model and the directed acyclic graph.
The language characterization model may include a dynamic programming algorithm, by which a maximum probability path is searched for among probability paths of the directed acyclic graph, and a word segmentation result is determined based on the maximum probability path. The probability of each path can be determined by calculating the product of the frequencies of all target participles included in the path, and when the product of the frequencies of all target participles included in a certain probability path is maximum, the certain probability path is the maximum probability path. The word segmentation result is the expression form of each word in the maximum probability path, and the word segmentation processing can be carried out on the target text information according to the expression form of each word.
Further, S3034 may include S30341-S30343, as follows:
S30341: Calculating the probability value of each probability path according to the frequencies of the target segments it contains.
Each probability path may include a number of target segments and the frequency of occurrence of each of them; the probability value of a path is obtained by computing the product of the frequencies of all the target segments on it.
It is worth noting that when the probability value of a probability path is calculated, only the target segments lying on that path are used. For example, the probability value of probability path A is calculated from the target segments on path A; the probability value of path B from the target segments on path B; and the probability value of path C from the target segments on path C.
S30342: determining a maximum probability path based on the probability value corresponding to each probability path; the maximum probability path is the probability path with the maximum probability value in all probability paths.
The maximum probability path refers to the path with the highest probability value in all probability paths.
Specifically, probability values corresponding to a plurality of probability paths are obtained through calculation, and the probability values are compared, wherein the path with the largest probability value is the maximum probability path.
S30343: and determining a word segmentation result according to the target word segmentation contained in the maximum probability path.
The expression form of each target word segmentation in the maximum probability path is the word segmentation result, and word segmentation processing can be performed on target text information according to the expression form of each target word segmentation.
For example, when the expression form of each target word segmentation in the maximum probability path is "all/world/all/in/school/china/word/", the word segmentation processing can be performed on the target text information according to the division mode of "all/world/all/in/school/china/word/"; when the expression form of each target word segmentation in the maximum probability path is 'all over the world/all/in/school/Chinese words/', the word segmentation processing can be carried out on the target text information according to the division mode of 'all over the world/all/in/school/Chinese words/'.
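A sketch of S30341-S30343 follows: it builds the directed acyclic graph of candidate segments, scores each path by the product of its segment frequencies, and keeps the maximum-probability path by dynamic programming. The toy frequency table stands in for the dictionary (e.g. a dict.txt file), which is an assumption:
```python
# Find the maximum-probability segmentation of a sentence: each path's
# probability is the product of the frequencies of the segments on it.
def best_segmentation(sentence, freq, unk=1e-8):
    n = len(sentence)
    # dag[i] lists every j such that sentence[i:j] is a dictionary word;
    # falling back to a single character keeps the graph connected.
    dag = {i: [j for j in range(i + 1, n + 1) if sentence[i:j] in freq]
              or [i + 1]
           for i in range(n)}
    best = {n: (1.0, [])}              # (probability, segmentation) of suffix
    for i in range(n - 1, -1, -1):     # dynamic programming, right to left
        candidates = [(freq.get(sentence[i:j], unk) * best[j][0],
                       [sentence[i:j]] + best[j][1])
                      for j in dag[i]]
        best[i] = max(candidates, key=lambda c: c[0])
    return best[0][1]                  # segments on the maximum-probability path
```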
S3035: and generating the document word set according to the word segmentation result.
And performing word segmentation processing on the target text information according to the expression form of each word segmentation in the maximum probability path, and combining all word sequences to generate a document word set according to the arrangement sequence of each word sequence after word segmentation.
S304: and converting each document word in the document word set into a word vector respectively.
The language representation model can process the document word set by adopting matrix operation, and each document word in the document word set is converted into a word vector corresponding to the document word set. After conversion, each document word has a corresponding word vector.
S305: and combining all word vectors in the document word set based on the arrangement sequence of each document word in the document word set to obtain a target word vector set.
The target word vector set is formed by sequentially combining all the converted word vectors, namely, all the word vectors are combined to generate a target word vector set according to the arrangement sequence of each converted word vector.
Specifically, each document word in the document word set is correspondingly converted into a word vector, and all the converted word vectors are combined together and sequentially arranged according to the arrangement sequence of each document word in the document word set to generate a target word vector set.
S306: inputting the target word vector set into a trained classification model for classification processing, and outputting classification information corresponding to the target word vector set by the classification model;
the classification model is obtained by training based on the corresponding relation between the word vector set in the word vector sample set and the classification information corresponding to the word vector set; and the classification information corresponding to the word vector set in the word vector sample set is used for representing the classification type of the text information.
The trained classification model is obtained by training a word vector sample set by using a machine learning algorithm. The word vector sample set comprises a plurality of word vector sets, and the word vector sets can be obtained by further performing vector conversion on information obtained by removing redundant information from text information to be trained. Specifically, reference may be made to a process of processing text information to obtain target text information, and processing the target text information to obtain a target word vector set.
In the training process, the input of the trained classification model is a word vector set in the word vector sample set, and the output of the classification model is classification information corresponding to the word vector set. The word vector is obtained by performing processes of word segmentation processing, word segmentation conversion and the like on the text information to be classified, so that the output of the classification model can be understood as the classification type to which the text information to be classified belongs. The classification information is used for representing the classification type of the text information.
And the terminal for processing the text information inputs the target word vector set into the trained classification model, and the classification model classifies the target word vector set to obtain the classification information corresponding to the target word vector set. The trained classification model comprises a trained function:
Figure BDA0002086668450000161
wherein x ∈ Rn+1,wk∈Rn+1W is a known parameter which is trained, Y represents classification information, x represents an input target word vector set, exp represents an exponential function with a natural constant e as a base, and K represents multi-layer classification.
The terminal for processing text information inputs the target word vector set into the trained classification model, that is, a value of x is input, and the classification model outputs a value of Y, namely the classification information corresponding to the target word vector set.
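As an illustrative sketch only (not the patented implementation), the classification step above can be pictured as evaluating the softmax function with the trained parameters; the weight values below are hypothetical, and the pooling of the target word vector set into a single input vector x is an assumption:

```python
import numpy as np

def softmax_classify(x, W):
    """Evaluate P(Y = k | x) = exp(w_k . x) / sum_j exp(w_j . x) and return
    the most probable class together with the class probabilities.
    x : (n+1,) input vector, e.g. a pooled target word vector set with a
        constant 1 appended for the bias term (pooling is an assumption)
    W : (K, n+1) matrix whose k-th row is the trained parameter w_k"""
    scores = W @ x
    scores -= scores.max()                       # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.argmax(probs)), probs

# Hypothetical trained parameters: K = 3 classes, n + 1 = 4 dimensions.
W = np.array([[ 0.2, -0.1,  0.5, 0.0],
              [ 0.7,  0.3, -0.2, 0.1],
              [-0.4,  0.6,  0.1, 0.2]])
x = np.array([1.0, 0.5, -0.3, 1.0])              # last entry: bias constant
label, probs = softmax_classify(x, W)
print(label, probs)                              # class index and P(Y | x)
```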
In the embodiment of the invention, text information to be classified is acquired; the text information is preprocessed to obtain target text information; the target text information is input into a trained language representation model for processing to obtain a target word vector set of the target text information; and the target word vector set is input into a trained classification model for classification, the classification model outputting the classification information corresponding to the target word vector set. In this scheme, the trained language representation model converts the preprocessed text information into a word vector set, so the resulting word vectors carry rich semantic information and the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; moreover, using the trained language representation model and classification model to process the text information improves the processing speed.
Example four
This embodiment illustrates an implementation flow of the present invention. As shown in fig. 7, when text information to be classified is acquired, the text information is first preprocessed, for example by removing redundant information from it or extracting its key characters. The preprocessed data is then input into a trained language representation model for processing, and the information output by the language representation model is input into a classification model for classification to obtain the final classification result.
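The flow of fig. 7 can be summarized with the following sketch; preprocess, language_model, and classifier are hypothetical stand-ins for the stages described above, not the actual trained models:

```python
def classify_text(text, preprocess, language_model, classifier):
    """End-to-end flow of fig. 7: preprocess the text information to be
    classified, convert it into a target word vector set with the trained
    language representation model, then classify that word vector set."""
    target_text = preprocess(text)                  # remove redundant info
    word_vector_set = language_model(target_text)   # language representation
    return classifier(word_vector_set)              # classification model

# Trivial stand-ins, wired together only to show the data flow:
result = classify_text(
    "  The battery drains too fast!!  ",
    preprocess=str.strip,
    language_model=lambda t: t.split(),
    classifier=lambda ws: "complaint" if "battery" in ws else "other",
)
print(result)  # complaint
```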
In this scheme, the trained language representation model converts the preprocessed text information into a word vector set, so the resulting word vectors carry rich semantic information and the classification result obtained by classifying the word vector set with the trained classification model is highly accurate; moreover, using the trained language representation model and classification model to process the text information improves the processing speed.
Referring to fig. 8, fig. 8 is a schematic diagram of a terminal for processing text information according to an embodiment of the present invention. The terminal comprises units for performing the steps in the embodiments corresponding to fig. 1 and fig. 5 to fig. 7; please refer to the descriptions of those embodiments for details. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 8, the terminal 8 for processing text information includes:
an obtaining unit 410, configured to obtain text information to be classified;
the preprocessing unit 420 is used for preprocessing the text information to obtain target text information;
the processing unit 430 is configured to input the target text information into a trained language representation model for processing, so as to obtain a target word vector set of the target text information;
the language characterization model is obtained by training based on the corresponding relation between the text information in the sample set and the classification types corresponding to the text information in the sample set;
the classification unit 440 is configured to input the target word vector set into a trained classification model for classification, where the classification model outputs classification information corresponding to the target word vector set;
the classification model is obtained by training based on the corresponding relation between the word vector set in the word vector sample set and the classification information corresponding to the word vector set; and the classification information corresponding to the word vector set in the word vector sample set is used for representing the classification type of the text information.
Further, the preprocessing unit 420 is specifically configured to:
extracting effective characters in the text information;
and combining the effective characters to generate target text information.
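As one possible reading of "effective characters" (an assumption; the patent does not fix the exact character set), the preprocessing unit could be sketched as:

```python
import re

def extract_effective_characters(text):
    """Keep letters, digits and CJK characters as the 'effective characters',
    drop punctuation and other redundant symbols, and recombine the result
    into the target text information. The retained character classes are an
    illustrative assumption."""
    effective = re.findall(r"[0-9A-Za-z\u4e00-\u9fff]+", text)
    return "".join(effective)

print(extract_effective_characters("Battery--drains &&& too fast!!!"))
# Batterydrainstoofast  (Chinese text, having no spaces, recombines cleanly)
```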
Further, the processing unit 430 includes:
the extraction unit is used for extracting the key words in the target text information through the language representation model to obtain a document word set;
the conversion unit is used for converting each document word in the document word set into a word vector respectively;
and the combining unit is used for combining all word vectors in the document word set based on the arrangement sequence of each document word in the document word set to obtain a target word vector set.
Further, the extraction unit includes:
the word segmentation processing unit is used for carrying out word segmentation processing on the target text information to obtain a plurality of target word segmentation sets;
the first determining unit is used for generating a dictionary tree through the language representation model and determining the occurrence frequency of each participle in each target participle set;
the first generation unit is used for generating a directed acyclic graph according to the dictionary tree, each target participle set and the occurrence frequency of each participle; the directed acyclic graph comprises a plurality of probability paths, and each probability path comprises a target word segmentation and the occurrence frequency of the target word segmentation;
a second determining unit, configured to determine a word segmentation result based on the language representation model and the directed acyclic graph;
and the second generation unit is used for generating the document word set according to the word segmentation result.
Further, the second determining unit is specifically configured to:
respectively calculating the probability value corresponding to each probability path according to the frequency of the target word segmentation contained in each probability path;
determining a maximum probability path based on the probability value corresponding to each probability path; the maximum probability path is the probability path with the maximum probability value in all probability paths;
and determining a word segmentation result according to the target word segmentation contained in the maximum probability path.
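A simplified sketch of the dictionary-based segmentation that the units above describe, assuming a toy frequency dictionary; the use of log probabilities and right-to-left dynamic programming are implementation choices for the sketch, not requirements of the patent:

```python
import math

# Hypothetical dictionary of candidate words and their occurrence frequencies.
freq = {"研究": 20, "研究生": 10, "生命": 15, "命": 5, "的": 50,
        "起源": 12, "研": 3, "究": 2, "生": 8, "起": 4, "源": 3}
total = sum(freq.values())

def max_probability_segmentation(sentence):
    """Build the directed acyclic graph of candidate words, then determine
    the maximum probability path by dynamic programming from the end of
    the sentence backwards, and read off the word segmentation result."""
    n = len(sentence)
    # DAG: for each start index i, the end indices j of dictionary words
    # sentence[i:j]; fall back to a single character when none match.
    dag = {i: ([j for j in range(i + 1, n + 1) if sentence[i:j] in freq]
               or [i + 1]) for i in range(n)}
    # route[i] = (best log-probability of segmenting sentence[i:], next cut)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j], 1) / total) + route[j][0], j)
            for j in dag[i])
    result, i = [], 0
    while i < n:
        j = route[i][1]          # follow the maximum probability path
        result.append(sentence[i:j])
        i = j
    return result

print(max_probability_segmentation("研究生命的起源"))
# ['研究', '生命', '的', '起源'] under the toy frequencies above
```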
Further, the terminal for processing text information further includes:
and the sample set acquisition unit is used for acquiring and acquiring a training sample set and a test sample set.
The training unit is used for inputting the training sample set into a classification model to be trained for training;
the output unit is used for classifying the classification model of the test sample set input training when the training times reach a preset threshold value, and the classification model in the training outputs a classification result;
the accuracy determining unit is used for determining the accuracy of classification of the classification model in the training according to the classification result;
and the classification model generation unit is used for obtaining the trained classification model when the loss function and the accuracy of the classification model in the training are both converged.
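A minimal sketch of the training procedure these units describe; the model interface (train_step, predict), the evaluation threshold, and the convergence tolerance are hypothetical assumptions:

```python
def train_classification_model(model, train_set, test_set,
                               eval_every=100, tol=1e-4):
    """Train the classification model until both its loss and its test
    accuracy converge. `model` is assumed to expose train_step(batch)
    returning a loss and predict(sample) returning a label; these names,
    like eval_every and tol, are illustrative assumptions."""
    prev_loss, prev_acc, step = float("inf"), 0.0, 0
    while True:
        for batch in train_set:
            loss = model.train_step(batch)
            step += 1
            if step % eval_every:       # evaluate only at the preset threshold
                continue
            # Input the test sample set into the classification model under
            # training and determine the classification accuracy.
            correct = sum(model.predict(x) == y for x, y in test_set)
            acc = correct / len(test_set)
            # Both the loss and the accuracy have converged: training done.
            if abs(prev_loss - loss) < tol and abs(prev_acc - acc) < tol:
                return model
            prev_loss, prev_acc = loss, acc
```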
Referring to fig. 9, fig. 9 is a schematic diagram of a terminal for processing text information according to another embodiment of the present invention. As shown in fig. 9, the terminal 5 for processing text information of this embodiment includes: a processor 50, a memory 51 and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above-described method embodiments for processing text information, such as S101 to S104 shown in fig. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the device embodiments, such as the functions of the units 410 to 440 shown in fig. 8.
Illustratively, the computer program 52 may be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to accomplish the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution of the computer program 52 in the terminal 5 for processing text information. For example, the computer program 52 may be divided into an acquisition unit, a preprocessing unit, a processing unit, and a classification unit, each unit having the specific functions described above.
The terminal for processing text information may include, but is not limited to, the processor 50 and the memory 51. It will be appreciated by a person skilled in the art that fig. 9 is only an example of the terminal 5 for processing text information and does not constitute a limitation on it; the terminal may comprise more or fewer components than those shown, combine some components, or use different components. For example, the terminal for processing text information may further comprise input and output devices, network access devices, a bus, and the like.
The processor 50 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or memory of the terminal 5. The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal 5. The memory 51 is used for storing the computer program and the other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of processing textual information, comprising:
acquiring text information to be classified;
preprocessing the text information to obtain target text information;
inputting the target text information into a trained language representation model for processing to obtain a target word vector set of the target text information; the language characterization model is obtained by training based on the corresponding relation between the text information in the sample set and the classification types corresponding to the text information in the sample set;
inputting the target word vector set into a trained classification model for classification processing, and outputting classification information corresponding to the target word vector set by the classification model; the classification model is obtained by training based on the corresponding relation between the word vector set in the word vector sample set and the classification information corresponding to the word vector set; and the classification information corresponding to the word vector set in the word vector sample set is used for representing the classification type of the text information.
2. The method of claim 1, wherein the preprocessing the text information to obtain target text information comprises:
extracting effective characters in the text information;
and combining the effective characters to generate target text information.
3. The method of claim 1, wherein the inputting the target text information into a trained language characterization model for processing to obtain a target word vector set of the target text information comprises:
extracting key words in the target text information through the language representation model to obtain a document word set;
converting each document word in the document word set into a word vector respectively;
and combining all word vectors in the document word set based on the arrangement sequence of each document word in the document word set to obtain a target word vector set.
4. The method of claim 3, wherein extracting the keywords in the target text information through the language representation model to obtain a document word set comprises:
performing word segmentation processing on the target text information to obtain a plurality of target word segmentation sets;
generating a dictionary tree through the language representation model, and determining the occurrence frequency of each segmented word in each target word segmentation set;
generating a directed acyclic graph according to the dictionary tree, each target word segmentation set and the occurrence frequency of each segmented word; the directed acyclic graph comprises a plurality of probability paths, and each probability path comprises a target word segmentation and its occurrence frequency;
determining a word segmentation result based on the language characterization model and the directed acyclic graph;
and generating the document word set according to the word segmentation result.
5. The method of claim 4, wherein determining a segmentation result based on the language characterization model and the directed acyclic graph comprises:
respectively calculating the probability value corresponding to each probability path according to the frequency of the target word segmentation contained in each probability path;
determining a maximum probability path based on the probability value corresponding to each probability path; the maximum probability path is the probability path with the maximum probability value in all probability paths;
and determining a word segmentation result according to the target word segmentation contained in the maximum probability path.
6. The method of any of claims 1 to 5, wherein the trained classification model is trained by:
acquiring a training sample set and a test sample set;
inputting the training sample set into a classification model to be trained for training;
when the number of training iterations reaches a preset threshold, inputting the test sample set into the classification model under training for classification, and outputting a classification result by the classification model under training;
determining the classification accuracy of the classification model under training according to the classification result;
and obtaining the trained classification model when both the loss function and the accuracy of the classification model under training have converged.
7. A terminal for processing text information, comprising:
the acquiring unit is used for acquiring text information to be classified;
the preprocessing unit is used for preprocessing the text information to obtain target text information;
the processing unit is used for inputting the target text information into a trained language representation model for processing to obtain a target word vector set of the target text information;
the language characterization model is obtained by training based on the corresponding relation between the text information in the sample set and the classification types corresponding to the text information in the sample set;
the classification unit is used for inputting the target word vector set into a trained classification model for classification processing, and the classification model outputs classification information corresponding to the target word vector set;
the classification model is obtained by training based on the corresponding relation between the word vector set in the word vector sample set and the classification information corresponding to the word vector set; and the classification information corresponding to the word vector set in the word vector sample set is used for representing the classification type of the text information.
8. The terminal for processing text information according to claim 7, wherein the preprocessing unit is specifically configured to:
extracting effective characters in the text information;
and combining the effective characters to generate target text information.
9. A terminal for processing text information, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN201910489950.8A 2019-06-06 2019-06-06 Method and terminal for processing text information Pending CN112052331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910489950.8A CN112052331A (en) 2019-06-06 2019-06-06 Method and terminal for processing text information


Publications (1)

Publication Number Publication Date
CN112052331A true CN112052331A (en) 2020-12-08

Family

ID=73609787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910489950.8A Pending CN112052331A (en) 2019-06-06 2019-06-06 Method and terminal for processing text information

Country Status (1)

Country Link
CN (1) CN112052331A (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853250A (en) * 2009-04-03 2010-10-06 华为技术有限公司 Method and device for classifying documents
CN104899230A (en) * 2014-03-07 2015-09-09 上海市玻森数据科技有限公司 Public opinion hotspot automatic monitoring system
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN105068996A (en) * 2015-09-21 2015-11-18 哈尔滨工业大学 Chinese participle increment learning method
CN106844647A (en) * 2017-01-22 2017-06-13 南方科技大学 The method and device that a kind of search keyword is obtained
CN106909654A (en) * 2017-02-24 2017-06-30 北京时间股份有限公司 A kind of multiclass classification system and method based on newsletter archive information
CN107590195A (en) * 2017-08-14 2018-01-16 百度在线网络技术(北京)有限公司 Textual classification model training method, file classification method and its device
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN109376240A (en) * 2018-10-11 2019-02-22 平安科技(深圳)有限公司 A kind of text analyzing method and terminal
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN109063217A (en) * 2018-10-29 2018-12-21 广州供电局有限公司 Work order classification method, device and its relevant device in Electric Power Marketing System
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PU, LIMING: "Design and Implementation of a Semantic Classification System for Telecom Fraud", China Excellent Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, vol. 978 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023509257A (en) * 2020-12-10 2023-03-08 平安科技(深▲せん▼)有限公司 Method, apparatus, equipment, and storage medium for predicting polyphonic pronunciation
JP7441864B2 (en) 2020-12-10 2024-03-01 平安科技(深▲せん▼)有限公司 Methods, devices, equipment, and storage media for predicting polyphonic pronunciation
WO2022134805A1 (en) * 2020-12-21 2022-06-30 深圳壹账通智能科技有限公司 Document classification prediction method and apparatus, and computer device and storage medium
CN112597306A (en) * 2020-12-24 2021-04-02 电子科技大学 Travel comment suggestion mining method based on BERT
CN112632222A (en) * 2020-12-25 2021-04-09 海信视像科技股份有限公司 Terminal equipment and method for determining data belonging field
CN112632222B (en) * 2020-12-25 2023-02-03 海信视像科技股份有限公司 Terminal equipment and method for determining data belonging field
CN113011533A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
WO2022227207A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium
CN113011533B (en) * 2021-04-30 2023-10-24 平安科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN113220892A (en) * 2021-06-15 2021-08-06 苏州大学 BERT-based self-adaptive text classification method and device
CN117131845A (en) * 2023-09-01 2023-11-28 四川大学 Resume reconstruction method based on pre-training language model and whitening stylization
CN117131845B (en) * 2023-09-01 2024-04-12 四川大学 Resume reconstruction method based on pre-training language model and whitening stylization

Similar Documents

Publication Publication Date Title
CN112052331A (en) Method and terminal for processing text information
CN110347835B (en) Text clustering method, electronic device and storage medium
CN109800306B (en) Intention analysis method, device, display terminal and computer readable storage medium
CN106951422B (en) Webpage training method and device, and search intention identification method and device
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN112906392B (en) Text enhancement method, text classification method and related device
CN108763539B (en) Text classification method and system based on part-of-speech classification
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN107291775B (en) Method and device for generating repairing linguistic data of error sample
CN111428028A (en) Information classification method based on deep learning and related equipment
CN110222168B (en) Data processing method and related device
KR20200087977A (en) Multimodal ducument summary system and method
CN111709223B (en) Sentence vector generation method and device based on bert and electronic equipment
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
WO2020252935A1 (en) Voiceprint verification method, apparatus and device, and storage medium
CN113836938A (en) Text similarity calculation method and device, storage medium and electronic device
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
US20210312333A1 (en) Semantic relationship learning device, semantic relationship learning method, and storage medium storing semantic relationship learning program
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN110969005A (en) Method and device for determining similarity between entity corpora
CN113239697A (en) Entity recognition model training method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination