CN110019794B - Text resource classification method and device, storage medium and electronic device - Google Patents

Text resource classification method and device, storage medium and electronic device

Info

Publication number
CN110019794B
Authority
CN
China
Prior art keywords
text
target
training
classifier
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711088170.XA
Other languages
Chinese (zh)
Other versions
CN110019794A (en)
Inventor
常卓
范欣
温旭
李探
王枷淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd
Priority to CN201711088170.XA
Publication of CN110019794A
Application granted
Publication of CN110019794B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text resource classification method and device, a storage medium and an electronic device. The method comprises: determining, according to the text word number of a text resource to be classified, the target text type to which the resource belongs among a plurality of text types, wherein the plurality of text types are divided according to a text word number rule; extracting a target feature from the text resource to be classified by using the target feature extraction mode corresponding to the target text type among a plurality of feature extraction modes; and determining the target text topic corresponding to the target feature from the target correspondence associated with the target text type, wherein the target correspondence indicates the correspondence between features of text resources of the target text type and text topics. The method and device solve the technical problem of poor text resource classification accuracy in the prior art.

Description

Text resource classification method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for classifying text resources, a storage medium, and an electronic apparatus.
Background
As a classical natural language processing problem, text classification is applied in many fields, such as news, search, and recommendation systems. Text classification often supplies the fundamental features of these systems, and its quality has a crucial effect on the system as a whole. However, current text classification methods rely on a single classification algorithm, tuning its parameters to find the optimal solution under that one algorithm. This approach classifies text resources with poor accuracy.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a text resource classification method, a text resource classification device, a storage medium and an electronic device, which are used for at least solving the technical problem of poor text resource classification accuracy in the prior art.
According to an aspect of the embodiments of the present invention, there is provided a text resource classification method, including: determining, according to the text word number of a text resource to be classified, the target text type to which the resource belongs among a plurality of text types, wherein the plurality of text types are divided according to a text word number rule; extracting a target feature from the text resource to be classified by using the target feature extraction mode corresponding to the target text type among a plurality of feature extraction modes; and determining the target text topic corresponding to the target feature from the target correspondence associated with the target text type, wherein the target correspondence indicates the correspondence between features of text resources of the target text type and text topics.
According to another aspect of the embodiments of the present invention, there is also provided a text resource classification apparatus, including: a first determining module, configured to determine, according to the text word number of a text resource to be classified, the target text type to which the resource belongs among a plurality of text types, wherein the plurality of text types are divided according to a text word number rule; an extraction module, configured to extract a target feature from the text resource to be classified by using the target feature extraction mode corresponding to the target text type among a plurality of feature extraction modes; and a second determining module, configured to determine the target text topic corresponding to the target feature from the target correspondence associated with the target text type, wherein the target correspondence indicates the correspondence between features of text resources of the target text type and text topics.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program, when run, performs the method described in any one of the above.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the method described in any one of the above by the computer program.
In the embodiment of the invention, the target text type to which the text resource to be classified belongs is determined among a plurality of text types according to the text word number of the resource, the plurality of text types being divided according to a text word number rule; the target feature is extracted from the resource using the target feature extraction mode corresponding to the target text type among a plurality of feature extraction modes; and the target text topic corresponding to the target feature is determined from the target correspondence of the target text type, which indicates the correspondence between features of text resources of that type and text topics. That is, text resources of different lengths suit different feature extraction modes and text classification models. Text resources are therefore divided into a plurality of text types by text word number; each text type corresponds to a feature extraction mode suited to resources of that type, and each text type also corresponds to its own correspondence between features and text topics. When classifying a text resource, the target text type to which it belongs is determined from its text word number, the target feature is extracted using the extraction mode for that type, and the target text topic is determined from the correspondence for that type. The extraction mode and correspondence used in classification thus match the type of the text resource, which improves classification accuracy and overcomes the poor accuracy of text resource classification in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a schematic view of an application environment of an alternative text resource classification method according to an embodiment of the present invention;
FIG. 2 is a schematic view of an application environment of another alternative text resource classification method according to an embodiment of the present invention;
FIG. 3 is a flow chart of an alternative text resource classification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative text resource classification method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another alternative text resource classification method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of yet another alternative text resource classification method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of yet another alternative text resource classification method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative text resource classification apparatus according to an embodiment of the invention;
FIG. 9 is a schematic view of an application scenario of an alternative text resource classification method according to an embodiment of the present invention;
FIG. 10 is a schematic view of an application scenario of another alternative text resource classification method according to an embodiment of the present invention; and
FIG. 11 is a schematic diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiment of the invention, an embodiment of a method for classifying text resources is provided. As an alternative implementation manner, the text resource classification method may be, but is not limited to, applied to an application environment as shown in fig. 1, where a client 102 is connected to a server 104 through a network 106, and the client 102 is configured to send a text resource to be classified to the server 104 through the network 106; the server 104 is configured to determine, according to the number of text words of the text resource to be classified, a target text type to which the text resource to be classified belongs in a plurality of text types, extract a target feature from the text resource to be classified by adopting a target feature extraction mode corresponding to the target text type in a plurality of feature extraction modes, and determine a target text topic corresponding to the target feature from a target correspondence corresponding to the target text type; the text types are text types divided according to a text word number rule, and the target corresponding relation is used for indicating the corresponding relation between the text topic and the characteristics corresponding to the text resource with the text type being the target text type.
Optionally, in this embodiment, as shown in fig. 2, the application environment may further include, but is not limited to, a database 202 connected to the server 104 and used to store text resources and the correspondence between text resources and text topics. The server 104 stores the classified text resources and the correspondence between each text resource and its target text topic in the database 202. When pushing text resources to the client 102, the server 104 determines, from the correspondence stored in the database 202, the text resources corresponding to the text topic to be pushed, obtains them from the database 202, and pushes them to the client 102.
In this embodiment, text resources of different lengths suit different feature extraction modes and text classification models. Text resources are divided into a plurality of text types according to their text word number; each text type corresponds to a feature extraction mode suited to resources of that type and to its own correspondence between features and text topics. When classifying a text resource, the target text type to which it belongs is determined from its text word number, the target feature is extracted using the target feature extraction mode corresponding to that type, and the target text topic is determined from the target correspondence for that type. The extraction mode and correspondence used thus match the type of the text resource, improving classification accuracy and overcoming the poor accuracy of text resource classification in the prior art.
Optionally, in this embodiment, the client may include, but is not limited to, at least one of: cell phones, tablet computers, notebook computers, desktop PCs, digital televisions, and other hardware devices. The network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, and a local area network. The above is merely an example, and the present embodiment is not limited thereto.
According to an embodiment of the present invention, there is provided a text resource classification method, as shown in fig. 3, including:
s302, determining a target text type of a text resource to be classified in a plurality of text types according to the text word number of the text resource to be classified, wherein the plurality of text types are text types divided according to a text word number rule;
s304, extracting target features from the text resources to be classified by adopting a target feature extraction mode corresponding to the target text type in a plurality of feature extraction modes;
s306, determining a target text theme corresponding to the target feature from target corresponding relations corresponding to the target text types, wherein the target corresponding relations are used for indicating corresponding relations between features corresponding to text resources of the target text types and the text theme.
Alternatively, in this embodiment, the above-mentioned text resource classification method may be applied, but not limited to, in a scenario in which text resources are classified. For example: a scene in which a news reading class application classifies news text, a scene in which a search class application classifies web page text, a scene in which a text pushing class application classifies text resources, and so on. The above is merely an example, and there is no limitation in this embodiment.
Alternatively, in the present embodiment, the text word count rule may be, but not limited to, a rule for dividing text types by text word count, and the text word count rule may be a classification rule set according to a distribution of text word count of a text resource stored in a database. For example: the text word count rule may be set to divide the text resources into four categories, type 1, type 2, type 3, and type 4, wherein type 1 includes text resources having a text word count of less than or equal to 50 words, type 2 includes text resources having a text word count of greater than 50 words and less than or equal to 500 words, type 3 includes text resources having a text word count of greater than 500 words and less than or equal to 10000 words, and type 4 includes text resources having a text word count of greater than 10000 words.
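As an illustration only (the patent gives no code), a minimal Python sketch of such a word number rule, using the example thresholds of 50, 500, and 10000 words, might look like this; the function name is hypothetical:

```python
def text_type(word_count: int) -> int:
    """Map a text word number to one of the four example types.
    Thresholds (50 / 500 / 10000 words) follow the example above."""
    if word_count <= 50:
        return 1
    if word_count <= 500:
        return 2
    if word_count <= 10000:
        return 3
    return 4

print(text_type(756))  # -> 3
```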
Alternatively, in the present embodiment, the features of the text may include, but are not limited to: word features, semantic features, and structural features. As shown in FIG. 4, word features may include, but are not limited to, local features and global features. Local features may be extracted in, for example, TF mode or n-gram mode; global features may be extracted in, for example, DF mode, IDF mode, RF mode, Chi-Score mode, or One mode. Word feature extraction may combine local and global features, for example: TF-DF mode, TF-IDF mode, TF-RF mode, TF-Chi-Score mode, and the like. Semantic features, also known as multidimensional features, may be extracted by ways that include, but are not limited to: the LSI method, the LDA method, and the Word2Vec method, where the LSI method can handle synonymy (multiple words with one sense) and feature dimension reduction, the LDA method can handle both synonymy and polysemy (one word with multiple senses), and the Word2Vec method can be used to expand word senses. Structural features may include, but are not limited to, article structures and media structures. Article structures may include, but are not limited to: title, text paragraph, paragraph location, and so on. Media structures may include, but are not limited to: media history distribution, high purity media, regular media, and so forth.
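For illustration, a hedged sketch of one combined local/global word feature: the TF-IDF combination named above can be computed with scikit-learn's TfidfVectorizer (assuming that library; the corpus here is invented), with ngram_range adding n-gram features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the match ended in a draw after extra time",
    "the new phone ships with a faster processor",
]

# TF-IDF combines a local feature (term frequency) with a global one
# (inverse document frequency); ngram_range=(1, 2) adds bigram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # sparse document-by-feature matrix
print(X.shape)
```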
Alternatively, in the present embodiment, the correspondence between the feature corresponding to the text resource and the text topic may be represented using, but not limited to, a natural language classification model. The natural language classification model may include, but is not limited to: a Naive Bayes (NB) classification model, a Support Vector Machine (SVM) model, a decision tree model, a fast text classifier (FastText) model, a maximum entropy (MaxEnt) model, and so forth.
The idea underlying the Naive Bayes (NB) classification model is: for a given item to be classified, compute the probability of each category given that the item appears, and assign the item to the category with the largest probability.
The support vector machine model improves the generalization capability of the learner by seeking the minimum structural risk, minimizing both the empirical risk and the confidence range, so that good statistical rules can be obtained even with a small sample size. In plain terms, it is a two-class classification model whose basic form is a linear classifier with the largest margin in feature space; that is, the learning strategy of the support vector machine is margin maximization, which can finally be converted into solving a convex quadratic programming problem.
The decision tree is a tree structure (binary or non-binary). Each non-leaf node represents a test on a feature attribute, each branch represents the output of that attribute over a range of values, and each leaf node stores a class. Classification with a decision tree starts from the root node, tests the corresponding feature attribute of the item to be classified, and follows the branch matching the attribute's value until a leaf node is reached; the class stored at that leaf node is the decision result.
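To make the model options above concrete, a small sketch (assuming scikit-learn and an invented toy corpus) trains a Naive Bayes, an SVM, and a decision tree classifier on the same word features; this illustrates the model families, not the patent's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = ["the striker scored twice", "the index fell three percent",
         "the team won the cup", "the bank raised its rates"]
labels = ["sports", "finance", "sports", "finance"]

X = TfidfVectorizer().fit_transform(texts)  # word features as input
for model in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier()):
    model.fit(X, labels)                    # same features, three model families
    print(type(model).__name__, model.predict(X[:1]))
```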
Therefore, through the above steps, text resources of different lengths suit different feature extraction modes and text classification models. Text resources are divided into a plurality of text types according to their text word number; each text type corresponds to a feature extraction mode suited to resources of that type and to its own correspondence between features and text topics. When classifying a text resource, the target text type to which it belongs is determined from its text word number, the target feature is extracted using the target feature extraction mode corresponding to that type, and the target text topic is determined from the target correspondence for that type. The extraction mode and correspondence used thus match the type of the text resource, improving classification accuracy and overcoming the poor accuracy of text resource classification in the prior art.
As an alternative, determining, according to the number of text words of the text resource to be classified, a target text type to which the text resource to be classified belongs in the plurality of text types includes:
s1, counting the number of text words of a text resource to be classified;
s2, determining that the target text type to which the text resource to be classified belongs is a first text type under the condition that the text word number is smaller than or equal to the first word number;
s3, determining that the target text type to which the text resource to be classified belongs is a second text type under the condition that the text word number is larger than the first word number and smaller than or equal to the second word number;
and S4, determining that the target text type to which the text resource to be classified belongs is a third text type under the condition that the text word number is larger than the second word number.
Alternatively, in this embodiment, the plurality of text types divided according to text word number may include, but is not limited to, three text types: the first text type may be short text, the second text type may be medium-length text, and the third text type may be long text, where the division thresholds for the text word number may be set according to, but not limited to, the texts actually stored in the database. For example: suppose there are 10000 text resources in the database, where the shortest text is 10 words and the longest is 2000 words; texts of 10 to 500 words account for 35% of the total, texts of 500 to 1550 words account for 30%, and texts of 1550 to 2000 words account for 35%. Then texts of 500 words or fewer can be set as short text, texts of 500 to 1550 words as medium-length text, and texts of more than 1550 words as long text. As shown in FIG. 5, a text resource to be classified is obtained and its text word number is counted to be 756 words; according to the set type-division rule, the resource can be determined to be a medium-length text.
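As a sketch of deriving such thresholds from the stored corpus (an assumption, not from the patent: percentile cut points reproduce the 35%/30%/35% split described above; the word counts here are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
word_counts = rng.integers(10, 2001, size=10_000)  # stand-in for the stored corpus

# Cut points at the 35th and 65th percentiles give roughly the
# 35% / 30% / 35% short / medium / long split of the example.
short_max, medium_max = np.percentile(word_counts, [35, 65])

def text_type(n_words: int) -> str:
    if n_words <= short_max:
        return "short"
    if n_words <= medium_max:
        return "medium"
    return "long"

print(short_max, medium_max, text_type(756))
```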
As an alternative, in the case that the target text type of the text resource to be classified is the first text type, extracting the target feature from the text resource to be classified by using a target feature extraction mode corresponding to the target text type among the feature extraction modes includes:
s1, word segmentation is carried out on text resources to be classified, and a first word segmentation word is obtained;
s2, acquiring first target implicit features corresponding to first word segmentation words from first corresponding relations, wherein the first corresponding relations are corresponding relations between each first word and the implicit features used for representing each first word, and the first words comprise the first word segmentation words;
s3, determining the first target implicit characteristic as a target characteristic.
Alternatively, in this embodiment, for a text resource of the first text type, whose text word number is small, the first word segmentation words obtained by segmenting it may be too few to fully represent the features of the text. Therefore, the implicit features corresponding to the first word segmentation words are obtained from the first correspondence, which records the correspondence between each first word and the implicit feature used for characterizing it, and those implicit features are taken as the target features of the text resource.
In an alternative embodiment, the implicit features may be obtained by training a Word2vec model with a large number of historical articles in a news base as the corpus. After the historical articles are word-segmented and stop words are removed, keywords are extracted and used to train the Word2vec model, which converts each keyword into a vector representation; this representation can express the similarity between words. The word vector representation serves as the implicit feature.
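A hedged sketch of this step, assuming the gensim library's Word2Vec (gensim 4.x API) and an invented mini-corpus; averaging the keyword vectors of a short text is one common way to use such implicit features, though the patent does not prescribe it:

```python
import numpy as np
from gensim.models import Word2Vec

# Stand-in for the segmented, stop-word-filtered historical articles.
sentences = [["football", "match", "goal"],
             ["stock", "market", "index"],
             ["goal", "keeper", "save"]]

w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

def implicit_features(words):
    """Average the word vectors of a segmented short text; keywords
    unseen during training are skipped."""
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

print(implicit_features(["football", "goal"]).shape)  # (50,)
```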
As an alternative, before determining the target text topic corresponding to the target feature from the target correspondence corresponding to the target text type, the method further includes:
s1, acquiring a first training text;
s2, classifying the first training texts according to the text word number rule to obtain a plurality of first training text sets;
and S3, training the classifier corresponding to each first training text set through each first training text set among the plurality of first training text sets to obtain a plurality of target classifiers, wherein the plurality of target classifiers are used to indicate the correspondence between the features corresponding to text resources of each text type and text topics.
Optionally, in this embodiment, the first training text is classified using the same text word number rule as the text resources to be classified, which ensures that training texts and text resources are divided on the same scale and thus makes the determination of a text resource's topic more accurate.
Alternatively, in the present embodiment, the correspondence between features and text topics may be determined, but is not limited to being determined, by training classifiers with training text: the acquired first training text is divided by text word number into training samples of the plurality of text types, and the training samples of each type are used to train the classifier corresponding to that type.
In an alternative embodiment, as shown in fig. 6, the first training text is divided into three different types of training texts according to the number of text words: training text A corresponding to the first text type, training text B corresponding to the second text type and training text C corresponding to the third text type, and three classifiers corresponding to the training texts of different types are respectively: classifier a, classifier B and classifier C. Training the classifier A by using the training text A to obtain a target classifier A, training the classifier B by using the training text B to obtain a target classifier B, and training the classifier C by using the training text C to obtain a target classifier C. The target classifier a indicates a correspondence between a feature corresponding to a text resource of a first text type and a text topic, and the target classifier a may be used to classify the text resource of the first text type. The target classifier B indicates a correspondence between the feature corresponding to the text resource of the second text type and the text topic, and the target classifier B may be used to classify the text resource of the second text type. The target classifier C indicates a correspondence between a feature corresponding to a text resource of the third text type and a text topic, and the target classifier C may be used to classify the text resource of the third text type.
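A minimal sketch of this per-type training, assuming scikit-learn pipelines and invented toy texts; the classifier choice (logistic regression) and the grouping dict are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled texts, already split by the text word number rule.
training_by_type = {
    "short":  (["late goal wins cup", "shares dip at open"],
               ["sports", "finance"]),
    "medium": (["the team dominated the second half " * 20,
                "the central bank kept rates unchanged " * 20],
               ["sports", "finance"]),
    "long":   (["full season review " * 300, "annual market outlook " * 300],
               ["sports", "finance"]),
}

# One classifier per text type (classifier A / B / C in FIG. 6).
target_classifiers = {
    t: make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(X, y)
    for t, (X, y) in training_by_type.items()
}
print(target_classifiers["short"].predict(["equities slide again"]))
```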
As an alternative, the classifier corresponding to each first training text set may include multiple types of classification models, where training, through each first training text set, the classifier corresponding to each first training text set, to obtain the target classifier includes:
s1, training a plurality of types of classification models through each first training text set to obtain a first classifier comprising a plurality of target classification models, wherein the plurality of target classification models are used for indicating the corresponding relation between the characteristics corresponding to the text resources and the topics under the type of classification models;
s2, acquiring a second training text;
s3, classifying the second training texts according to the text word number rule to obtain a plurality of second training text sets;
s4, inputting each second training text set in the plurality of second training text sets into a first classifier corresponding to the second training text set to obtain a plurality of output results of each first classifier, wherein the output results are used for indicating the corresponding relation between each second training text set and the topic confidence coefficient, and the topic confidence coefficient is used for indicating the probability that each training text in each second training text set belongs to each topic in all topics;
and S5, training a decision classifier according to the multiple output results to obtain a second classifier, and taking the second classifier as the target classifier, wherein the decision classifier is used to indicate the contribution weight of the output result of each of the multiple types of classification models to each of the topics to which a training text may belong.
Optionally, in this embodiment, when the text resources to be classified, the first training text and the second training text are classified according to the number of text words, the same classification rule is adopted, so that the classification scale of the text according to the number of text words is ensured to be the same, and therefore, the accuracy of determining the topic of the text resources is higher.
Optionally, in this embodiment, the classifier corresponding to each first training text set includes multiple types of classification models. The plurality of types of classification models may include, but are not limited to: NB (naive bayes), SVM, fastText, maxEnt, etc. In the process of determining the target classifier corresponding to each text type, a first training text set split from a first training text is used for training a plurality of types of classification models in the classifier corresponding to the first training text set to obtain the first classifier, a second training text is obtained, the second training text is divided according to the same standard to obtain a second training text set, the second training text set is input into the corresponding first classifier to obtain a corresponding output result, a decision classifier is trained by using the output result corresponding to each text type to obtain the second classifier, and the second classifier is used as the target classifier corresponding to the text type. Wherein the decision classifier is operable to indicate a contribution weight of the output result of the classification model of each type under each text type to each of the subjects to which each training text belongs. That is, the decision classifier indicates how much each type of classification model affects the text topic of the text resource.
In an alternative embodiment, the decision classifier may be trained on the labeled second training text; however, instead of taking the raw content of the training samples directly, each sample is converted into the confidence probabilities that it belongs to each class, for example: sports: 0.31, entertainment: 0.30, and so on. The value sports: 0.31 is then treated as a sports feature produced by the first classification model, with feature value 0.31. From such training samples the decision classifier learns the final contribution weight of each classification model to each class, such as the contribution of the sports output to the sports class or to the entertainment class. This solves the problem that similar classes receive no contribution from each other. For example, an article may belong to both the culture class and the history class; without the decision classifier, voting would assign it to history and the cultural component would be lost. Compare with the voting approach: if four classification models output three sports and one entertainment, voting identifies the article as sports. In this embodiment each of the four classification models instead outputs a probability for each class, for example the first model outputs 31% sports, 30% entertainment, 40% health, 52% life, and the second, third, and fourth classifiers output similar distributions; the decision classifier then computes the topic of the text resource comprehensively, which here might be health and life.
Alternatively, in the present embodiment, for a classification model without a confidence output, its output may be converted into a confidence output. For example: by algorithm design, the NB and SVM classification models do not output confidences directly, so the output of the NB classifier may be converted into a probability output by isotonic (order-preserving) regression, and the SVM may convert its output into a probability output by Platt scaling. The outputs of all classification models are thus mapped into the [0,1] probability space and can be used as input to the decision classifier to train the final target classifier.
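For illustration, scikit-learn's CalibratedClassifierCV implements both calibration methods named here: method="sigmoid" is Platt scaling and method="isotonic" is isotonic (order-preserving) regression. A sketch with invented toy data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["great goal", "match report", "cup final", "league table",
         "rate hike", "bond yields", "stock rally", "market close"]
labels = ["sports"] * 4 + ["finance"] * 4
X = TfidfVectorizer().fit_transform(texts)

# LinearSVC has no probability output; Platt scaling (method="sigmoid")
# fits a sigmoid to its decision scores. method="isotonic" would apply
# isotonic regression instead, as described for the NB classifier.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)
calibrated.fit(X, labels)
print(calibrated.predict_proba(X[:1]))  # confidences in [0, 1]
```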
Alternatively, in this embodiment, the first training text and the second training text may be, but are not limited to, two sets obtained by randomly splitting the training corpus at the golden ratio within a large training sample space.
In an alternative embodiment, as shown in fig. 7, the total training text is divided into a first training text and a second training text, and the first training text is divided into three different types of training texts according to the number of text words: training text A corresponding to the first text type, training text B corresponding to the second text type and training text C corresponding to the third text type, and three classifiers corresponding to the training texts of different types are respectively: classifier a, classifier B and classifier C. The classifier A comprises a classification model A1, a classification model A2, a classification model A3 and a classification model A4, the classifier B comprises a classification model B1, a classification model B2, a classification model B3 and a classification model B4, the classifier C comprises a classification model C1, a classification model C2, a classification model C3 and a classification model C4, the classification model in the classifier A is trained by using a training text A to obtain a first classifier A, the classification model in the classifier B is trained by using a training text B to obtain a first classifier B, and the classification model in the classifier C is trained by using a training text C to obtain a first classifier C. Dividing the second training text into three different types of training texts according to the number of text words: training text M corresponding to the first text type, training text N corresponding to the second text type and training text P corresponding to the third text type. The training text M is input into a first classifier A to obtain an output result A, the training text N is input into a first classifier B to obtain an output result B, and the training text P is input into a first classifier C to obtain an output result C. And training the decision classifier A by using the output result A to obtain a target classifier A, training the decision classifier B by using the output result B to obtain a target classifier B, and training the decision classifier C by using the output result C to obtain a target classifier C.
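A hedged sketch of the decision-classifier (stacking) stage: the per-topic confidences of the first-level models become the feature vector of a second-level classifier. The models, the data, and the use of logistic regression as the decision classifier are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

topics = ["sports", "entertainment", "health", "life"]

def stack_confidences(models, x):
    """Feature vector for the decision classifier: the per-topic
    confidences of every first-level model, concatenated."""
    return np.concatenate([m.predict_proba(x)[0] for m in models])

# Toy stand-in for the second-training-text stage: 30 samples, each with
# 4 models x 4 topics = 16 confidence features, plus true topic labels.
rng = np.random.default_rng(0)
meta_X = rng.random((30, 16))
meta_y = rng.choice(topics, size=30)

# The decision classifier's learned coefficients play the role of the
# "contribution weight" of each model's output toward each topic.
decision = LogisticRegression(max_iter=1000).fit(meta_X, meta_y)
print(decision.predict(meta_X[:1]))
```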
As an optional solution, when the text type of the training text included in the first training text set is a first text type, and the number of text words of the text resource of the first text type is less than or equal to the first number of words, training, by each first training text set, a classifier corresponding to each first training text set includes:
s1, word segmentation is carried out on a first training text set, and second word segmentation words are obtained;
s2, obtaining a topic-word matrix and a document-topic matrix of the first training text set;
s3, obtaining second target implicit features corresponding to second word-division words from second corresponding relations, wherein the second corresponding relations are corresponding relations between each second word and the implicit features used for representing each second word, and the second words comprise the second word-division words;
s4, converting the theme-word matrix into a theme-implicit feature matrix according to a second target implicit feature corresponding to the second word;
s5, training the classifier corresponding to the first training text set through the theme-implicit characteristic matrix and the document-theme matrix.
Alternatively, in this embodiment, a topic-word matrix may be used to indicate a distribution relationship between topics and second word-words, and a document-topic matrix may be used to indicate a distribution relationship between the first training text set and topics.
Alternatively, in this embodiment, for text resources with a small number of text words, training the classifier using training text may convert words extracted therefrom into implicit features.
In an alternative embodiment, because the word number of a short text is too small, word meaning may not be determined accurately from the words alone; polysemy (one word with multiple senses) and synonymy (multiple words with one sense) are pronounced in short text. An LDA-based topic model can handle polysemy and synonymy well, but if the LDA training samples are too short it is difficult to build a high-quality topic model. Therefore, this embodiment proposes, as the feature extraction method for short text, an LDA algorithm based on implicit features, the LDA4V model. Building on the LDA algorithm, the Dirichlet distribution of topic-word (equivalent to the topic-word matrix) is replaced by a combination of two distribution matrices: the Dirichlet distribution of topic-word and the implicit features. The implicit features are generated by a word vector model trained on a large external corpus, and the word vectors cover unigram and bigram patterns. By adding implicit features, two distribution matrices (the topic-word matrix and the word-implicit-feature matrix) represent the relationship between implicit features and topics; sampling over these two matrices finally generates the words corresponding to each topic, so the extracted feature words are richer and their meaning is more definite.
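The LDA4V model itself is specific to this patent; as a reference point, a plain LDA sketch with gensim (invented mini-corpus) shows the topic-word and document-topic distributions that LDA4V extends with the word-implicit-feature matrix:

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["goal", "match", "cup"], ["stocks", "market", "rates"],
        ["striker", "goal", "league"], ["bank", "rates", "bonds"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Plain LDA yields the topic-word and document-topic distributions; the
# patent's LDA4V additionally factors topic-word through a
# word-implicit-feature matrix from Word2vec (not shown here).
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
print(lda.get_document_topics(bow[0]))  # document-topic distribution
print(lda.show_topic(0, topn=3))        # topic-word distribution
```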
As an alternative, determining the target text topic corresponding to the target feature from the target correspondence corresponding to the target text type includes:
s1, inputting target characteristics into a third classifier corresponding to the target text type in a plurality of target classifiers to obtain an output result of the third classifier;
s2, determining the output result of the third classifier as a target text theme.
Optionally, in this embodiment, each target classifier represents the correspondence between the features of text resources of one text type and text topics. The target classifier corresponding to the target text type of the text resource to be classified, that is, the third classifier, is determined; the target features extracted from the text resource are input into the third classifier, and its output result is taken as the target text topic of the resource.
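Tying the steps together, a sketch of inference routing (assuming the hypothetical target_classifiers dict and toy thresholds from the earlier sketches):

```python
def classify_text_resource(text: str) -> str:
    """End-to-end sketch of steps S302-S306 with the toy thresholds and
    the hypothetical target_classifiers dict from the earlier sketches."""
    n_words = len(text.split())  # S302: determine the target text type
    t = "short" if n_words <= 20 else "medium" if n_words <= 500 else "long"
    third_classifier = target_classifiers[t]  # the matching target classifier
    # S304 + S306: the pipeline extracts features and outputs the topic.
    return third_classifier.predict([text])[0]

print(classify_text_resource("late goal wins cup"))
```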
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus the necessary general hardware platform, or by hardware alone, though in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
According to an embodiment of the present invention, there is also provided a text resource classification apparatus for implementing the above text resource classification method, as shown in fig. 8, where the apparatus includes:
1) A first determining module 82, configured to determine, according to a text word number of the text resource to be classified, a target text type to which the text resource to be classified belongs in a plurality of text types, where the plurality of text types are text types divided according to a text word number rule;
2) An extracting module 84, configured to extract a target feature from the text resource to be classified by using a target feature extraction mode corresponding to a target text type among a plurality of feature extraction modes, where the plurality of text types correspond to the plurality of feature extraction modes;
3) The second determining module 86 is configured to determine a target text topic corresponding to the target feature from a target correspondence corresponding to the target text type, where the target correspondence is used to indicate a correspondence between a feature corresponding to a text resource of the target text type and the text topic.
Alternatively, in this embodiment, the above-mentioned text resource classification device may be applied, but not limited to, in a scenario of classifying text resources. For example: a scene in which a news reading class application classifies news text, a scene in which a search class application classifies web page text, a scene in which a text pushing class application classifies text resources, and so on. The above is merely an example, and there is no limitation in this embodiment.
Alternatively, in the present embodiment, the text word count rule may be, but not limited to, a rule for dividing text types by text word count, and the text word count rule may be a classification rule set according to a distribution of text word count of a text resource stored in a database. For example: the text word count rule may be set to divide the text resources into four categories, type 1, type 2, type 3, and type 4, wherein type 1 includes text resources having a text word count of less than or equal to 50 words, type 2 includes text resources having a text word count of greater than 50 words and less than or equal to 500 words, type 3 includes text resources having a text word count of greater than 500 words and less than or equal to 10000 words, and type 4 includes text resources having a text word count of greater than 10000 words.
Alternatively, in the present embodiment, the features of the text may include, but are not limited to: word features, semantic features, and structural features. As shown in FIG. 4, word features may include, but are not limited to, local features and global features. Local features may be extracted in, for example, TF mode or n-gram mode; global features may be extracted in, for example, DF mode, IDF mode, RF mode, Chi-Score mode, or One mode. Word feature extraction may combine local and global features, for example: TF-DF mode, TF-IDF mode, TF-RF mode, TF-Chi-Score mode, and the like. Semantic features, also known as multidimensional features, may be extracted by ways that include, but are not limited to: the LSI method, the LDA method, and the Word2Vec method, where the LSI method can handle synonymy (multiple words with one sense) and feature dimension reduction, the LDA method can handle both synonymy and polysemy (one word with multiple senses), and the Word2Vec method can be used to expand word senses. Structural features may include, but are not limited to, article structures and media structures. Article structures may include, but are not limited to: title, text paragraph, paragraph location, and so on. Media structures may include, but are not limited to: media history distribution, high purity media, regular media, and so forth.
Alternatively, in the present embodiment, the correspondence between the feature corresponding to the text resource and the text topic may be represented using, but not limited to, a natural language classification model. The natural language classification model may include, but is not limited to: a Naive Bayes (NB) classification model, a Support Vector Machine (SVM) model, a decision tree model, a fast text classifier (FastText) model, a maximum entropy (MaxEnt) model, and so forth.
The idea underlying the Naive Bayes (NB) classification model is: for a given item to be classified, compute the probability of each category given that the item appears, and assign the item to the category with the largest probability.
The support vector machine model improves the generalization capability of the learner by seeking the minimum structural risk, minimizing both the empirical risk and the confidence range, so that good statistical rules can be obtained even with a small sample size. In plain terms, it is a two-class classification model whose basic form is a linear classifier with the largest margin in feature space; that is, the learning strategy of the support vector machine is margin maximization, which can finally be converted into solving a convex quadratic programming problem.
The decision tree is a tree structure (binary or non-binary). Each non-leaf node represents a test on a feature attribute, each branch represents the output of that attribute over a range of values, and each leaf node stores a class. Classification with a decision tree starts from the root node, tests the corresponding feature attribute of the item to be classified, and follows the branch matching the attribute's value until a leaf node is reached; the class stored at that leaf node is the decision result.
Therefore, with the above device, text resources of different lengths suit different feature extraction modes and text classification models. Text resources are divided into a plurality of text types according to their text word number; each text type corresponds to a feature extraction mode suited to resources of that type and to its own correspondence between features and text topics. When classifying a text resource, the target text type to which it belongs is determined from its text word number, the target feature is extracted using the target feature extraction mode corresponding to that type, and the target text topic is determined from the target correspondence for that type. The extraction mode and correspondence used thus match the type of the text resource, improving classification accuracy and overcoming the poor accuracy of text resource classification in the prior art.
As an alternative, the first determining module includes:
1) The statistics unit is used for counting the number of text words of the text resources to be classified;
2) A first determining unit, configured to determine, when the number of text words is less than or equal to the first number of words, that a target text type to which the text resource to be classified belongs is a first text type;
3) The second determining unit is used for determining that the target text type to which the text resource to be classified belongs is a second text type under the condition that the text word number is larger than the first word number and smaller than or equal to the second word number;
4) And the third determining unit is used for determining that the target text type to which the text resource to be classified belongs is a third text type under the condition that the text word number is larger than the second word number.
Alternatively, in this embodiment, the plurality of text types divided according to text word number may include, but is not limited to, three text types: the first text type may be short text, the second text type may be medium-length text, and the third text type may be long text, where the division thresholds for the text word number may be set according to, but not limited to, the texts actually stored in the database. For example: suppose there are 10000 text resources in the database, where the shortest text is 10 words and the longest is 2000 words; texts of 10 to 500 words account for 35% of the total, texts of 500 to 1550 words account for 30%, and texts of 1550 to 2000 words account for 35%. Then texts of 500 words or fewer can be set as short text, texts of 500 to 1550 words as medium-length text, and texts of more than 1550 words as long text. As shown in FIG. 5, a text resource to be classified is obtained and its text word number is counted to be 756 words; according to the set type-division rule, the resource can be determined to be a medium-length text.
As an alternative, in the case that the target text type of the text resource to be classified is the first text type, the extracting module includes:
1) The first word segmentation unit is used for segmenting the text resources to be classified to obtain first word segmentation words;
2) The first obtaining unit is used for obtaining first target implicit features corresponding to the first word segmentation words from a first correspondence, wherein the first correspondence is the correspondence between each first word and the implicit feature used for characterizing it, and the first words include the first word segmentation words;
3) And a fourth determining unit, configured to determine the first target implicit feature as a target feature.
Alternatively, in this embodiment, for a text resource of a first text type with a smaller number of text words, since there may be fewer first word segmentation words obtained by segmenting the text resource, the features of the text may not be fully represented, so that the implicit features corresponding to the first word segmentation words may be obtained in the first correspondence relationship for representing the correspondence relationship between each first word and the implicit feature, and the implicit feature may be used as the target feature of the text resource. Wherein implicit features are used to characterize each first word.
In an alternative embodiment, the implicit features may be obtained by training a Word2vec model with a large number of historical articles in a news base as the corpus. After the historical articles are segmented and stop words are removed, keywords are extracted and a Word2vec model is trained on them, converting each keyword into a vector representation; this representation can be used to measure the similarity between words. The word-vector representation may serve as the implicit feature.
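A minimal sketch of producing such implicit features with gensim's Word2Vec follows; the toy corpus and all hyperparameters are assumptions for illustration, not values fixed by this embodiment:

    from gensim.models import Word2Vec

    # Tokenized historical articles (already segmented, stop words removed).
    corpus = [
        ["stock", "market", "rally", "investors"],
        ["league", "final", "goal", "match"],
        ["market", "investors", "shares", "rally"],
    ]

    # Train word vectors; sg=1 selects the skip-gram variant.
    model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1)

    vec = model.wv["market"]                   # implicit feature (word vector) for one word
    similar = model.wv.most_similar("market")  # word-to-word similarity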
As an alternative, the apparatus further includes:
1) The acquisition module is used for acquiring a first training text;
2) The classification module is used for classifying the first training texts according to the text word number rule to obtain a plurality of first training text sets;
3) The training module is used for training the classifier corresponding to each first training text set through each first training text set in the plurality of first training text sets to obtain a plurality of target classifiers, wherein the plurality of target classifiers are used for indicating the corresponding relation between the characteristics corresponding to the text resources of each text type and the text subjects.
Optionally, in this embodiment, the first training text is classified with the same text word number rule as the text resources to be classified, which ensures that training texts and text resources are divided on the same scale and thus makes the determination of the topic of a text resource more accurate.
Alternatively, in the present embodiment, the correspondence between features and text topics may be determined, but is not limited to being determined, by training classifiers with training text: the acquired first training text is divided by text word number into training samples of the plurality of text types, and the training samples of each type are used to train the classifier corresponding to that type.
In an alternative embodiment, as shown in fig. 6, the first training text is divided into three different types of training texts according to the number of text words: training text A corresponding to the first text type, training text B corresponding to the second text type and training text C corresponding to the third text type, and three classifiers corresponding to the training texts of different types are respectively: classifier a, classifier B and classifier C. Training the classifier A by using the training text A to obtain a target classifier A, training the classifier B by using the training text B to obtain a target classifier B, and training the classifier C by using the training text C to obtain a target classifier C. The target classifier a indicates a correspondence between a feature corresponding to a text resource of a first text type and a text topic, and the target classifier a may be used to classify the text resource of the first text type. The target classifier B indicates a correspondence between the feature corresponding to the text resource of the second text type and the text topic, and the target classifier B may be used to classify the text resource of the second text type. The target classifier C indicates a correspondence between a feature corresponding to a text resource of the third text type and a text topic, and the target classifier C may be used to classify the text resource of the third text type.
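Under the assumption that each per-type classifier is an ordinary supervised text classifier, the figure-6 flow could be sketched as below; the TF-IDF plus logistic-regression pipeline, and the text_type helper from the earlier sketch, are illustrative stand-ins rather than the classifiers the patent mandates:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_per_type(texts, topics):
        # Train one classifier per text type (classifier A/B/C in fig. 6).
        classifiers = {}
        for t in ("short", "medium", "long"):
            subset = [(x, y) for x, y in zip(texts, topics) if text_type(x) == t]
            if not subset:
                continue
            X, y = zip(*subset)
            pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
            classifiers[t] = pipe.fit(X, y)
        return classifiers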
As an alternative, the classifier corresponding to each first training text set includes multiple types of classification models, where the training module includes:
1) The first training unit is used for training a plurality of types of classification models through each first training text set to obtain a first classifier comprising a plurality of target classification models, wherein the plurality of target classification models are used for indicating the corresponding relation between the characteristics corresponding to the text resources under the type of classification models and the topics;
2) The first acquisition unit is used for acquiring a second training text;
3) The classification unit is used for classifying the second training texts according to the text word number rule to obtain a plurality of second training text sets;
4) The first input unit is used for inputting each second training text set in the plurality of second training text sets into a first classifier corresponding to the second training text set to obtain a plurality of output results of each first classifier, wherein the output results are used for indicating the corresponding relation between each second training text set and the topic confidence coefficient, and the topic confidence coefficient is used for indicating the probability that each training text in each second training text set belongs to each topic in all topics;
5) The second training unit is used for training the decision classifier according to the plurality of output results to obtain a second classifier, and taking the second classifier as a target classifier, wherein the decision classifier is used for indicating the contribution weight of the output result of each type of classification model in the plurality of types of classification models to each topic of all topics belonging to each training text.
Optionally, in this embodiment, the text resources to be classified, the first training text, and the second training text are all divided by text word number under the same rule, which ensures that texts are divided on the same scale and thus makes the determination of the topic of a text resource more accurate.
Optionally, in this embodiment, the classifier corresponding to each first training text set includes multiple types of classification models, which may include, but are not limited to: NB (naive Bayes), SVM (support vector machine), FastText, MaxEnt (maximum entropy), and the like. To determine the target classifier for each text type, the first training text set split from the first training text is used to train the multiple types of classification models in the corresponding classifier, yielding the first classifier; a second training text is then obtained and divided by the same standard into second training text sets; each second training text set is input into the corresponding first classifier to obtain output results; and the output results for each text type are used to train a decision classifier, yielding the second classifier, which serves as the target classifier for that text type. The decision classifier indicates the contribution weight of the output of each type of classification model, under each text type, to each of the topics to which a training text may belong; that is, it indicates how strongly each type of classification model influences the text topic assigned to a text resource.
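One way to approximate this two-stage scheme is scikit-learn's StackingClassifier, which feeds the base models' class probabilities into a logistic-regression decision classifier. Note two simplifying assumptions: StackingClassifier derives those probabilities by internal cross-validation rather than from a separate second training text as in this embodiment, and the base estimators below omit FastText and MaxEnt (which are not sklearn models):

    from sklearn.ensemble import StackingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    base_models = [
        ("nb", MultinomialNB()),
        ("svm", SVC(probability=True)),  # Platt-style probability output
    ]

    # The decision classifier (LR) learns per-topic contribution weights
    # from the base models' confidence outputs.
    stack = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",
    )
    model = make_pipeline(TfidfVectorizer(), stack)
    # model.fit(train_texts, train_topics)  # texts/labels of one text type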
In an alternative embodiment, the decision classifier may be trained on the labeled second training text, but instead of taking the raw content of the training samples directly, each sample is converted into the confidence probabilities that it belongs to each class, for example sports 1: 0.31, entertainment 1: …, and so on. The output sports 1: 0.31 is then treated as a sports feature produced by the first classification model, with feature value 0.31. From such training samples the decision classifier learns the final contribution weight of each classification model to each class, e.g. the contribution of sports 1 to the sports class and the contribution of sports 1 to the entertainment class, which alleviates the problem that similar classes otherwise contribute nothing to each other. For example, an article may belong to both the culture class and the history class; without a decision classifier, voting would assign it to history and the cultural component would be lost. Compare the voting scheme: if four classification models output three votes for sports and one for entertainment, the article is labeled sports. In this embodiment, each of the four classification models instead outputs class probabilities, for example the first model outputs 31% sports, 30% entertainment, 40% health, 52% life (the second, third, and fourth classifiers output similarly), and the decision classifier weighs them together to compute the topics to which the text resource belongs, possibly health and life.
Alternatively, in the present embodiment, for a classification model without a confidence output, its output may be converted into a confidence output. For example, the NB and SVM classification models have no confidence output by design of their algorithms; the output of the NB classifier may therefore be converted into a probability output by isotonic (order-preserving) regression, and the output of the SVM may be converted into a probability output by Platt scaling. In this way the outputs of all classification models are mapped into the [0,1] probability space and can be used as input to the decision classifier to train the final target classifier.
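Both conversions are available in scikit-learn's CalibratedClassifierCV, sketched below under the assumption that sklearn estimators stand in for the NB and SVM models of this embodiment ('isotonic' corresponds to isotonic regression, 'sigmoid' to Platt scaling):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # NB scores -> probabilities via isotonic regression.
    nb_calibrated = CalibratedClassifierCV(MultinomialNB(), method="isotonic", cv=3)

    # SVM margins -> probabilities via Platt scaling (a sigmoid fit).
    svm_calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)

    # After fitting, both expose predict_proba, giving outputs in [0, 1]
    # that can serve as input to the decision classifier.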
Alternatively, in this embodiment, the first training text and the second training text may be, but are not limited to, two parts of the training corpus obtained by randomly splitting a large training sample space according to the golden section through a random sampling process.
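A golden-section split of this kind might look as follows; the placeholder corpus and the random seed are assumptions for illustration:

    from sklearn.model_selection import train_test_split

    texts  = ["document %d body" % i for i in range(100)]       # placeholder corpus
    topics = ["sports" if i % 2 else "finance" for i in range(100)]

    # ~61.8% becomes the first training text (base classifiers),
    # ~38.2% the second training text (decision classifier).
    first_texts, second_texts, first_topics, second_topics = train_test_split(
        texts, topics, train_size=0.618, random_state=42)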
In an alternative embodiment, as shown in fig. 7, the total training text is divided into a first training text and a second training text, and the first training text is divided into three different types of training texts according to the number of text words: training text A corresponding to the first text type, training text B corresponding to the second text type and training text C corresponding to the third text type, and three classifiers corresponding to the training texts of different types are respectively: classifier a, classifier B and classifier C. The classifier A comprises a classification model A1, a classification model A2, a classification model A3 and a classification model A4, the classifier B comprises a classification model B1, a classification model B2, a classification model B3 and a classification model B4, the classifier C comprises a classification model C1, a classification model C2, a classification model C3 and a classification model C4, the classification model in the classifier A is trained by using a training text A to obtain a first classifier A, the classification model in the classifier B is trained by using a training text B to obtain a first classifier B, and the classification model in the classifier C is trained by using a training text C to obtain a first classifier C. Dividing the second training text into three different types of training texts according to the number of text words: training text M corresponding to the first text type, training text N corresponding to the second text type and training text P corresponding to the third text type. The training text M is input into a first classifier A to obtain an output result A, the training text N is input into a first classifier B to obtain an output result B, and the training text P is input into a first classifier C to obtain an output result C. And training the decision classifier A by using the output result A to obtain a target classifier A, training the decision classifier B by using the output result B to obtain a target classifier B, and training the decision classifier C by using the output result C to obtain a target classifier C.
As an alternative, in a case that the text type of the training text included in the first training text set is a first text type, where the number of text words of the text resource of the first text type is less than or equal to the first number of words, the training module includes:
1) The second word segmentation unit is used for segmenting the first training text set to obtain second word segmentation words;
2) The second acquisition unit is used for acquiring a theme-word matrix and a document-theme matrix of the first training text set;
3) The third obtaining unit is used for obtaining second target implicit features corresponding to the second word segmentation words from a second correspondence, wherein the second correspondence is the correspondence between each second word and the implicit feature used for characterizing it, and the second words include the second word segmentation words;
4) The conversion unit is used for converting the theme-word matrix into a theme-implicit characteristic matrix according to the second target implicit characteristic corresponding to the second word;
5) And the third training unit is used for training the classifier corresponding to the first training text set through the theme-implicit characteristic matrix and the document-theme matrix.
Alternatively, in this embodiment, the topic-word matrix may be used to indicate the distribution relationship between topics and the second word segmentation words, and the document-topic matrix may be used to indicate the distribution relationship between the first training text set and topics.
Alternatively, in this embodiment, for text resources with a small number of text words, the words extracted from the training text may be converted into implicit features when training the classifier.
In an alternative embodiment, because the word count of a short text is too small, the meaning of a word often cannot be determined accurately from the word alone; polysemy (one word with several meanings) and synonymy (several words with one meaning) are pronounced in short texts. An LDA-based topic model can handle polysemy and synonymy well, but if the training samples for LDA are too short it is difficult to build a high-quality topic model. Therefore, this embodiment proposes, as the feature extraction method for short text, an LDA algorithm based on implicit features, i.e. the LDA4V model. Building on the LDA algorithm, the Dirichlet distribution of topic-word (equivalent to the topic-word matrix) is replaced by a combination of two distribution matrices: the topic-word Dirichlet distribution and the implicit features, where the implicit features are generated by a word vector model trained on a large external corpus and the word vectors cover unigram and bigram patterns. With the implicit features added, two distribution matrices (the topic-word matrix and the word-implicit-feature matrix) represent the relationship between implicit features and topics, and sampling over these two matrices finally generates the words corresponding to each topic, so that the extracted feature words are richer and their meanings more definite.
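For orientation, the two matrices handled by the units above can be produced with a standard LDA implementation such as gensim's, sketched below on a toy corpus; the LDA4V replacement of the topic-word Dirichlet by its combination with word-vector implicit features is this embodiment's own construction and is not part of the library call:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [["market", "stock", "rally"],
            ["goal", "match", "league"],
            ["stock", "shares", "market"]]

    dictionary = Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(bows, num_topics=2, id2word=dictionary, passes=10)

    topic_word = lda.get_topics()  # topic-word matrix (num_topics x vocabulary)
    doc_topic = [lda.get_document_topics(b) for b in bows]  # document-topic distribution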
As an alternative, the second determining module includes:
1) The second input unit is used for inputting the target characteristics into a third classifier corresponding to the target text type in the target classifiers to obtain an output result of the third classifier;
2) And a fifth determining unit for determining the output result of the third classifier as the target text subject.
Optionally, in this embodiment, each target classifier represents the correspondence between the features of text resources of the corresponding text type and text topics. The target classifier corresponding to the target text type of the text resource to be classified, i.e. the third classifier, is determined first; the target features extracted from the text resource to be classified are then input into the third classifier, and its output result is taken as the target text topic of the text resource to be classified.
The application environment of the embodiment of the present invention may be, but is not limited to, the application environment in the above embodiment, and this will not be described in detail in this embodiment. The embodiment of the invention provides an alternative specific application example for implementing the text resource classification method.
As an alternative embodiment, the above text resource classification method may be applied, but is not limited, to the scenario of classifying text resources shown in fig. 9. On the main recommendation page of a news information application, a search engine application, or a text push application, the system recommends text resources of different categories to different users. The categories serve as basic data of the recommendation system, and their accuracy strongly affects the accuracy of recommendation, so improving the accuracy of text classification is very important. As shown in fig. 9, in the model training phase, feature selection is performed on the text training set, the training set is divided by type, and the different types of features are used to train the classification models. In the classification phase, newly acquired articles are input into the classification model to obtain their classification topics; the correspondence between the new articles and their topics is sent to the recommendation system, which uses it to recommend articles of different topics on each user's main recommendation page.
In an alternative embodiment, as shown in fig. 10, for the training phase of the classification model, the training corpus in the text corpus is randomly split into two parts (sample A, sample B) according to the golden section through a random sampling process. Sample A is used to train the individual base classifiers (NB, SVM, FastText, MaxEnt) that form the combined model, and sample B is used to train a single classifier (an LR logistic regression model) as the decision classifier. Precision and recall metrics can be computed for the resulting overall classification model on a test library (new test documents), and misclassified cases can be collected and merged back into the training library to iteratively optimize the model.
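The precision/recall evaluation on the test library could be computed per topic, e.g. with scikit-learn; the labels below are placeholders:

    from sklearn.metrics import classification_report

    y_true = ["sports", "news", "sports", "finance"]   # test-library labels
    y_pred = ["sports", "news", "finance", "finance"]  # overall model output

    # Per-topic precision and recall over the test library.
    print(classification_report(y_true, y_pred, zero_division=0))

    # Misclassified documents ("bad cases") are collected and merged back
    # into the training library for the next optimization iteration.
    bad_cases = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]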
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the above text resource classification method. As shown in fig. 11, the electronic device may include: one or more (only one is shown) processors 1102, a memory 1104, a sensor 1106, an encoder 1108, and a transmission device 1110.
The memory 1104 may be used to store software programs and modules, such as the program instructions and modules corresponding to the text resource classification method and apparatus in the embodiments of the present invention.
The processor 1102 performs various functional applications and data processing, i.e., the text resource classification method described above, by executing the software programs and modules stored in the memory 1104. The memory 1104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1104 may further include memory remotely located relative to the processor 1102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1110 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1110 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1110 is a Radio Frequency (RF) module that is configured to communicate wirelessly with the internet.
Specifically, the memory 1104 is used for storing information of preset action conditions and preset authority users, and application programs.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is only illustrative. The electronic device may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, or the like; fig. 11 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 11, or have a different configuration from that shown in fig. 11.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium may be located in at least one network device among a plurality of network devices in a network.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:
s1, determining a target text type to which a text resource to be classified belongs according to the number of text words of the text resource to be classified, wherein the target text type is one of a plurality of text types, and the plurality of text types are text types divided according to the number of text words of the text resource;
s2, extracting target features from the text resources to be classified by adopting a target feature extraction mode corresponding to the target text types, wherein the text types correspond to the feature extraction modes;
S3, determining a target text theme corresponding to the target feature from target corresponding relations corresponding to the target text types, wherein the target corresponding relations are used for indicating corresponding relations between features corresponding to text resources with the text types being the target text types and the text theme.
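Taken together, steps S1-S3 amount to route-then-predict. A compact sketch follows, assuming the text_type helper and the per-type pipelines from the earlier sketches (each pipeline bundling the type-specific feature extraction with the topic lookup):

    def classify(text, classifiers):
        # S1: determine the target text type by word count.
        t = text_type(text)
        # S2 + S3: the type-specific pipeline extracts the target features
        # and maps them to the target text topic.
        return classifiers[t].predict([text])[0]

    # topic = classify(new_article, train_per_type(train_texts, train_topics))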
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiment 1 and embodiment 2, and this embodiment is not described herein.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical function division, and another division manner may be used in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling, direct coupling, or communication connection shown or discussed between the components may be through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (12)

1. A method for classifying text resources, comprising:
determining a target text type of a text resource to be classified in a plurality of text types according to the text word number of the text resource to be classified, wherein the plurality of text types are text types divided according to a text word number rule; extracting target features from the text resources to be classified by adopting a target feature extraction mode corresponding to the target text type in a plurality of feature extraction modes;
training through a first training text set to obtain a target classifier;
Determining a target text theme corresponding to the target feature from target corresponding relations corresponding to the target text types by using the target classifier, wherein the target corresponding relations are used for indicating corresponding relations between features corresponding to text resources of the target text types and the text theme;
the extracting the target feature from the text resource to be classified by adopting a target feature extraction mode corresponding to the target text type in a plurality of feature extraction modes comprises: under the condition that the target text type of the text resource to be classified is a first text type, word segmentation is carried out on the text resource to be classified to obtain a first word segmentation word; acquiring first target implicit features corresponding to the first word segmentation words from a first corresponding relation, wherein the first corresponding relation is a corresponding relation between each first word and the implicit features used for representing the first words, and the first words comprise the first word segmentation words; determining the first target implicit feature as the target feature;
the training through the first training text set to obtain the target classifier comprises the following steps: under the condition that the text type of the training text included in the first training text set is the first text type, word segmentation is carried out on the first training text set to obtain second word segmentation words; obtaining a topic-word matrix and a document-topic matrix of the first training text set; acquiring second target implicit features corresponding to the second word segmentation words from a second correspondence, wherein the second correspondence is the correspondence between each second word and the implicit feature used for characterizing it, and the second words include the second word segmentation words; converting the topic-word matrix into a topic-implicit feature matrix according to the second target implicit features corresponding to the second word segmentation words; and training the classifier corresponding to the first training text set through the topic-implicit feature matrix and the document-topic matrix.
2. The method of claim 1, wherein determining the target text type to which the text resource to be classified belongs in a plurality of text types based on the number of text words of the text resource to be classified comprises:
counting the text word number of the text resource to be classified;
determining that the target text type to which the text resource to be classified belongs is a first text type when the text word number is smaller than or equal to a first word number;
determining that the target text type to which the text resource to be classified belongs is a second text type under the condition that the text word number is larger than the first word number and smaller than or equal to a second word number;
and determining that the target text type to which the text resource to be classified belongs is a third text type under the condition that the text word number is larger than the second word number.
3. The method of claim 1, wherein prior to determining the target text topic corresponding to the target feature from the target correspondence corresponding to the target text type, the method further comprises:
acquiring a first training text;
classifying the first training texts according to the text word number rule to obtain a plurality of first training text sets;
Training the classifier corresponding to each first training text set through each first training text set in the plurality of first training text sets to obtain a plurality of target classifiers, wherein the plurality of target classifiers are used for indicating the corresponding relation between the characteristics corresponding to the text resources of each text type and the text subjects.
4. The method of claim 3, wherein the classifier corresponding to each first training text set includes a plurality of types of classification models, wherein training the classifier corresponding to each first training text set through each first training text set, and obtaining the target classifier includes:
training the multiple types of classification models through each first training text set to obtain a first classifier comprising multiple target classification models, wherein the multiple target classification models are used for indicating the corresponding relation between the characteristics corresponding to the text resources under the type of classification models and the topics;
acquiring a second training text;
classifying the second training texts according to the text word number rule to obtain a plurality of second training text sets;
Inputting each second training text set in the plurality of second training text sets into a first classifier corresponding to the second training text set to obtain a plurality of output results of each first classifier, wherein the output results are used for indicating the corresponding relation between each second training text set and topic confidence level, and the topic confidence level is used for indicating the probability that each training text in each second training text set belongs to each topic in all topics;
training a decision classifier according to the plurality of output results to obtain a second classifier, and taking the second classifier as the target classifier, wherein the decision classifier is used for indicating the contribution weight of the output result of each type of classification model in the plurality of types of classification models to each topic of all topics belonging to each training text.
5. The method of claim 3, wherein determining a target text topic corresponding to the target feature from a target correspondence corresponding to the target text type comprises:
inputting the target characteristics into a third classifier corresponding to the target text type in the target classifiers to obtain an output result of the third classifier;
And determining the output result of the third classifier as the target text theme.
6. A text resource classification device, comprising:
the first determining module is used for determining a target text type of the text resource to be classified in a plurality of text types according to the text word number of the text resource to be classified, wherein the plurality of text types are text types divided according to a text word number rule;
the device is also used for obtaining a target classifier through training of the first training text set; the extraction module is used for extracting target features from the text resources to be classified by using the target classifier in a target feature extraction mode corresponding to the target text type in a plurality of feature extraction modes;
the second determining module is used for determining a target text theme corresponding to the target feature from a target corresponding relation corresponding to the target text type, wherein the target corresponding relation is used for indicating a corresponding relation between the feature corresponding to the text resource of the target text type and the text theme;
the apparatus further comprises: the first word segmentation unit is used for segmenting the text resource to be classified to obtain a first word segmentation word under the condition that the target text type of the text resource to be classified is a first text type; the first acquisition unit is used for obtaining first target implicit features corresponding to the first word segmentation words from a first corresponding relation, wherein the first corresponding relation is a corresponding relation between each first word and the implicit features used for representing the first words, and the first words comprise the first word segmentation words; a fourth determining unit, configured to determine the first target implicit feature as the target feature;
The device further comprises: under the condition that the text type of the training text included in the first training text set is the first text type, a second word segmentation unit is used for segmenting the first training text set to obtain second word segmentation words; the second acquisition unit is used for acquiring a topic-word matrix and a document-topic matrix of the first training text set; the third obtaining unit is configured to obtain second target implicit features corresponding to the second word segmentation words from a second correspondence, wherein the second correspondence is the correspondence between each second word and the implicit feature used for characterizing it, and the second words include the second word segmentation words; the conversion unit is used for converting the topic-word matrix into a topic-implicit feature matrix according to the second target implicit features corresponding to the second word segmentation words; and the third training unit is used for training the classifier corresponding to the first training text set through the topic-implicit feature matrix and the document-topic matrix.
7. The apparatus of claim 6, wherein the first determining module comprises:
the statistics unit is used for counting the text word number of the text resource to be classified;
a first determining unit, configured to determine, when the number of text words is less than or equal to a first number of words, that the target text type to which the text resource to be classified belongs is a first text type;
a second determining unit, configured to determine, when the number of text words is greater than the first number of words and less than or equal to a second number of words, that the target text type to which the text resource to be classified belongs is a second text type;
And the third determining unit is used for determining that the target text type to which the text resource to be classified belongs is a third text type when the text word number is larger than the second word number.
8. The apparatus of claim 6, wherein the apparatus further comprises:
the acquisition module is used for acquiring a first training text;
the classification module is used for classifying the first training texts according to the text word number rule to obtain a plurality of first training text sets;
the training module is used for training the classifier corresponding to each first training text set through each first training text set in the plurality of first training text sets to obtain a plurality of target classifiers, wherein the plurality of target classifiers are used for indicating the corresponding relation between the characteristics corresponding to the text resources of each text type and the text subjects.
9. The apparatus of claim 8, wherein the classifier corresponding to each first training text set includes a plurality of types of classification models therein, and wherein the training module includes:
the first training unit is used for training the multiple types of classification models through each first training text set to obtain a first classifier comprising multiple target classification models, wherein the multiple target classification models are used for indicating the corresponding relation between the characteristics corresponding to the text resources under the type of classification models and the subjects;
The first acquisition unit is used for acquiring a second training text;
the classification unit is used for classifying the second training texts according to the text word number rule to obtain a plurality of second training text sets;
the first input unit is used for inputting each second training text set in the plurality of second training text sets into a first classifier corresponding to the second training text set to obtain a plurality of output results of each first classifier, wherein the output results are used for indicating the corresponding relation between each second training text set and the topic confidence coefficient, and the topic confidence coefficient is used for indicating the probability that each training text in each second training text set belongs to each topic in all topics;
the second training unit is configured to train a decision classifier according to the multiple output results to obtain a second classifier, and take the second classifier as the target classifier, where the decision classifier is configured to indicate a contribution weight of an output result of each of the multiple types of classification models to each of the subjects to which each training text belongs.
10. The apparatus of claim 8, wherein the second determining module comprises:
the second input unit is used for inputting the target characteristics into a third classifier corresponding to the target text type in the target classifiers to obtain an output result of the third classifier;
and a fifth determining unit, configured to determine an output result of the third classifier as the target text topic.
11. A storage medium comprising a stored program, wherein the program when run performs the method of any one of the preceding claims 1 to 5.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs the method according to any of the preceding claims 1 to 5 by means of the computer program.
CN201711088170.XA 2017-11-07 2017-11-07 Text resource classification method and device, storage medium and electronic device Active CN110019794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711088170.XA CN110019794B (en) 2017-11-07 2017-11-07 Text resource classification method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711088170.XA CN110019794B (en) 2017-11-07 2017-11-07 Text resource classification method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN110019794A CN110019794A (en) 2019-07-16
CN110019794B true CN110019794B (en) 2023-04-25

Family

ID=67186499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711088170.XA Active CN110019794B (en) 2017-11-07 2017-11-07 Text resource classification method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110019794B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781276B (en) * 2019-09-18 2023-09-19 平安科技(深圳)有限公司 Text extraction method, device, equipment and storage medium
CN113570380A (en) * 2020-04-28 2021-10-29 中国移动通信集团浙江有限公司 Service complaint processing method, device and equipment based on semantic analysis and computer readable storage medium
CN114186057A (en) * 2020-09-15 2022-03-15 智慧芽(中国)科技有限公司 Automatic classification method, device, equipment and storage medium based on multi-type texts
CN112199499B (en) * 2020-09-29 2024-06-18 京东方科技集团股份有限公司 Text division method, text classification method, device, equipment and storage medium
CN112883190A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN113177595B (en) * 2021-04-29 2022-07-12 北京明朝万达科技股份有限公司 Document classification model construction, training and testing method and model construction system
CN113792150B (en) * 2021-11-15 2022-02-11 湖南科德信息咨询集团有限公司 Man-machine cooperative intelligent demand identification method and system
CN114281992A (en) * 2021-12-22 2022-04-05 北京朗知网络传媒科技股份有限公司 Automobile article intelligent classification method and system based on media field
CN115187153B (en) * 2022-09-14 2022-12-09 杭银消费金融股份有限公司 Data processing method and system applied to business risk tracing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN104054075A (en) * 2011-12-06 2014-09-17 派赛普申合伙公司 Text mining, analysis and output system
CN103577462A (en) * 2012-08-02 2014-02-12 北京百度网讯科技有限公司 Document classification method and document classification device
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text classification method based on self-training and LDA topic models; Miha Pavlinek et al.; Expert Systems with Applications; Vol. 80; 83-93 *
Microblog text classification based on the MRT-LDA model; Pang Xiongwen et al.; Computer Science; Vol. 41, No. 8; 236-241, 259 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11861301B1 (en) * 2023-03-02 2024-01-02 The Boeing Company Part sorting system

Also Published As

Publication number Publication date
CN110019794A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110019794B (en) Text resource classification method and device, storage medium and electronic device
CN110162593B (en) Search result processing and similarity model training method and device
CN110309427B (en) Object recommendation method and device and storage medium
Alam et al. Processing social media images by combining human and machine computing during crises
CN106649818B (en) Application search intention identification method and device, application search method and server
CN106951422B (en) Webpage training method and device, and search intention identification method and device
US9087297B1 (en) Accurate video concept recognition via classifier combination
AU2011326430B2 (en) Learning tags for video annotation using latent subtags
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
US9176969B2 (en) Integrating and extracting topics from content of heterogeneous sources
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110209809B (en) Text clustering method and device, storage medium and electronic device
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
CN111557000B (en) Accuracy Determination for Media
Bounabi et al. A comparison of text classification methods using different stemming techniques
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
US9454568B2 (en) Method, apparatus and computer storage medium for acquiring hot content
CN111859079B (en) Information searching method, device, computer equipment and storage medium
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN114490923A (en) Training method, device and equipment for similar text matching model and storage medium
CN110413770B (en) Method and device for classifying group messages into group topics
CN106446696B (en) Information processing method and electronic equipment
WO2023155304A1 (en) Keyword recommendation model training method and apparatus, keyword recommendation method and apparatus, device, and medium
CN116108181A (en) Client information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant