CN110347821B - Text category labeling method, electronic equipment and readable storage medium - Google Patents

Text category labeling method, electronic equipment and readable storage medium

Info

Publication number
CN110347821B
CN110347821B (application CN201910456149.3A)
Authority
CN
China
Prior art keywords
text
category
candidate
classification model
annotated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910456149.3A
Other languages
Chinese (zh)
Other versions
CN110347821A (en)
Inventor
过弋
张振豪
王志宏
樊振
韩美琪
王家辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Shihezi University
Original Assignee
East China University of Science and Technology
Shihezi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology and Shihezi University
Priority to CN201910456149.3A
Publication of CN110347821A
Application granted
Publication of CN110347821B
Legal status: Active
Anticipated expiration: as listed

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of this application relate to the field of computer technology and disclose a method, an electronic device, and a readable storage medium for category labeling over a text category hierarchy. The labeling method comprises: searching, in combination with cognitive factors, for the candidate categories corresponding to the text to be annotated; determining a classification model according to the candidate categories, where the classification model is trained on first sample texts and the bottom subcategory corresponding to each first sample text, the bottom subcategories lie at the bottom of the category hierarchy containing the candidate categories, the candidate categories are parent categories at the top of that hierarchy, and the hierarchy comprises at least two levels of categories; and determining the actual text category of the text to be annotated from the text and the classification model, then labeling the text according to that actual category. These embodiments reduce the probability of labeling errors and improve labeling accuracy.

Description

Text category labeling method, electronic equipment and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text category labeling method, electronic equipment and a readable storage medium.
Background
With the rapid growth of the internet, people depend increasingly on the network for information. However, the rapid increase in the volume of text data greatly affects both the efficiency and the quality of information retrieval. To support text retrieval, text data is typically labeled by category; news text, for example, is labeled by category (sports, entertainment, and so on) so that relevant text can be retrieved quickly and accurately. Early text labeling was rooted in human cognition: people judged the category of a text from experience accumulated in daily life and from certain inference rules. With the development of computer technology, researchers have sought to endow machines with artificial intelligence (AI) so that, through learning, a machine accumulates experience and judges text categories automatically. Work in this direction has been highly successful, to the point that people rely directly on algorithmic decisions and ignore other factors.
The inventors have found at least the following problem in related text classification techniques: because the volume of text data is now large, and the number of categories grows with the data, the probability of errors during automatic category labeling increases greatly, while the speed of labeling decreases.
Disclosure of Invention
The embodiments of this application aim to provide a text category labeling method, an electronic device, and a readable storage medium that reduce the probability of labeling errors and improve labeling accuracy.
To solve the above technical problem, an embodiment of this application provides a text category labeling method comprising: searching for the candidate categories corresponding to the text to be annotated; determining a classification model according to the candidate categories, where the classification model is trained on first sample texts and the bottom subcategory corresponding to each first sample text, the bottom subcategories are at the bottom of the category hierarchy containing the candidate categories, the candidate categories are parent categories at the top of that hierarchy, and the hierarchy comprises at least two levels of categories; and determining the actual text category of the text to be annotated according to the text and the classification model, then labeling the text according to that actual category.
An embodiment of this application also provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor that enable it to perform the text category labeling method described above.
An embodiment of this application also provides a computer-readable storage medium storing a computer program; when the program is executed by a processor, the text category labeling method described above is realized.
Compared with the prior art, in which the number of text categories is huge, searching for the candidate categories corresponding to the text to be annotated reduces the set of categories that must be considered. During the search, the obtained cognitive factors are combined with the similarity values, correcting the similarities and improving the accuracy of the search. Because the number of bottom subcategories under the candidate categories is far smaller than the number of all bottom subcategories, the classification model can quickly determine the actual text category from the candidate categories' bottom subcategories, speeding up category determination. The result produced by the machine learning algorithm is used as a cognitive factor rather than relied on outright, and integrating the similarity with the cognitive factor improves the accuracy of the candidate search. Determining the classification model from the candidate categories of the text to be annotated makes the model more targeted, improving the accuracy of category determination. Finally, because the candidate categories are parent categories at the top of a hierarchy of at least two levels, the subcategories at each layer need not be judged level by level; all bottom subcategories of the candidate categories are obtained directly, which reduces the impact of per-level judgment errors, lowers the probability of labeling errors, and improves labeling accuracy.
In addition, searching for the candidate categories corresponding to the text to be annotated specifically comprises: acquiring the cognitive-factor set of the candidate categories of the text, where the set may include the matched candidate categories and the initial probability value corresponding to each; calculating the similarity between the text's category and each candidate category and gathering the similarities into a similarity set; determining a candidate-category probability set from the cognitive-factor set and the similarity set; and selecting the candidate categories of the text according to a preset rule and the probability set. Combining the cognitive-factor set with the similarity set greatly improves the accuracy of finding the candidate categories corresponding to the text.
In addition, acquiring the cognitive-factor set of the candidate categories of the text to be annotated specifically comprises: inputting the text into a preset initial candidate-category classification model to obtain the cognitive-factor set, where the initial model is trained on second sample texts and the candidate category corresponding to each second sample text. With this pre-built initial model, the cognitive-factor set of the text can be determined rapidly.
In addition, selecting the candidate categories of the text to be annotated according to a preset rule and the candidate-category probability set specifically comprises: selecting from the probability set the values larger than a preset threshold and taking the corresponding candidate categories as the candidate categories of the text; or arranging the probability values in descending order, selecting a preset number of them from the sorted set, and taking the corresponding candidate categories. Selecting the candidate categories with larger probability values improves the speed and accuracy of the subsequent determination of the actual text category.
In addition, determining the classification model according to the candidate categories specifically comprises: acquiring, from the candidate categories and their corresponding category hierarchy, the bottom-subcategory set formed by all bottom subcategories in that hierarchy; and determining the classification model from the bottom-subcategory set. Selecting a classification model that corresponds to the bottom subcategories improves the speed of determining the category of the text to be annotated.
In addition, determining the classification model from the bottom-subcategory set specifically comprises: determining it from that set and a preset correspondence between bottom-subcategory sets and classification models. With this correspondence, the classification model can be determined rapidly.
In addition, the training process of the classification model specifically comprises: acquiring, from the bottom-subcategory set, the first sample set corresponding to it, which contains the first sample texts for each bottom subcategory; taking each first sample text as the model's input data and its bottom subcategory as the output data; and training the model from that input and output. Because the model is trained on the sample set that corresponds to the bottom-subcategory set, it is more targeted, which speeds up determining the actual category of the text to be annotated.
In addition, after training and before determining the classification model from the bottom-subcategory set, the method further comprises storing the trained model together with the correspondence between it and the bottom-subcategory set. Storing each trained model continuously enriches the variety of models, improving the accuracy of later determinations of the actual text category.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the accompanying figures, in which like references indicate similar elements; the figures are not to be taken in a limiting sense unless otherwise indicated.
FIG. 1 is a specific flow chart of a method for text category labeling according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a class hierarchy provided in accordance with a first embodiment of the present application;
FIG. 3 is a schematic view of a class hierarchy corresponding to a candidate class according to a first embodiment of the present application;
FIG. 4 is a schematic flow chart of training a classification model in a text class labeling method according to a second embodiment of the present application;
fig. 5 is a schematic diagram of a specific structure of an electronic device according to a third embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments clearer, the embodiments of this application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in order to provide a better understanding of the application; the claimed application may nevertheless be practiced without these specific details and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not be construed as limiting the specific implementation of the present application, and the embodiments can be mutually combined and referred to without contradiction.
The first embodiment of the application relates to a text category labeling method. It can be applied to electronic equipment with a search function, such as a news client or a search-engine device. Automatically labeling text categories helps users quickly locate relevant texts when searching and improves search speed. The text to be annotated may be news, papers, magazines, and so on. A specific flow of the method is shown in fig. 1.
Step 101: searching candidate categories corresponding to the text to be annotated.
Specifically, to facilitate text management, text categories generally have a hierarchical structure in which each level indicates a different scope of text content. The hierarchy is described below with fig. 2. The categories at the top layer are parent categories, whose indicated scope is the widest; for example, A may represent sports news and B entertainment news. Categories C and D are on layer L2: C may be ball-game news and D skating news. Layer L3 holds the specific categories, for example H for football, I for badminton, and G for baseball; E is a further bottom-level subcategory.
Each parent category includes multiple layers of subcategories, with lower layers of subcategories indicating more specific ranges of text content and more relevant to the corresponding text.
In a specific implementation, the process of searching for the candidate categories of the text to be annotated is as follows: acquire the cognitive-factor set of the text's candidate categories, which comprises the matched candidate categories and the initial probability value corresponding to each; calculate the similarity between the text's category and each stored category and gather the similarities into a similarity set; determine the candidate-category probability set from the cognitive-factor set and the similarity set; and select the candidate categories of the text according to a preset rule and the probability set.
To calculate the similarity between the category of the text to be annotated and each stored category, consider the similarity between the text and a candidate category c:
First, acquire the term frequency-inverse document frequency (TF-IDF) set D over a corpus, the corpus being a collection of texts. Compute the tf-idf vector w of the text to be annotated against D. Compute the sum of the tf-idf vectors of the texts in D that belong to candidate category c, denoted sum(c). Finally, compute the similarity between w and sum(c); that value is taken as the similarity between the text's category and candidate category c.
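The tf-idf and sum(c) computation described above can be sketched in pure Python as follows. This is an illustrative sketch only; the function names (`tfidf_vectors`, `category_similarity`) and the cosine choice of similarity are assumptions, not taken from the patent.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one sparse tf-idf dict per document
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def category_similarity(text_vec, category_doc_vecs):
    # sum(c): element-wise sum of the tf-idf vectors of texts in category c
    total = Counter()
    for v in category_doc_vecs:
        for w, x in v.items():
            total[w] += x
    return cosine(text_vec, dict(total))
```

A text about football should then be more similar to a sports category than to a finance one, which is what the search step relies on.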
Specifically, the set of cognitive factors for the candidate class of the text to be annotated may be determined by the initial candidate classification model. Inputting the text to be annotated into a preset initial candidate class classification model to obtain a candidate class cognition factor set of the text to be annotated, wherein the initial candidate class classification model is obtained according to each second sample text and the candidate class training corresponding to each second sample text.
The second sample texts may be sample texts manually labeled with a parent category. Each second sample text serves as the input data of the initial candidate-category classification model, the labeled parent category serves as the output data, and the model is trained from this input and output.
The manner in which the initial candidate classification model is trained is described below in a specific example:
Collect news texts and manually label each collected text with a parent category. Take the collected texts as second sample texts, the texts as input data, and the corresponding labeled parent categories as output data, then train the initial candidate-category model with a machine learning algorithm, which may be naive Bayes, logistic regression, a long short-term memory neural network, and so on.
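A minimal sketch of training such an initial candidate-category model, using multinomial naive Bayes (one of the algorithms the text names). The class name and interface are assumptions made for illustration, not the patent's implementation.

```python
import math
from collections import Counter

class NaiveBayesCategorizer:
    """Minimal multinomial naive Bayes for the initial candidate-category model."""

    def fit(self, texts, labels):
        # texts: list of token lists (second sample texts); labels: parent categories
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, c in zip(texts, labels):
            self.word_counts[c].update(tokens)
        self.vocab = {w for wc in self.word_counts.values() for w in wc}
        return self

    def predict_proba(self, tokens):
        # log-posterior per class with Laplace smoothing, normalised to probabilities
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            s = self.prior[c]
            for w in tokens:
                s += math.log((self.word_counts[c][w] + 1) / total)
            scores[c] = s
        m = max(scores.values())
        exp = {c: math.exp(s - m) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}
```

The normalised posteriors returned by `predict_proba` play the role of the initial probability values, i.e. the cognitive factors, of the matched parent categories.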
It may be appreciated that the cognitive-factor set contains the candidate categories matching the text to be annotated and the initial probability values of those categories; each cognitive factor is the initial probability value of the corresponding candidate category. That is, the initial candidate-category model may output initial probability values for one or more parent categories. For example, inputting a text I into the model may output the cognitive factor for I belonging to parent category A and the cognitive factor for I belonging to parent category B.
The product of the similarity and the cognitive factor is calculated for each category, and the set of these products is taken as the candidate-category probability set. For example, suppose there are m parent categories c1, c2, …, cm. The cognitive-factor set, i.e. the initial probability values of the candidate categories of the text to be annotated, can be written P = {p1, p2, …, pm}; the similarities between the text and the m candidate categories form U = {μ1, μ2, …, μm}; and the candidate-category probability set is then P_s = {p1μ1, p2μ2, …, pmμm}.
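The element-wise combination of cognitive factors and similarities can be sketched in one small function; the function name and the dict-based representation are illustrative assumptions.

```python
def candidate_probability_set(cognitive_factors, similarities):
    """Element-wise product p_i * mu_i over the candidate categories.

    Both arguments map a candidate-category name to a value, so the
    result is the candidate-category probability set P_s."""
    if cognitive_factors.keys() != similarities.keys():
        raise ValueError("category sets must match")
    return {c: cognitive_factors[c] * similarities[c] for c in cognitive_factors}
```

This is the step that corrects each machine-produced initial probability with the corpus similarity instead of relying on either signal alone.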
In order to facilitate the subsequent quick determination of the actual text category of the text to be annotated, the candidate category of the text to be annotated can be selected according to a preset rule and a candidate category probability set.
In a specific implementation, the selection follows a preset rule, which can be set as needed. For example, the probability values in the candidate-category probability set may be arranged in descending order, a preset number of them (at least 2) selected from the sorted set, and the candidate categories corresponding to the selected values taken as the candidate categories of the text to be annotated. The preset number can be set according to actual needs.
Alternatively, the probability values larger than a preset threshold may be selected from the candidate-category probability set, and the corresponding candidate categories taken as the candidate categories of the text to be annotated. The threshold can be set as required; the larger it is, the more interfering candidate categories are filtered out.
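The two preset selection rules just described can be sketched as follows; the function and parameter names are assumptions made for illustration.

```python
def select_candidates(prob_set, threshold=None, top_k=None):
    """Apply one of the two preset rules from the text:
    keep the categories whose combined value exceeds a threshold, or
    keep the top_k categories after sorting in descending order."""
    if threshold is not None:
        return sorted((c for c, p in prob_set.items() if p > threshold),
                      key=prob_set.get, reverse=True)
    ranked = sorted(prob_set, key=prob_set.get, reverse=True)
    return ranked[:top_k]
```

Either rule leaves a short, high-confidence candidate list, which is what keeps the later per-candidate classification fast.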
Step 102: determining a classification model according to the candidate category; the classification model is obtained through training according to each first sample text and the bottom sub-category corresponding to each first sample text, wherein the bottom sub-category is positioned at the bottom of the category hierarchy structure where the candidate category is positioned, the candidate category is positioned at the top of the category hierarchy structure, the candidate category is a father category, and the category hierarchy structure at least comprises 2 layers of categories.
In a specific implementation, according to a candidate category and a category hierarchy structure corresponding to the candidate category, acquiring a bottom subcategory set formed by all bottom subcategories in the category hierarchy structure; and determining a classification model according to the bottom subcategory set.
Specifically, the category hierarchy corresponding to the candidate category is preset, after the candidate category is determined, the category hierarchy corresponding to the candidate category can be obtained, and all the bottom sub-categories in the category hierarchy corresponding to the candidate category form a bottom sub-category set. Wherein the category hierarchy includes at least 2 levels of categories. The process of determining the underlying subcategory set is described below in one specific example:
Suppose the search yields candidate categories A and B, with the category hierarchy shown in fig. 3. Ignoring the intermediate subcategories, all bottom subcategories of the candidates are obtained: in fig. 3 these are a1 to a3 and b1 to b4, which combine into the bottom-subcategory set G = {a1, a2, a3, b1, b2, b3, b4}.
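Gathering the bottom-subcategory set amounts to a leaf traversal of the hierarchy. The dictionary representation of the hierarchy below is an assumption of this sketch, not the patent's data structure.

```python
def bottom_subcategories(hierarchy, candidates):
    """Collect all bottom (leaf) subcategories reachable from the candidate
    parent categories, skipping the intermediate layers.

    hierarchy maps a category to the list of its direct subcategories;
    a category absent from the map (or with no children) is a leaf."""
    leaves = set()

    def walk(node):
        children = hierarchy.get(node, [])
        if not children:
            leaves.add(node)
        for child in children:
            walk(child)

    for candidate in candidates:
        walk(candidate)
    return leaves
```

Run on a hierarchy shaped like the fig. 3 example, this yields exactly the set G above.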
There are various ways to determine the classification model based on the underlying subcategory set. In a specific implementation, the classification model is determined according to the bottom subcategory set and a preset corresponding relation between the bottom subcategory set and the classification model.
Specifically, according to all the bottom subcategories in the overall category hierarchy structure of the text, the classification model corresponding to each possible bottom subcategory set can be trained in advance. And storing the corresponding relation between the preset bottom subcategory set and the classification model, so that after the bottom subcategory set is determined, a proper classification model can be quickly determined. Wherein the stored classification models may be aggregated to form a classification model set.
A new classification model may also be trained based on the scale of the bottom-subcategory set and the text data corresponding to it, and the newly trained model then stored.
For example, let the bottom-subcategory set be G, the stored classification-model set be C, and the text to be annotated be t. Search C for a classification model corresponding to G; if one exists, acquire it and use it to determine the actual text category of t. If none exists, acquire training data corresponding to G according to its scale, train a new classification model B, and add B to the stored set C.
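The lookup-or-train logic over the stored model set C can be sketched as a cache keyed by the frozen bottom-subcategory set. `train_fn` stands in for whatever training routine is used; its interface, like the class name, is an assumption of this sketch.

```python
class ClassifierCache:
    """Stored classification-model set C, keyed by bottom-subcategory set.

    train_fn is a caller-supplied callback that trains a new model when no
    stored model matches the requested subcategory set."""

    def __init__(self, train_fn):
        self.models = {}          # frozenset of subcategories -> trained model
        self.train_fn = train_fn

    def get(self, subcategory_set):
        key = frozenset(subcategory_set)
        if key not in self.models:
            self.models[key] = self.train_fn(key)   # train model B, add to C
        return self.models[key]
```

Using `frozenset` makes the lookup independent of the order in which the bottom subcategories were gathered, so the same set always maps to the same stored model.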
It will be appreciated that when the number of candidate categories is less than n (n set according to actual needs, e.g. n = 100), the classification model may be obtained by training with a conventional machine learning method; when the number is greater than or equal to n, it may be obtained by training with a deep learning method, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a generative adversarial network (GAN).
Step 103: and determining the actual text category of the text to be annotated according to the text to be annotated and the classification model, and annotating the text to be annotated according to the actual text category.
The text to be annotated is used as the input data of the classification model to obtain its actual text category, and the text is then labeled with that determined category.
Compared with the prior art, in which the number of text categories is huge, searching for the candidate categories corresponding to the text to be annotated reduces the set of categories that must be considered. During the search, the obtained cognitive factors are combined with the similarity values, correcting the similarities and improving the accuracy of the search for each category. Because the number of bottom subcategories of the candidate categories is far smaller than the number of all bottom subcategories, the classification model can rapidly determine the actual text category from them, speeding up category determination. Determining the classification model from the candidate categories of the text makes the model more targeted and improves accuracy. And because the candidate categories are parent categories at the top of a hierarchy of at least two levels, the subcategories at each layer need not be judged level by level; all bottom subcategories of the candidates are obtained directly, reducing the impact of per-level judgment errors, lowering the probability of labeling errors, and improving labeling accuracy.
A second embodiment of the present application relates to a method of text category labeling. The second embodiment is a further improvement on the first embodiment, the main improvement being that in the second embodiment, a classification model corresponding to the bottom subcategory set is obtained through training according to the bottom subcategory set, and the classification model is saved after training, so as to enrich the number of classification models and reduce the labeling cost. The flow of training the classification model in this embodiment is shown in fig. 4.
Step 201: acquiring a first text sample set corresponding to the bottom subcategory set according to the bottom subcategory set, wherein the first text sample set comprises a first sample text corresponding to each bottom subcategory.
Specifically, a first text sample set corresponding to the bottom subcategory set may be collected according to the bottom subcategory set, where the first text sample set includes first sample texts labeled with their bottom subcategories.
Step 202: taking each first sample text as input data of the classification model, and taking the bottom subcategory corresponding to each first sample text as output data of the classification model.
Step 203: training to obtain the classification model according to the input data and the output data.
Specifically, the input data are the first sample texts and the output data are the bottom subcategories corresponding to each first sample text. The classification model continuously optimizes the parameters of the correspondence between the input data and the output data through machine learning, and the classification model is finally obtained through training. The machine learning method may be a support vector machine, a random forest, a long short-term memory neural network, or the like.
Step 204: and storing the classification model obtained by training and the corresponding relation between the classification model and the bottom subcategory set.
Specifically, after the classification model is obtained, the classification model and the correspondence between the classification model and the bottom subcategory set can be saved, which enriches the number of classification models and allows the category of the text to be annotated to be judged more accurately.
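Steps 201 to 204 above can be sketched as follows. This is a minimal illustration assuming a TF-IDF feature extractor with a linear support vector machine (one of the machine learning methods mentioned above) via scikit-learn; the sample texts, the bottom subcategory labels, and the `model_store` lookup structure are hypothetical stand-ins, not the patent's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 201: a first text sample set, each first sample text labeled
# with its bottom subcategory (hypothetical examples).
samples = [
    "stock prices rose sharply",
    "the striker scored twice",
    "bond yields fell",
    "the goalkeeper saved a penalty",
]
labels = [
    "finance/markets",
    "sports/football",
    "finance/markets",
    "sports/football",
]

# Steps 202-203: the sample texts are the input data and the bottom
# subcategories are the output data; fitting optimizes the parameters
# of the correspondence between them.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(samples, labels)

# Step 204: save the trained model together with its bottom subcategory
# set, so a later query can look up the model from the bottom
# subcategory set of a candidate category.
model_store = {frozenset(labels): model}
```

The saved correspondence lets a subsequent labeling run retrieve a targeted classifier for exactly the bottom subcategories under the matched candidate categories, rather than one model over all categories.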
According to the text category labeling method provided by this embodiment, the classification model is obtained through training according to the first text sample set corresponding to the bottom subcategory set, so that the classification model is more targeted and the speed of determining the actual text category of the text to be annotated by using the classification model is improved. The trained classification model is stored in real time, so the variety of classification models can be continuously enriched, improving the accuracy of subsequently determining the actual text category of the text to be annotated with a classification model.
The above division of method steps is for clarity of description. When implemented, steps may be combined into one step or a single step may be split into multiple steps; as long as the same logical relationship is included, they are within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow, is also within the scope of this patent.
A third embodiment of the present application relates to an electronic device, a specific structure of which is shown in fig. 5, comprising: at least one processor 301; and a memory 302 communicatively coupled to the at least one processor 301. The memory 302 stores instructions executable by the at least one processor 301, the instructions being executable by the at least one processor 301 to enable the at least one processor 301 to perform the method of text category labeling described above.
The memory 302 and the processor 301 are connected by a bus. The bus may comprise any number of interconnected buses and bridges that link together various circuits of the one or more processors 301 and the memory 302. The bus may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore are not further described herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium via an antenna, and the antenna also receives data and transmits it to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 302 may be used to store data used by the processor 301 in performing operations.
Those skilled in the art will appreciate that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, the program including instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application and that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (9)

1. A method for labeling text categories, comprising:
searching candidate categories corresponding to the text to be annotated;
determining a classification model according to the candidate category, wherein the classification model is obtained through training according to each first sample text and a bottom sub-category corresponding to each first sample text, the bottom sub-category is positioned at the bottom of a category hierarchy structure where the candidate category is positioned, the candidate category is positioned at the top of the category hierarchy structure and is a father category, and the category hierarchy structure at least comprises 2 layers of categories;
determining the actual text category of the text to be annotated according to the text to be annotated and the classification model, and annotating the text to be annotated according to the actual text category;
the searching for the candidate category corresponding to the text to be annotated specifically comprises the following steps:
acquiring a cognitive factor set of the candidate categories of the text to be annotated, wherein the cognitive factor set comprises matched candidate categories and an initial probability value corresponding to each matched candidate category;
calculating the similarity between the category of the text to be annotated and each candidate category, and gathering each similarity to obtain a similarity set;
determining a candidate category probability set according to the cognitive factor set and the similarity set;
wherein the determining the candidate category probability set according to the cognitive factor set and the similarity set comprises:
calculating the product of the similarity and the cognitive factor corresponding to each candidate category according to the cognitive factor set and the similarity set, and determining the set formed by the calculated products as the candidate category probability set;
and selecting the candidate category of the text to be annotated according to a preset rule and the candidate category probability set.
2. The method for labeling text categories according to claim 1, wherein the acquiring the cognitive factor set of the candidate categories of the text to be annotated specifically comprises:
inputting the text to be annotated into a preset initial candidate category classification model to obtain the cognitive factor set of the candidate categories of the text to be annotated, wherein the initial candidate category classification model is obtained through training according to each second sample text and the candidate category corresponding to each second sample text.
3. The method for labeling text categories according to claim 1, wherein selecting the candidate category of the text to be annotated according to a preset rule and the candidate category probability set specifically comprises:
selecting a probability value larger than a preset threshold from the candidate category probability set, and taking a candidate category corresponding to the selected probability value as the candidate category of the text to be annotated;
or,
or arranging the probability values in the candidate category probability set in descending order, selecting a preset number of probability values from the ordered set, and taking the candidate categories corresponding to the selected probability values as the candidate categories of the text to be annotated, wherein the preset number is at least 2.
4. A method of text category labeling according to any of claims 1-3, characterized in that the determining a classification model from the candidate categories comprises in particular:
acquiring a bottom subcategory set formed by all bottom subcategories in a category hierarchy according to the candidate category and the category hierarchy corresponding to the candidate category;
and determining a classification model according to the bottom subcategory set.
5. The method for labeling text categories according to claim 4, wherein determining a classification model according to the bottom subcategory set specifically comprises:
and determining the classification model according to the bottom subcategory set and the preset corresponding relation between the bottom subcategory set and the classification model.
6. The method for labeling text categories according to claim 4, wherein the training process of the classification model specifically comprises:
according to the bottom subcategory set, acquiring a first text sample set corresponding to the bottom subcategory set, wherein the first text sample set comprises the first sample text corresponding to each bottom subcategory;
taking each first sample text as input data of the classification model, and taking the bottom subcategory corresponding to each first sample text as output data of the classification model;
and training to obtain the classification model according to the input data and the output data.
7. The method of text category labeling of claim 6, wherein after training to obtain the classification model and before the determining a classification model from the underlying set of subcategories, the method of text category labeling further comprises:
and storing the classification model obtained through training and the corresponding relation between the classification model and the bottom subcategory set.
8. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of text category labeling of any of claims 1-7.
9. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of text category labeling of any of claims 1 to 7.
CN201910456149.3A 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium Active CN110347821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910456149.3A CN110347821B (en) 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN110347821A CN110347821A (en) 2019-10-18
CN110347821B true CN110347821B (en) 2023-08-25

Family

ID=68174432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910456149.3A Active CN110347821B (en) 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110347821B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680155A (en) * 2020-05-13 2020-09-18 新华网股份有限公司 Text classification method and device, electronic equipment and computer storage medium
CN112001169B (en) * 2020-07-17 2022-03-25 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and readable storage medium
CN112417857A (en) * 2020-12-02 2021-02-26 北京华彬立成科技有限公司 Patent text analysis method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162155A (en) * 2015-03-02 2016-09-05 本田技研工業株式会社 Electronic manual display system, terminal device and program
CN107273295A (en) * 2017-06-23 2017-10-20 中国人民解放军国防科学技术大学 A kind of software problem reporting sorting technique based on text randomness
CN107679035A (en) * 2017-10-11 2018-02-09 石河子大学 A kind of information intent detection method, device, equipment and storage medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109614703A (en) * 2018-12-11 2019-04-12 南京天航智能装备研究院有限公司 A kind of multi- disciplinary integrated modeling of the electric-hydraulic combined steering system of automobile and optimization method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024408B2 (en) * 2002-07-03 2006-04-04 Word Data Corp. Text-classification code, system and method
US20100169243A1 (en) * 2008-12-27 2010-07-01 Kibboko, Inc. Method and system for hybrid text classification
US20180060728A1 (en) * 2016-08-31 2018-03-01 Microsoft Technology Licensing, Llc Deep Embedding Forest: Forest-based Serving with Deep Embedding Features
US10896385B2 (en) * 2017-07-27 2021-01-19 Logmein, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a short text classification method based on keyword similarity; Zhang Zhenhao et al.; Application Research of Computers; Vol. 37, No. 1, pp. 26-29 *

Also Published As

Publication number Publication date
CN110347821A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN108463795B (en) Self-service classification system
El Kourdi et al. Automatic Arabic document categorization based on the Naïve Bayes algorithm
CN105893609B (en) A kind of mobile APP recommended method based on weighted blend
CN106940726B (en) Creative automatic generation method and terminal based on knowledge network
CN110347821B (en) Text category labeling method, electronic equipment and readable storage medium
CN108845988B (en) Entity identification method, device, equipment and computer readable storage medium
CN106997341B (en) A kind of innovation scheme matching process, device, server and system
KR102069621B1 (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN106294783A (en) A kind of video recommendation method and device
US20080168056A1 (en) On-line iterative multistage search engine with text categorization and supervised learning
US11256991B2 (en) Method of and server for converting a categorical feature value into a numeric representation thereof
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
JP2008257732A (en) Method for document clustering or categorization
CN106874292A (en) Topic processing method and processing device
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
CN112650923A (en) Public opinion processing method and device for news events, storage medium and computer equipment
CN110032631B (en) Information feedback method, device and storage medium
CN110019794A (en) Classification method, device, storage medium and the electronic device of textual resources
US20190287018A1 (en) Categorization for a global taxonomy
CN110347701B (en) Target type identification method for entity retrieval query
CN103412888A (en) Point of interest (POI) identification method and device
CN114238573B (en) Text countercheck sample-based information pushing method and device
CN111159414A (en) Text classification method and system, electronic equipment and computer readable storage medium
CN110008309A (en) A kind of short phrase picking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant