CN110851598A - Text classification method and device, terminal equipment and storage medium - Google Patents

Publication number
CN110851598A
Authority
CN
China
Prior art keywords
word
text
texts
identified
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911045874.8A
Other languages
Chinese (zh)
Other versions
CN110851598B (en)
Inventor
赵洋
王宇
王亚奇
王瑗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Value Online Information Technology Co Ltd
Original Assignee
Shenzhen Value Online Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Value Online Information Technology Co Ltd
Priority to CN201911045874.8A
Publication of CN110851598A
Application granted
Publication of CN110851598B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

The embodiments of the present application relate to the technical field of text processing and provide a text classification method, a text classification apparatus, a terminal device, and a storage medium. The method comprises: acquiring a plurality of texts to be classified; identifying, according to the object name of an object to be identified, a target text having an association relation with the object to be identified from the plurality of texts to be classified, wherein the object to be identified has corresponding attribute information; determining a keyword set of the object to be identified based on the target text; generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified that have the same attribute information; and classifying the plurality of texts to be classified according to the feature word set. The embodiments can make use of a large amount of unlabeled text data for text classification and label acquisition.

Description

Text classification method and device, terminal equipment and storage medium
Technical Field
The present application belongs to the technical field of text processing, and in particular, to a text classification method, apparatus, terminal device, and storage medium.
Background
With the development of information technology, a large amount of data is constantly generated on the internet, and news content is one of them.
Generally, staff of companies or enterprises who want to follow the development of an industry can search for news carrying that industry's label by industry category. However, news labeling currently relies on supervised learning: a large number of news texts must be collected in advance and labeled manually to form positive and negative samples, after which other news texts are labeled by machine learning. The whole process consumes a large amount of manpower and material resources and is inefficient.
Disclosure of Invention
In view of this, embodiments of the present application provide a text classification method, an apparatus, a terminal device, and a storage medium, so as to solve the problems in the prior art that classification of texts can only be implemented in a supervised learning manner, and a large amount of sample texts need to be labeled in a manual manner, which is time-consuming, labor-consuming, and inefficient.
A first aspect of an embodiment of the present application provides a text classification method, including:
acquiring a plurality of texts to be classified;
according to the object name of the object to be identified, identifying a target text having an association relation with the object to be identified from the plurality of texts to be classified, wherein the object to be identified has corresponding attribute information;
determining a keyword set of the object to be recognized based on the target text;
generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
and classifying the plurality of texts to be classified according to the feature word set.
A second aspect of an embodiment of the present application provides a text classification apparatus, including:
the acquisition module is used for acquiring a plurality of texts to be classified;
the identification module is used for identifying a target text which has an association relation with the object to be identified from the plurality of texts to be classified according to the object name of the object to be identified, wherein the object to be identified has corresponding attribute information;
the determining module is used for determining a keyword set of the object to be recognized based on the target text;
the generating module is used for generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
and the classification module is used for classifying the plurality of texts to be classified according to the characteristic word set.
A third aspect of embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the text classification method according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the text classification method according to the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
According to the embodiments of the present application, a plurality of texts to be classified are acquired, and a target text having an association relation with an object to be identified is identified from them according to the object name of the object to be identified. A keyword set of the object to be identified can then be determined based on the target text, and the keyword sets of a plurality of objects to be identified that have the same attribute information are aggregated into a feature word set corresponding to that attribute information, so that the texts can be classified according to the resulting feature word set. First, the embodiments can extract target text content that is strongly related to the object to be identified from a large amount of text that does not need to be labeled, which makes effective use of unlabeled text data for text classification and label acquisition; compared with a supervised approach, this saves a large amount of manpower and material resources and suits a wider range of application scenarios. Second, keywords associated with the object to be identified are extracted from the target text to form a keyword set for that object, and the feature word set corresponding to the attribute information is then generated according to the attribute information of the object; this makes convenient use of the relationship between the attribute information and the object to be identified and speeds up the generation of the feature word set.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a flow chart illustrating steps of a method for classifying text according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating steps of another method of text classification according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text classification apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The technical solution of the present application will be described below by way of specific examples.
Referring to fig. 1, a schematic flow chart illustrating steps of a text classification method according to an embodiment of the present application is shown, which may specifically include the following steps:
S101, obtaining a plurality of texts to be classified;
It should be noted that this embodiment may be applied to a terminal device. That is, the texts are processed by the terminal device so that each text is classified. The terminal device in this embodiment may be a desktop computer, a notebook computer, or another electronic device, and the specific type of the terminal device is not limited in this embodiment.
The texts to be classified in this embodiment may be news texts, such as sports news and financial news collected from various news websites, or other types of announcement texts collected through public channels, such as the annual reports of listed companies. The type of text is not limited in this embodiment.
S102, identifying a target text having an association relation with an object to be identified from the plurality of texts to be classified according to the object name of the object to be identified;
In this embodiment, the object name of the object to be identified may be a company name. That is, the target texts containing the name of a certain company can be found among the texts to be classified by identifying that company name.
It should be noted that there may be many texts containing a company name. For example, in a large number of news texts, many news may contain a company name, but the whole news is not about the company as the main report object. Therefore, after all news texts containing the company name are identified, the target texts with strong association relationship with the company can be continuously screened from the news texts.
In the specific implementation, whether the text has stronger relevance with the object to be recognized can be determined according to the number of times that the object name of the object to be recognized appears in a certain text; alternatively, the target text having a strong association with the object to be recognized may be screened from the multiple texts by some specific algorithm, such as Named Entity Recognition (NER) algorithm. The present embodiment does not limit the specific manner of filtering the target text.
S103, determining a keyword set of the object to be recognized based on the target text;
After the target texts are identified, a keyword set matching the object to be identified can be determined from the multiple target texts through a keyword extraction technique. The keywords in the keyword set may be words that are commonly used when describing the object to be identified.
For example, in the foregoing example, if a target news text having a strong association with a company is screened from a plurality of news texts according to the name of the company, the keywords that are most frequently used for reporting the company or describing information related to the company can be identified, and the keywords together form the keyword set of the company.
S104, generating a feature word set corresponding to attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
Generally, the object to be identified has corresponding attribute information. For example, for a company, its industry can be regarded as the attribute information of the company.
Therefore, after the keyword sets of the objects to be recognized are generated, the feature word sets corresponding to the attribute information can be obtained by summarizing the keyword sets of the objects to be recognized with the same attribute information.
For example, after the keyword set of each company is generated, the keyword sets of a plurality of companies belonging to the same industry may be collected to obtain the feature word set of the industry. Each feature word in the set of feature words described above may be considered as the word most commonly used when used to describe the industry.
And S105, classifying the plurality of texts to be classified according to the feature word set.
It should be noted that each feature word in the feature word set may be regarded as a word that can be used to describe an object to be recognized that has certain attribute information. Therefore, each text can be classified according to the feature words in the feature word set.
In the embodiments of the present application, a plurality of texts to be classified are acquired, and a target text having an association relation with an object to be identified is identified from them according to the object name of the object to be identified. A keyword set of the object to be identified can then be determined based on the target text, and the keyword sets of a plurality of objects to be identified that have the same attribute information are aggregated into a feature word set corresponding to that attribute information, so that the texts can be classified according to the resulting feature word set. First, this embodiment can extract target text content that is strongly related to the object to be identified from a large amount of text that does not need to be labeled, which makes effective use of unlabeled text data for text classification and label acquisition; compared with a supervised approach, this saves a large amount of manpower and material resources and suits a wider range of application scenarios. Second, keywords associated with the object to be identified are extracted from the target text to form a keyword set for that object, and the feature word set corresponding to the attribute information is then generated according to the attribute information of the object; this makes convenient use of the relationship between the attribute information and the object to be identified and speeds up the generation of the feature word set.
Referring to fig. 2, a schematic flow chart illustrating steps of another text classification method according to an embodiment of the present application is shown, which may specifically include the following steps:
S201, acquiring a plurality of texts to be classified;
It should be noted that this embodiment may be applied to a terminal device. That is, the texts are processed by the terminal device so that each text is classified.
For ease of understanding, this embodiment is described below by taking the classification of news texts as an example.
In this embodiment, after the texts to be classified are obtained, they may first be converted into a plain text format, and the character formats in the converted texts may be unified. For example, spaces, special characters, and the like in the converted texts are deleted.
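The preprocessing described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `normalize_text`, the use of Unicode NFKC folding to unify character forms, and the particular set of punctuation that is kept are all assumptions. Deleting every space is only sensible for text such as Chinese, where spaces do not separate words.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Unify character forms, then delete whitespace and special characters."""
    # NFKC folds full-width letters, digits and punctuation into their
    # standard forms (an assumed way of "unifying the character formats").
    text = unicodedata.normalize("NFKC", raw)
    # Delete all whitespace; suitable for text without space-separated words.
    text = re.sub(r"\s+", "", text)
    # Keep word characters (including CJK) and basic punctuation; the set of
    # characters treated as "special" here is an assumption.
    return re.sub(r"[^\w,.;:!?。、]", "", text)
```

For instance, a headline containing full-width letters, an ideographic space, and a decorative symbol comes out as a compact string with only word characters and basic punctuation.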
S202, identifying a plurality of first texts containing object names of the objects to be identified from the plurality of texts to be classified; counting the frequency of the object name of the object to be identified appearing in each first text; determining the position of the object name of the object to be recognized appearing in each first text;
In this embodiment, a Named Entity Recognition (NER) algorithm may be used to process the preprocessed news texts and filter out the target news texts that contain a specific object to be identified. The NER algorithm can process the text with a model trained by a sequence labeling algorithm on data in which the names of companies and organizations have been annotated, thereby obtaining valid company-name entities.
In a specific implementation, the NER algorithm may first be used to identify, from the plurality of texts to be classified, a plurality of first texts containing the object name of the object to be identified. For example, the first texts that mention a certain company A are identified from a plurality of news texts.
Then, the number of times company A appears in each first text and the specific position of each occurrence are counted.
S203, calculating a correlation coefficient of each first text according to the occurrence times and the occurrence positions of the object names of the objects to be recognized, and preset times weight values and position weight values; identifying a first text of which the correlation coefficient is greater than a preset correlation threshold value as a target text having an association relation with the object to be identified;
In this embodiment, corresponding weight values may be preset for the number of occurrences and for the different positions at which the object name appears in the first text, and the correlation coefficient of each first text may then be calculated from the number of occurrences, the positions of occurrence, and their respective weight values.
For example, the number of occurrences x of each company name in each news text is counted, and the position importance p of each company name in each news text is then calculated. The value of p can be preset; for example, the weight is 10 when the company name appears in the title, 8 when it appears at the head of the text, and decreases in turn for later positions. The correlation between a news text and a company can then be expressed as y = ax + bp, where a and b are the weight parameters for the number of occurrences and the position, respectively. The parameter a may be given slightly less importance than b, that is, a is smaller than b. Both a and b may be set to appropriate values by hand or learned from data, which is not limited in this embodiment.
After the relevance coefficient of each news text is calculated in the above manner, if the relevance coefficient y of a certain news is greater than a set threshold k, the news text can be considered to be strongly related to the company.
By calculating the correlation coefficient, news texts in which only a company name is mentioned in some news texts and the reported content is irrelevant to the company can be excluded.
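The relevance score y = ax + bp and the threshold test can be sketched as below. Only the weights 10 (title) and 8 (head) come from the description; the body weight of 5, the values a = 1.0 and b = 2.0, the threshold k = 20, the choice of taking p as the highest-weighted position in which the name appears, and the dictionary layout of an article are all assumptions made for illustration.

```python
# Position weights: 10 and 8 follow the description; 5 for the body is assumed.
POSITION_WEIGHT = {"title": 10, "lead": 8, "body": 5}


def relevance(name, title, lead, body, a=1.0, b=2.0):
    """Return y = a*x + b*p for one news text (a < b, both values assumed)."""
    sections = {"title": title, "lead": lead, "body": body}
    # x: total number of times the object name appears in the text.
    x = sum(section.count(name) for section in sections.values())
    # p: weight of the most prominent position in which the name appears
    # (assumed aggregation; the description does not fix one).
    p = max(
        (POSITION_WEIGHT[key] for key, section in sections.items() if name in section),
        default=0,
    )
    return a * x + b * p


def select_target_texts(name, articles, k=20.0):
    """Keep the articles whose relevance exceeds the threshold k (assumed value)."""
    return [art for art in articles if relevance(name, **art) > k]
```

With these assumed values, an article that names the company in its title scores at least 21 and is kept, while an article that mentions it once in the body scores 11 and is excluded.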
S204, segmenting the target text, and identifying the part-of-speech information of each word after segmentation; deleting the words hitting any stop word in the stop word list according to a preset stop word list; deleting the words of which the part-of-speech information does not belong to the noun according to the part-of-speech information of each word;
In a specific implementation, each target news text strongly related to a certain company can be segmented with a word segmentation tool, and useless stop words can be filtered out using a preset stop word list. Because text label words are generally nouns, the words obtained after segmentation can be filtered by part of speech so that mainly nouns are retained and words of other parts of speech are deleted.
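The stop-word and part-of-speech filtering can be sketched as follows, assuming that a word segmentation and tagging tool has already produced (word, tag) pairs. The tag convention in which noun tags start with "n", the example stop word list, and the function name `keep_content_nouns` are assumptions; the description does not name a specific tool or tag set.

```python
# Hypothetical stop word list; a real list would be loaded from a file.
STOP_WORDS = {"the", "of", "company"}


def keep_content_nouns(tagged_words, stop_words=STOP_WORDS):
    """Drop stop words, then keep only words whose tag marks a noun.

    tagged_words: iterable of (word, tag) pairs from a segmenter/tagger.
    Noun tags are assumed to start with "n" (for example "n", "ns", "nt").
    """
    return [
        word
        for word, tag in tagged_words
        if word not in stop_words and tag.startswith("n")
    ]
```

The output is the list of remaining words for one target text, which is the input to the co-occurrence step that follows.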
S205, constructing a word co-occurrence matrix based on the remaining words, and deleting the words with the word frequency less than a preset word frequency threshold value in the word co-occurrence matrix to obtain a target word co-occurrence matrix;
In this embodiment, for the remaining words in the multiple target news texts strongly related to a certain company, the keywords may be identified by constructing a word co-occurrence matrix.
In a specific implementation, the remaining words in all the target texts may be determined first. The remaining words in all the target texts are then used as the row data of a word co-occurrence matrix, and the remaining words in each target text are used as its column data. If the target word of a target column belongs to the remaining words in the current target text, the intersection of the row in which the current target text is located and the target column is marked as 1; otherwise, the intersection is marked as 0. This yields the word co-occurrence matrix, where the target column may be any column of the matrix.
That is, the deduplicated words from all news texts strongly related to a company can be assembled into a word list, and the word list forms both the rows and the columns of a matrix. When the word of a column and the word of a row appear in the same news text, the cell at that row and column is marked as 1. Repeating this for every row and column yields the word co-occurrence matrix.
Meanwhile, low-frequency words, for example words that occur fewer than 3 times, can be filtered out to reduce the amount of subsequent calculation.
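A word-by-word co-occurrence matrix of the kind described above can be sketched as follows. The sketch makes several assumptions: word frequency is taken to be the total number of occurrences across the target texts, low-frequency words are dropped before the matrix is built (which gives the same result as deleting their rows and columns afterwards), and the diagonal is left at 0. The function name and the plain list-of-lists representation are illustrative only.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs, min_count=3):
    """Build a binary word co-occurrence matrix.

    docs: list of word lists, one per target text (after filtering).
    Returns (vocab, matrix); matrix[i][j] is 1 when vocab[i] and vocab[j]
    appear together in at least one text.
    """
    counts = Counter(word for doc in docs for word in doc)
    # Drop words below the frequency threshold (3 in the description's example).
    vocab = sorted(word for word, count in counts.items() if count >= min_count)
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        present = sorted({word for word in doc if word in index})
        for u, v in combinations(present, 2):
            matrix[index[u]][index[v]] = 1
            matrix[index[v]][index[u]] = 1
    return vocab, matrix
```

The matrix is symmetric by construction, so either rows or columns can be read as the co-occurrence profile of a word.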
S206, extracting a plurality of keywords in the target word co-occurrence matrix to form a keyword set of the object to be identified;
In this embodiment, the keyword set of each object to be identified may be extracted from the target word co-occurrence matrix using a matrix decomposition technique such as Principal Component Analysis (PCA).
PCA transforms the matrix to extract the principal components from redundant features and obtain a dimensionality-reduced result; that is, the main word content is extracted from the redundant word list to form the keyword set of the object to be identified.
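One possible way to read keywords off a PCA of the co-occurrence matrix is sketched below. The description does not say how words are selected from the principal components, so ranking words by the absolute value of their loading on the first principal component is purely an assumption, as are the function names. The first component is computed with power iteration on the covariance matrix so that the sketch needs only the standard library.

```python
import math


def first_component(matrix, iterations=200):
    """Return the leading principal direction of the rows of `matrix`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Unnormalised covariance; scaling does not change the direction.
    cov = [
        [sum(r[i] * r[j] for r in centered) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    # Slightly uneven start so it is unlikely to be orthogonal to the answer.
    vec = [1.0 + 0.01 * j for j in range(n_cols)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    return vec


def top_keywords(vocab, matrix, k):
    """Rank words by |loading| on the first component (assumed criterion)."""
    loadings = first_component(matrix)
    ranked = sorted(zip(vocab, loadings), key=lambda pair: abs(pair[1]), reverse=True)
    return [word for word, _ in ranked[:k]]
```

A word that never co-occurs with the others has a constant column, contributes no variance, and therefore receives a zero loading and falls to the bottom of the ranking.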
S207, summarizing the keyword sets of the multiple objects to be identified with the same attribute information to obtain an initial feature word set; clustering the initial characteristic word set, and extracting the words with the maximum word frequency in various types to form a characteristic word set corresponding to the attribute information;
Similarly, taking a company as the object to be identified, since each company belongs to at least one industry, the industry can be regarded as attribute information of the company. Therefore, after the keyword set of each company is generated, the keyword sets of a plurality of companies belonging to the same industry can be collected together to obtain the initial feature word set of the industry.
The initial feature word set comprises a plurality of feature words and the word frequency of each feature word. The word frequency represents the importance of the feature word and can be obtained by accumulating the number of identical words when the keyword sets of the companies are merged into the feature word set of the industry.
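Merging per-company keyword sets into per-industry initial feature word sets with accumulated word frequencies can be sketched as follows. The function name and the two input dictionaries (company to keywords, company to industry) are hypothetical data layouts chosen for illustration.

```python
from collections import Counter, defaultdict


def build_initial_feature_sets(company_keywords, company_industry):
    """Merge keyword sets by industry, counting how many companies share a word.

    company_keywords: {company: iterable of keywords}
    company_industry: {company: industry name}
    Returns {industry: Counter mapping feature word -> word frequency}.
    """
    merged = defaultdict(Counter)
    for company, keywords in company_keywords.items():
        # Each company contributes a given word at most once.
        merged[company_industry[company]].update(set(keywords))
    return dict(merged)
```

A word that appears in the keyword sets of many companies in the same industry ends up with a high frequency, which matches the description's use of frequency as a measure of importance.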
Then, using a word vector model trained on a large amount of text content together with K-Means-based topic clustering, each feature word in the initial feature word set is subjected to mining and extraction, word deduplication, and semantic-independence processing, so as to obtain the final feature word set of the industry.
The mining and extraction, word deduplication, and semantic-independence processing can be understood as follows: the feature words are clustered using their word vectors, with feature words whose semantic similarity, computed from the cosine distance, exceeds a certain value (for example, 0.6) grouped into one class; the word with the largest word frequency in each class is extracted and retained, and the other words are removed. In this way, the feature word set corresponding to each industry is obtained.
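The idea of keeping only the most frequent word among semantically similar feature words can be sketched as below. Note that this sketch swaps in a simple greedy grouping by cosine similarity in place of the K-Means-based clustering mentioned above, and that the word vectors are assumed to come from some pretrained word vector model; the function names and the toy data layout are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe_by_similarity(word_freq, vectors, threshold=0.6):
    """Greedy stand-in for clustering: keep one word per similarity group.

    Words are visited from most to least frequent, so the word kept for each
    group is the one with the largest word frequency in that group.
    """
    kept = []
    for word, _ in sorted(word_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if all(cosine(vectors[word], vectors[k]) <= threshold for k in kept):
            kept.append(word)
    return kept
```

Because words are processed in descending frequency order, a near-synonym of an already kept word is discarded, leaving feature words that are semantically distinct from one another.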
And S208, classifying the plurality of texts to be classified according to the feature word set.
After the feature word set of each industry is obtained, the texts to be classified may be classified using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
Generally, TF is the ratio of the number of times a word appears in a given text to the total number of words in that text; the more often the word appears, the more important it is to the text. IDF is derived from the ratio of the number of all texts to the number of texts containing the word, commonly by taking the logarithm of that ratio. The TF-IDF value of the word is obtained by multiplying the two values. In general, the larger the TF-IDF value of a word in a text, the more important the word is in that text.
In this embodiment, for any feature word in the feature word set, a word frequency-inverse text frequency index, that is, a TF-IDF index, of the feature word in the text to be classified may be calculated, and after a target feature word whose TF-IDF index is greater than a preset index threshold is extracted, the text to be classified may be classified according to the target feature word.
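The TF-IDF thresholding step can be sketched as follows. The logarithmic form of IDF, the threshold value 0.05, and the rule that a text receives every industry label for which at least one feature word passes the threshold are assumptions; the description leaves the exact formula and the final assignment rule open. Texts are represented as lists of already segmented words.

```python
import math
from collections import Counter


def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, using a logarithmic IDF (assumed variant)."""
    tf = Counter(doc)[word] / len(doc) if doc else 0.0
    containing = sum(1 for other in corpus if word in other)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


def classify(doc, corpus, feature_sets, threshold=0.05):
    """Return {industry: target feature words} for one text.

    feature_sets: {industry: list of feature words}. A text is assigned every
    industry with at least one feature word above the threshold (assumed rule).
    """
    labels = {}
    for industry, features in feature_sets.items():
        hits = [w for w in features if tf_idf(w, doc, corpus) > threshold]
        if hits:
            labels[industry] = hits
    return labels
```

The returned target feature words can double as label words for the text, in line with the label acquisition mentioned in the description.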
According to the embodiment, a large amount of unsupervised text data is used for text classification and label acquisition, so that compared with a supervised form, a large amount of manpower and material resources are saved, and the application scene is wider; meanwhile, the embodiment makes full use of the prior knowledge of the entity relationship between the industry and the company, accelerates the data recognition of the algorithm, and combines a plurality of text processing technologies, thereby effectively solving the problem of industry news classification, and after labels are marked for the industry and news, the industry news content can be recommended according to the label word content concerned by the user, so that the user can conveniently find the text content matched with the expected content in time.
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Referring to fig. 3, a schematic diagram of a text classification apparatus according to an embodiment of the present application is shown, which may specifically include the following modules:
an obtaining module 301, configured to obtain multiple texts to be classified;
an identifying module 302, configured to identify, according to an object name of an object to be identified, a target text having an association relationship with the object to be identified from the multiple texts to be classified, where the object to be identified has corresponding attribute information;
a determining module 303, configured to determine, based on the target text, a keyword set of the object to be recognized;
a generating module 304, configured to generate a feature word set corresponding to attribute information based on a keyword set of a plurality of objects to be identified with the same attribute information;
a classification module 305, configured to classify the multiple texts to be classified according to the feature word set.
In this embodiment, the apparatus may further include the following modules:
and the preprocessing module is used for converting the plurality of texts to be classified into a plain text format and unifying the character formats in the converted texts.
In this embodiment, the identification module 302 may specifically include the following sub-modules:
the first text recognition submodule is used for recognizing a plurality of first texts containing the object names of the objects to be recognized from the plurality of texts to be classified;
the number counting submodule is used for counting the number of times of the object name of the object to be identified appearing in each first text;
the position determining submodule is used for determining the position of the object name of the object to be recognized in each first text;
the correlation coefficient calculation submodule is used for calculating the correlation coefficient of each first text according to the occurrence times and the occurrence positions of the object names of the objects to be identified, and preset times weight values and position weight values;
and the target text identification submodule is used for identifying the first text of which the correlation coefficient is greater than a preset correlation threshold as the target text having an association relationship with the object to be identified.
In this embodiment of the present application, the determining module 303 may specifically include the following sub-modules:
the word segmentation sub-module is used for segmenting the target text and identifying the part-of-speech information of each word after word segmentation;
the word updating submodule is used for deleting the words hitting any stop word in the stop word list according to a preset stop word list; deleting the words of which the part-of-speech information does not belong to the noun according to the part-of-speech information of each word;
the word co-occurrence matrix construction submodule is used for constructing a word co-occurrence matrix based on the rest words, deleting the words of which the word frequency is smaller than a preset word frequency threshold value in the word co-occurrence matrix, and obtaining a target word co-occurrence matrix;
and the keyword set generation submodule is used for extracting a plurality of keywords in the target word co-occurrence matrix to form a keyword set of the object to be identified.
In this embodiment of the present application, the word co-occurrence matrix building submodule may specifically include the following units:
the matrix row and column data determining unit is used for determining the remaining words in all the target texts, taking the remaining words in all the target texts as row data of a word co-occurrence matrix, and taking the remaining words in each target text as column data of the word co-occurrence matrix;
and the matrix generating unit is used for marking the intersection of the row corresponding to the current target text and a target column as 1 if the target word in the target column belongs to the remaining words in the current target text, and otherwise marking the intersection as 0, thereby obtaining the word co-occurrence matrix, wherein the target column is any column of the word co-occurrence matrix.
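A minimal sketch of this 0/1 matrix, assuming one row per target text and one column per distinct remaining word, is given below. The function names are hypothetical, and interpreting "word frequency" as the column sum (the number of target texts containing the word) is an assumption made for the frequency-threshold step.

```python
def build_text_word_matrix(texts_words):
    """Build a binary matrix: rows are target texts, columns are words."""
    vocab = sorted({w for words in texts_words for w in words})
    matrix = []
    for words in texts_words:
        present = set(words)
        # 1 if the column's word is among this text's remaining words, else 0.
        matrix.append([1 if w in present else 0 for w in vocab])
    return vocab, matrix


def drop_rare_words(vocab, matrix, min_freq):
    """Remove columns whose sum is below the preset frequency threshold."""
    keep = [j for j in range(len(vocab))
            if sum(row[j] for row in matrix) >= min_freq]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]
```

For three texts with remaining words (bank, loan), (bank, fund) and (bank, loan, rate), the columns are bank, fund, loan, rate; with a threshold of 2, only the bank and loan columns survive.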
In this embodiment of the application, the generating module 304 may specifically include the following sub-modules:
the keyword summarizing submodule is used for aggregating the keyword sets of a plurality of objects to be identified that have the same attribute information, to obtain an initial feature word set;
and the feature word set generation submodule is used for clustering the initial feature word set and extracting the word with the highest word frequency in each cluster to form the feature word set corresponding to the attribute information.
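The aggregation and per-cluster selection can be sketched as follows. This passage does not name a clustering algorithm, so the sketch assumes cluster labels have already been produced by some method and are passed in as a mapping; pool_keywords, pick_feature_words and cluster_of are hypothetical names, and counting a word once per keyword set is an assumed definition of word frequency.

```python
from collections import Counter


def pool_keywords(keyword_sets):
    """Merge the keyword sets of objects sharing the same attribute information.

    Returns a Counter mapping each word to the number of sets containing it.
    """
    counts = Counter()
    for keywords in keyword_sets:
        counts.update(set(keywords))
    return counts


def pick_feature_words(word_freq, cluster_of):
    """Select the highest-frequency word from each cluster.

    cluster_of: mapping word -> cluster label, assumed to come from an
    external clustering step that is not shown here.
    """
    best = {}
    for word, freq in word_freq.items():
        label = cluster_of[word]
        if label not in best or freq > word_freq[best[label]]:
            best[label] = word
    return set(best.values())
```

Ties within a cluster are resolved by iteration order in this sketch; a real system would need an explicit tie-breaking rule.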
In this embodiment, the classification module 305 may specifically include the following sub-modules:
the index calculation submodule is used for calculating, for any feature word in the feature word set, the term frequency-inverse document frequency (TF-IDF) index of the feature word in a text to be classified;
the target feature word extraction submodule is used for extracting the target feature words whose TF-IDF index is greater than a preset index threshold;
and the text classification submodule is used for classifying the texts to be classified according to the target feature words.
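The index calculation and threshold extraction can be sketched as below. The exact TF-IDF variant is not specified in this passage; the sketch assumes term frequency normalised by text length and a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1. The function names and the threshold value are illustrative, and the final classification rule is left out because it is not detailed here.

```python
import math


def tfidf(word, doc_tokens, corpus_tokens):
    """TF-IDF of `word` in one text, using an assumed smoothed IDF."""
    tf = doc_tokens.count(word) / len(doc_tokens) if doc_tokens else 0.0
    # Document frequency: number of texts in the corpus containing the word.
    df = sum(1 for tokens in corpus_tokens if word in tokens)
    idf = math.log((1 + len(corpus_tokens)) / (1 + df)) + 1
    return tf * idf


def target_feature_words(feature_words, doc_tokens, corpus_tokens, threshold):
    """Keep the feature words whose TF-IDF in this text exceeds the threshold."""
    return {
        w for w in feature_words
        if tfidf(w, doc_tokens, corpus_tokens) > threshold
    }
```

A text could then be assigned to the attribute whose feature word set yields target feature words, although that decision rule is an assumption outside this sketch.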
Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the description of the method embodiment.
Referring to fig. 4, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 4, the terminal device 400 of the present embodiment includes: a processor 410, a memory 420, and a computer program 421 stored in the memory 420 and executable on the processor 410. The processor 410, when executing the computer program 421, implements the steps in the various embodiments of the text classification method described above, such as the steps S101 to S105 shown in fig. 1. Alternatively, the processor 410, when executing the computer program 421, implements the functions of each module/unit in the above-mentioned device embodiments, for example, the functions of the modules 301 to 305 shown in fig. 3.
Illustratively, the computer program 421 may be partitioned into one or more modules/units, which are stored in the memory 420 and executed by the processor 410 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which may be used to describe the execution process of the computer program 421 in the terminal device 400. For example, the computer program 421 may be divided into an obtaining module, an identifying module, a determining module, a generating module and a classifying module, and the specific functions of each module are as follows:
the acquisition module is used for acquiring a plurality of texts to be classified;
the identification module is used for identifying a target text which has an association relation with the object to be identified from the plurality of texts to be classified according to the object name of the object to be identified, wherein the object to be identified has corresponding attribute information;
the determining module is used for determining a keyword set of the object to be identified based on the target text;
the generating module is used for generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
and the classification module is used for classifying the plurality of texts to be classified according to the feature word set.
The terminal device 400 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The terminal device 400 may include, but is not limited to, a processor 410 and a memory 420. Those skilled in the art will appreciate that fig. 4 is only one example of the terminal device 400 and does not constitute a limitation on it; the terminal device 400 may include more or fewer components than those shown, may combine certain components, or may use different components. For example, the terminal device 400 may also include input and output devices, network access devices, buses, and the like.
The processor 410 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 420 may be an internal storage unit of the terminal device 400, such as a hard disk or a memory of the terminal device 400. The memory 420 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on, provided on the terminal device 400. Further, the memory 420 may also include both an internal storage unit and an external storage device of the terminal device 400. The memory 420 is used for storing the computer program 421 and other programs and data required by the terminal device 400. The memory 420 may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of text classification, comprising:
acquiring a plurality of texts to be classified;
according to the object name of the object to be identified, identifying a target text having an association relation with the object to be identified from the plurality of texts to be classified, wherein the object to be identified has corresponding attribute information;
determining a keyword set of the object to be identified based on the target text;
generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
and classifying the plurality of texts to be classified according to the feature word set.
2. The method of claim 1, further comprising, after the acquiring of the plurality of texts to be classified:
and converting the plurality of texts to be classified into plain text formats, and unifying the character formats in the converted texts.
3. The method according to claim 1, wherein the identifying a target text having an association relation with an object to be identified from the plurality of texts to be classified according to the object name of the object to be identified comprises:
identifying a plurality of first texts containing object names of the objects to be identified from the plurality of texts to be classified;
counting the number of times the object name of the object to be identified appears in each first text; determining the position at which the object name of the object to be identified appears in each first text;
calculating a correlation coefficient of each first text according to the number of occurrences and the positions of occurrence of the object name of the object to be identified, together with a preset count weight value and a preset position weight value;
and identifying the first text with the correlation coefficient larger than a preset correlation threshold value as a target text having an association relation with the object to be identified.
4. The method of claim 1, wherein determining the keyword set of the object to be identified based on the target text comprises:
performing word segmentation on the target text, and identifying the part-of-speech information of each word after word segmentation;
deleting, according to a preset stop word list, the words that match any stop word in the list; deleting, according to the part-of-speech information of each word, the words that are not nouns;
constructing a word co-occurrence matrix based on the remaining words, and deleting from the word co-occurrence matrix the words whose word frequency is smaller than a preset word frequency threshold, to obtain a target word co-occurrence matrix;
and extracting a plurality of keywords in the target word co-occurrence matrix to form a keyword set of the object to be identified.
5. The method of claim 4, wherein constructing a word co-occurrence matrix based on the remaining words comprises:
determining the remaining words in all the target texts, taking the remaining words in all the target texts as row data of a word co-occurrence matrix, and taking the remaining words in each target text as column data of the word co-occurrence matrix;
and if the target word in a target column belongs to the remaining words in the current target text, marking the intersection of the row corresponding to the current target text and the target column as 1, and otherwise marking the intersection as 0, thereby obtaining the word co-occurrence matrix, wherein the target column is any column of the word co-occurrence matrix.
6. The method according to claim 1, wherein the generating a feature word set corresponding to the attribute information based on the keyword sets of the plurality of objects to be recognized having the same attribute information comprises:
aggregating the keyword sets of a plurality of objects to be identified that have the same attribute information, to obtain an initial feature word set;
and clustering the initial feature word set, and extracting the word with the highest word frequency in each cluster to form the feature word set corresponding to the attribute information.
7. The method according to claim 1, wherein the classifying the plurality of texts to be classified according to the feature word set comprises:
for any feature word in the feature word set, calculating the term frequency-inverse document frequency (TF-IDF) index of the feature word in a text to be classified;
extracting the target feature words whose TF-IDF index is greater than a preset index threshold;
and classifying the texts to be classified according to the target feature words.
8. A text classification apparatus, comprising:
the acquisition module is used for acquiring a plurality of texts to be classified;
the identification module is used for identifying a target text which has an association relation with the object to be identified from the plurality of texts to be classified according to the object name of the object to be identified, wherein the object to be identified has corresponding attribute information;
the determining module is used for determining a keyword set of the object to be identified based on the target text;
the generating module is used for generating a feature word set corresponding to the attribute information based on the keyword sets of a plurality of objects to be identified with the same attribute information;
and the classification module is used for classifying the plurality of texts to be classified according to the feature word set.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text classification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the text classification method according to any one of claims 1 to 7.
CN201911045874.8A 2019-10-30 2019-10-30 Text classification method and device, terminal equipment and storage medium Active CN110851598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911045874.8A CN110851598B (en) 2019-10-30 2019-10-30 Text classification method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110851598A true CN110851598A (en) 2020-02-28
CN110851598B CN110851598B (en) 2023-04-07

Family

ID=69599400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911045874.8A Active CN110851598B (en) 2019-10-30 2019-10-30 Text classification method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110851598B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095972A (en) * 2016-06-17 2016-11-09 联动优势科技有限公司 A kind of information classification approach and device
CN107766371A (en) * 2016-08-19 2018-03-06 中兴通讯股份有限公司 A kind of text message sorting technique and its device
US20180121444A1 (en) * 2016-11-03 2018-05-03 International Business Machines Corporation Unsupervised information extraction dictionary creation
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extracting method, file classification method and device
CN106909694A (en) * 2017-03-13 2017-06-30 杭州普玄科技有限公司 Tag along sort data capture method and device
CN108228869A (en) * 2018-01-15 2018-06-29 北京奇艺世纪科技有限公司 The method for building up and device of a kind of textual classification model
CN110209808A (en) * 2018-08-08 2019-09-06 腾讯科技(深圳)有限公司 A kind of event generation method and relevant apparatus based on text information
CN109408804A (en) * 2018-09-03 2019-03-01 平安科技(深圳)有限公司 The analysis of public opinion method, system, equipment and storage medium
CN109543029A (en) * 2018-09-27 2019-03-29 平安科技(深圳)有限公司 File classification method, device, medium and equipment based on convolutional neural networks
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
CN109657137A (en) * 2018-11-26 2019-04-19 平安科技(深圳)有限公司 Public sentiment news category model building method, device, computer equipment and storage medium
CN109977226A (en) * 2019-03-14 2019-07-05 南京邮电大学 High-precision file classification method and system based on convolutional neural networks
CN110188344A (en) * 2019-04-23 2019-08-30 浙江工业大学 A kind of keyword extracting method of multiple features fusion
CN110134792A (en) * 2019-05-22 2019-08-16 北京金山数字娱乐科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN110390044A (en) * 2019-06-11 2019-10-29 平安科技(深圳)有限公司 A kind of searching method and equipment of the similar network page
CN110377744A (en) * 2019-07-26 2019-10-25 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of public sentiment classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高慧颖 等: "基于改进LDA的在线医疗评论主题挖掘", 《北京理工大学学报》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695353A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Method, device and equipment for identifying timeliness text and storage medium
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium
CN112668321A (en) * 2020-12-29 2021-04-16 竹间智能科技(上海)有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN112668321B (en) * 2020-12-29 2023-11-07 竹间智能科技(上海)有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN115114913A (en) * 2021-03-18 2022-09-27 马上消费金融股份有限公司 Labeling method, device, equipment and readable storage medium
CN115114913B (en) * 2021-03-18 2024-02-06 马上消费金融股份有限公司 Labeling method, labeling device, labeling equipment and readable storage medium
CN112989761A (en) * 2021-05-20 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and device
CN113627182A (en) * 2021-08-10 2021-11-09 深圳平安智汇企业信息管理有限公司 Data matching method and device, computer equipment and storage medium
CN113918708A (en) * 2021-12-15 2022-01-11 深圳市迪博企业风险管理技术有限公司 Abstract extraction method
CN114997338A (en) * 2022-07-19 2022-09-02 成都数之联科技股份有限公司 Project classification and classification model training method, device, medium and equipment

Also Published As

Publication number Publication date
CN110851598B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant