CN109344397B - Text feature word extraction method and device, storage medium and program product - Google Patents
- Publication number
- CN109344397B CN201811020415.XA
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for extracting text feature words based on a sample library composed of texts, wherein the sample library comprises sub-sample libraries of different categories. During feature engineering, keywords are extracted from the texts in each sub-sample library and used as target words to obtain a sub-target word library for each sub-sample library; the target words in each sub-target word library are sorted by word frequency; and feature words are determined from each sub-target word library according to the sorting result to obtain a feature word library. This scheme yields more effective feature words and improves the value of the feature word library.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and apparatus for extracting text feature words, a computer readable storage medium, and a computer program product.
Background
Feature engineering is an indispensable step in constructing a well-performing machine learning model; it extracts from sample data the features that are effective for the model.
Sample data of unstructured text usually consists of text streams of words, sentences, or paragraphs, and is both unstructured and noisy. Feature engineering extracts effective feature words from such data to obtain a feature word library that can be used for machine learning. The fewer feature words the library contains while still representing the features of the text, the higher its value: the processing scale of machine learning is reduced and, at the same time, its accuracy is improved.
At present, when feature words are extracted during feature engineering on unstructured text, they are screened mainly by word frequency. However, a word that occurs frequently in a text does not necessarily represent the features of that text and may be an invalid feature word, which lowers the value of the feature word library.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a method and apparatus for extracting text feature words, a computer readable storage medium, and a computer program product, which improve the value of feature word stock.
In order to achieve the above purpose, the invention has the following technical scheme:
A method for extracting text feature words based on a sample library composed of texts, the sample library comprising sub-sample libraries of different categories, the method comprising:
extracting keywords from the texts in each sub-sample library, and taking the keywords as target words to obtain a sub-target word library for each sub-sample library;
sorting the target words in each sub-target word library according to word frequency;
and determining feature words from each sub-target word library according to the sorting result, so as to obtain a feature word library.
In one possible implementation manner, after obtaining the sub-target word libraries, before sorting the target words in each sub-target word library according to word frequencies, the method further includes:
and eliminating stop words from each sub-target word library.
In one possible implementation manner, after obtaining the sub-target word libraries, before sorting the target words in each sub-target word library according to word frequencies, the method further includes:
and merging the synonyms in each sub-target word library, and deleting the synonyms shared among all the sub-target word libraries.
In one possible implementation manner, the determining, according to the ranking result, feature words from each sub-target word library to obtain a feature word library includes:
selecting, in order of word frequency from high to low in the sorting result, a preset number of target words from each sub-target word library as feature words, so as to obtain a feature word library;
wherein the preset number is determined as follows: the number of feature words selected from each sub-target word library is determined from a preset proportional relation among the numbers of feature words corresponding to the sub-target word libraries of the categories and a scale coefficient of the number of feature words.
In one possible implementation manner, in the step of extracting keywords from unstructured text in each sub-sample library, the method for determining the number of keywords includes:
when the text length of the text is smaller than a first threshold value, setting the number of keywords to be a first number value;
setting the number of keywords to a second number value when the text length of the text is greater than a second threshold value, wherein the second threshold value is greater than the first threshold value;
when the text length of the text is between the first threshold and the second threshold, the number of keywords is proportional to the text length.
An apparatus for extracting text feature words, the apparatus comprising:
the sub-target word library acquisition unit is used for extracting keywords from texts in sub-sample libraries of different categories in the sample library, and taking the keywords as target words so as to obtain sub-target word libraries of the sub-sample libraries;
the sorting unit is used for sorting the target words in each sub-target word library according to word frequency;
and the characteristic word stock obtaining unit is used for determining characteristic words from each sub-target word stock according to the sorting result so as to obtain the characteristic word stock.
In one possible implementation manner, the apparatus further includes: a rejection unit, configured to reject stop words from each sub-target word library.
In one possible implementation manner, the apparatus further includes: a synonym processing unit, configured to merge synonyms in each sub-target word library and then delete the synonyms shared among all the sub-target word libraries.
In one possible implementation manner, in the feature word stock obtaining unit, the selecting, according to the ranking result, a target word from each sub-target word stock as a feature word to obtain a feature word stock includes:
selecting, in order of word frequency from high to low in the sorting result, a preset number of target words from each sub-target word library as feature words, so as to obtain a feature word library;
wherein the preset number is determined as follows: the number of feature words selected from each sub-target word library is determined from a preset proportional relation among the numbers of feature words corresponding to the sub-target word libraries of the categories and a scale coefficient of the number of feature words.
In one possible implementation manner, in the sub-target word library obtaining unit, the method for determining the number of keywords includes:
when the text length of the text is smaller than a first threshold value, setting the number of keywords to be a first number value;
setting the number of keywords to a second number value when the text length of the text is greater than a second threshold value, wherein the second threshold value is greater than the first threshold value;
when the text length of the text is between the first threshold and the second threshold, the number of keywords is proportional to the text length.
A computer readable storage medium having instructions stored therein that, when executed on a terminal device, cause the terminal device to perform the method of extracting text feature words described above.
A computer program product which, when run on a terminal device, causes the terminal device to perform the method of extracting text feature words as described above.
Embodiments of the invention provide a method and an apparatus for extracting text feature words, a computer readable storage medium, and a computer program product, all based on a sample library composed of texts, wherein the sample library comprises sub-sample libraries of different categories. During feature engineering, keywords are extracted from the texts in each sub-sample library and used as target words to obtain a sub-target word library for each sub-sample library; the target words in each sub-target word library are sorted by word frequency; and feature words are determined from each sub-target word library according to the sorting result to obtain a feature word library. Because keyword extraction relies on both the number of occurrences of a word and the semantic relations of the text context, keywords better represent the features of the sample texts. Taking the keywords as target words therefore means that the feature words subsequently determined by word frequency also better represent the features of the text, so more effective feature words are obtained and the value of the feature word library is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for extracting text feature words according to an embodiment of the invention;
FIG. 2 is a flow chart of a method for extracting text feature words according to a second embodiment of the invention;
FIG. 3 is a schematic structural diagram of an apparatus for extracting text feature words according to an embodiment of the invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
As described in the Background, feature words of a text are currently screened mainly by word frequency. However, a word that occurs frequently in a text does not necessarily represent the features of that text and may be an invalid feature word, which lowers the value of the feature word library.
The present application therefore provides a method for extracting text feature words in which keywords are first extracted from the texts in each sub-sample library. Keyword extraction relies on both the number of occurrences of a word and the semantic relations of the text context, so keywords better represent the features of the sample texts. The keywords are taken as target words, so the feature words then determined by word frequency also better represent the features of the text; more effective feature words are obtained and the value of the feature word library is improved.
For a better understanding of the technical solutions and technical effects of the present application, the following detailed description will be made in connection with specific embodiments.
In the text feature word extraction method of the embodiment of the application, the extraction is performed based on a sample library composed of texts, wherein the sample library comprises sub-sample libraries of different categories.
The text in the sample library may be formatted or unformatted text. A text is usually composed of characters, numbers, punctuation, and various symbols, and the characters may be Chinese, English, or those of another language. Unstructured text is unstructured data in text form. The texts may come from various sources, such as web pages on websites, articles in application programs, or texts in databases.
The sample library is made up of text samples that have been labeled with categories. Samples of the same category constitute one sub-sample library, that is, each sub-sample library contains the text samples of one category. The sub-sample libraries are thus collections of samples of different categories, and extracting feature words per sub-sample library yields effective feature words that represent the texts of each category, which improves the value of the feature word library. For ease of understanding, consider a specific example in which the sample library is composed of news samples of unstructured text and has been divided into several sub-sample libraries whose categories may be sports, entertainment, finance, and so on: the samples in the sports sub-sample library are sports news, those in the entertainment sub-sample library are entertainment news, and those in the finance sub-sample library are finance news.
Referring to FIG. 1, in the method for extracting feature words based on a sample library comprising sub-sample libraries of different categories, first, in step S101, keywords are extracted from the texts in each sub-sample library and taken as target words, to obtain a sub-target word library for each sub-sample library.
For each sub-sample library, keywords may be extracted from each sample with any suitable keyword extraction method. Preferably, in this embodiment of the application, keywords are extracted with the TextRank algorithm, and the extracted keywords are taken as target words to obtain the sub-target word library of each sub-sample library. After keyword extraction, each sub-sample library thus corresponds to one sub-target word library, from which feature words are subsequently obtained. Keyword extraction derives keywords from the text based on the number of occurrences of words and the semantic relations of the text context. Target words obtained in this way better reflect the features of the text and are more strongly related to it, so more effective feature words can be obtained later and the value of the feature word library is improved.
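As a rough illustration of graph-based keyword extraction, the following is a minimal, simplified TextRank-style sketch in Python. It is not the implementation used in the embodiment: the regular-expression tokenizer, the window size, the damping factor, and the function name `textrank_keywords` are all assumptions, part-of-speech filtering is omitted, and Chinese text would first need a word segmenter.

```python
import re
from collections import defaultdict


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=30):
    """Rank words with a simplified, unweighted TextRank (illustrative only)."""
    # Assumption: lowercase ASCII words are an adequate tokenization.
    words = re.findall(r"[a-z]+", text.lower())

    # Link every pair of distinct words that occur within `window` positions.
    neighbours = defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            if v != w:
                neighbours[w].add(v)
                neighbours[v].add(w)

    # PageRank-style update on the undirected co-occurrence graph.
    score = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        score = {
            w: (1 - damping)
            + damping * sum(score[v] / len(neighbours[v]) for v in neighbours[w])
            for w in neighbours
        }

    # Highest score first; ties broken alphabetically for reproducibility.
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return ranked[:top_k]


sample = "football match football team football coach football goal"
print(textrank_keywords(sample, top_k=1))  # ['football']
```

The word that co-occurs with the most other words receives the highest score, which is how the ranking reflects context as well as raw occurrence counts.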
When extracting keywords, all keywords of a sample may be extracted, or the number of keywords to extract from a sample may be set as needed. In this embodiment of the application, the number of keywords is set in relation to the text length of the sample. Specifically, when the text length $t$ of the text is smaller than a first threshold $t_{\min}$, the number $T$ of keywords is set to a first number value $K_{\min}$; when the text length $t$ is greater than a second threshold $t_{\max}$, the number $T$ of keywords is set to a second number value $K_{\max}$, the second threshold $t_{\max}$ being greater than the first threshold $t_{\min}$; and when the text length $t$ lies between the first threshold $t_{\min}$ and the second threshold $t_{\max}$, the number $T$ of keywords is proportional to the text length $t$. The specific setting rule is as follows:
$$
T = \begin{cases} K_{\min}, & t < t_{\min} \\ \delta \cdot t, & t_{\min} \le t \le t_{\max} \\ K_{\max}, & t > t_{\max} \end{cases}
$$
where $T$ is the number of keywords, $t$ is the length of the text, and $\delta$ is the positive correlation coefficient. It will be appreciated that, in a particular application, the first threshold $t_{\min}$ may be a minimum threshold of the text length: when the text length is smaller than $t_{\min}$, the text is considered too short, and the number of keywords can be set to a fixed value $K_{\min}$, for example 0 or 1. When the text length is greater than the second threshold $t_{\max}$, the text is considered too long, and the number of keywords can be set to another fixed value $K_{\max}$, for example a value between 25 and 30. When the text length lies between the two thresholds, the number of keywords is proportional to the text length, so a longer text yields more keywords. A suitable number of keywords can thus be set, which avoids the subsequent feature word extraction becoming inefficient because too many keywords were extracted.
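The length-dependent rule described above can be sketched as a small Python function. The default thresholds, the coefficient `delta`, and the rounding to an integer are illustrative assumptions; the embodiment only suggests that the small fixed count may be 0 or 1 and the large one between 25 and 30.

```python
def keyword_count(t, t_min=50, t_max=1500, k_min=1, k_max=30, delta=0.02):
    """Number of keywords to extract from a text of length t.

    All default values are assumptions chosen for illustration.
    """
    if t_min >= t_max:
        raise ValueError("t_max must be greater than t_min")
    if t < t_min:
        return k_min            # text too short: fixed small count
    if t > t_max:
        return k_max            # text too long: fixed cap
    return round(delta * t)     # in between: proportional to the length


print(keyword_count(10))    # 1
print(keyword_count(500))   # 10
print(keyword_count(5000))  # 30
```

With these assumed defaults, `delta * t_min` equals `k_min` and `delta * t_max` equals `k_max`, so the count does not jump at either threshold; other parameter choices need not have that property.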
In step S102, the target words in each sub-target word library are respectively ranked according to word frequency.
In step S103, according to the sorting result, feature words are determined from each sub-target word library, so as to obtain a feature word library.
The word frequency is the frequency with which a target word occurs in its sub-target word library. The more often a target word occurs, the higher its weight is considered to be, and the more suitable it is as a feature word.
The target words of the sub-target word libraries may be sorted by word frequency one library after another or in parallel; parallel operation effectively increases the processing speed and thus the efficiency of selecting target words. In a specific application, the target words may be sorted from high to low or from low to high word frequency. When the sorting result is ordered from high to low, a suitable number of target words are selected from it in order as feature words; when the sorting result is ordered from low to high, a suitable number of target words are selected from it in reverse order. In either case, a certain number of target words with the highest word frequencies are selected as feature words, giving a feature word library composed of the feature words, which serves as a data set for machine learning. Determining the feature word library by word frequency is a statistical method that selects from the sub-target word library the feature words most relevant to the sample texts, which improves the value of the feature words.
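The frequency ranking and top-N selection can be sketched with the Python standard library. This sketch assumes that a sub-target word library is stored as a plain list containing one entry for every keyword extracted from every sample, so that repeated entries carry the frequency; the function names are hypothetical.

```python
from collections import Counter


def rank_by_frequency(sub_target_words):
    """Return (word, frequency) pairs ordered from high to low frequency."""
    counts = Counter(sub_target_words)
    # Ties are broken alphabetically so the result is reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def select_feature_words(sub_target_words, n):
    """Take the n most frequent target words as feature words."""
    return [word for word, _ in rank_by_frequency(sub_target_words)[:n]]


sports = ["goal", "coach", "goal", "league", "coach", "goal"]
print(rank_by_frequency(sports))        # [('goal', 3), ('coach', 2), ('league', 1)]
print(select_feature_words(sports, 2))  # ['goal', 'coach']
```

Because each sub-target word library is processed independently, the same function could be applied to the libraries in parallel.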
When feature words are selected from a sub-target word library, a certain number of target words are selected as feature words to obtain the feature word library. The number of feature words can be determined according to specific needs, for example by a word frequency threshold, a word count threshold, or another suitable method. In this embodiment of the application, the preset number of target words to be selected as feature words is determined as follows: the number of feature words selected from each sub-target word library is determined from a preset proportional relation among the numbers of feature words corresponding to the sub-target word libraries of the categories and a scale coefficient of the number of feature words.
In this method, the proportional relation among the numbers of feature words corresponding to the sub-target word libraries of the categories is determined first. The proportional relation can be set according to the specific situation. In some applications it is determined from the category attributes of the sub-target word libraries, where the category attribute indicates whether the sub-target word libraries are in a parallel relationship or a mutually different relationship with one another, and can be represented by the label attribute of each sub-target word library.
For ease of understanding, in one example the sub-sample libraries are a sports sub-sample library, an entertainment sub-sample library, and a finance sub-sample library, and the corresponding sub-target word libraries are a sports sub-target word library, an entertainment sub-target word library, and a finance sub-target word library. These are all in a parallel relationship, so the proportional relation of the target words selected from the sports, entertainment, and finance sub-target word libraries may be set to 1:1:1. In another example, the sub-sample libraries are those of general websites, gambling websites, and fraud websites, and the corresponding sub-target word libraries are those of general websites, gambling websites, and fraud websites. General websites have the attribute of normal websites, whereas fraud websites and gambling websites have the attribute of illegal websites, so the general websites are in a mutually different relationship with the fraud and gambling websites. The proportional relation among the sub-target word libraries of general websites, gambling websites, and fraud websites may then be set to 2:1:1.
After the proportional relation of the numbers of feature words corresponding to the sub-target word libraries is determined, the number of target words to select from the sub-target word library of each category is determined as the product of the scale coefficient N of the number of feature words and the corresponding term of the proportional relation. In one example, if the proportional relation of the target words selected from the three libraries is 2:1:1 and the scale coefficient N is 200, the numbers of target words selected from the three libraries are 400, 200, and 200, respectively. In a specific application, the scale coefficient N can be determined from the accuracy of the algorithm used in the subsequent machine learning, for example an SVM or a decision tree algorithm. Different machine learning algorithms give different output results for different numbers of feature words. With the method above, feature word libraries can be obtained for different values of N, and the value of N, and hence the feature dimension of the feature word library, can be chosen according to the accuracy of the output of the machine learning algorithm. The dimension of the feature word library is thus determined from the output of the machine learning algorithm, which makes it possible to select a better-performing feature word library more flexibly.
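The quota calculation is simple enough to sketch directly. In the Python sketch below, the category names come from the website example, while `choose_scale` and its `evaluate` argument are hypothetical: `evaluate` stands in for whatever procedure trains the downstream model with a given N and reports its accuracy.

```python
def feature_word_quotas(ratios, scale_n):
    """Number of feature words per category: ratio term times N."""
    return {category: ratio * scale_n for category, ratio in ratios.items()}


def choose_scale(candidates, evaluate):
    """Pick the N with the best downstream accuracy.

    `evaluate` is a hypothetical callable mapping N to an accuracy score.
    """
    return max(candidates, key=evaluate)


quotas = feature_word_quotas({"general": 2, "gambling": 1, "fraud": 1}, 200)
print(quotas)  # {'general': 400, 'gambling': 200, 'fraud': 200}
```

Separating the ratio from N keeps the balance between categories fixed while the overall dimension of the feature word library is tuned.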
The method for extracting text feature words of this embodiment has been described in detail above. Keywords are first extracted from the texts in each sub-sample library; because keyword extraction relies on both the number of occurrences of words and the semantic relations of the text context, the keywords better characterize the sample texts. With the keywords taken as target words, the feature words determined by word frequency better reflect the features of the text, so more effective feature words are obtained and the value of the feature word library is improved.
Embodiment Two
In this embodiment, before the target words in the sub-target word libraries are sorted by word frequency, each sub-target word library is subjected to reduction processing to reduce the dimension of the target word library and further improve the effectiveness and value of the feature words. The description below focuses on the parts that differ from the first embodiment; the parts that are the same are not described again.
Referring to FIG. 2, in the method for feature engineering based on a sample library comprising sub-sample libraries of different categories, first, in step S201, keywords are extracted from the unstructured text in each sub-sample library to obtain a sub-target word library for each sub-sample library.
Step S201 is the same as step S101 in the first embodiment.
In step S202, the sub-target word libraries are subjected to a reduction process.
The reduction (simplifying) processing in this step further processes the target words in each sub-target word library and may comprise one or more rejecting or merging operations. After the reduction processing, the dimension of the target word library is reduced and its target words are more relevant to the sample texts.
In some embodiments, the simplifying processing on each sub-target word library includes: and eliminating stop words from each sub-target word library.
Stop words are words that carry no practical meaning, such as "what", "who", "which", "at", and "is" in English, or the Chinese equivalents of words such as "seemingly", "likely", "who", "you", and "what". Stop words occur with a certain frequency in the sample texts and may therefore be extracted as keywords; however, they carry no practical meaning and have low relevance to the text. Target words that are stop words can be removed by matching each sub-target word library against the stop words. Rejecting stop words effectively removes redundant words, reduces the dimension of the feature word library, improves processing efficiency, and also improves the performance of the feature word library.
In a specific application, the stop words in each sub-target word library may be removed by matching against a stop word library, which may be an open-source stop word library, for example one released by a search engine company such as Baidu.
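Matching against a stop word library amounts to a set-membership filter. The sketch below uses a tiny hard-coded English list purely for illustration; in practice the set would be loaded from a stop word library file, and the case-insensitive comparison is an assumption.

```python
# Tiny illustrative list; a real stop word library would be loaded from a file.
STOP_WORDS = {"what", "who", "which", "at", "is"}


def remove_stop_words(sub_target_words, stop_words=STOP_WORDS):
    """Drop every target word that appears in the stop word library."""
    return [w for w in sub_target_words if w.lower() not in stop_words]


print(remove_stop_words(["goal", "is", "coach", "Who"]))  # ['goal', 'coach']
```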
In machine learning, the larger the difference between the feature words obtained from the categories, the better the feature words reflect the characteristics of each category and the more valuable they are for machine learning, so that recognition during machine learning is more accurate. It is therefore desirable that the feature words reflect the differences among the categories as well as possible.
Because of the richness of the language, different expression modes are usually provided for the same semantic meaning, namely, a plurality of synonyms can express the same semantic meaning, the synonyms have redundancy, and meanwhile, if all the categories have the same synonym, the difference of characteristic words among the categories can be greatly reduced, and the value of a characteristic word bank is reduced. Based on this, in order to better embody the difference between the feature words and improve the performance of the feature word stock, in a more preferred embodiment, the simplifying processing of each sub-target word stock may include: and merging the synonyms in each sub-target word library, and deleting the synonyms shared among all the sub-target word libraries. The step of synonym processing may be performed after or before the rejection of stop words for each of the sub-target word libraries.
In a specific application, depending on the business requirements and the scale of the synonyms, the synonyms in each sub-target word library can be merged using a synonym library, manual discrimination, or a combination of the two. Synonyms in a sub-target word library are merged into one fixed word; for example, two synonymous words meaning "speech" are merged into the single word "speech", and two synonymous words meaning "beautiful" are merged into the single word "beautiful". This eliminates synonym redundancy within each sub-target word library and unifies the synonyms across categories. The synonyms shared among all the sub-target word libraries are then deleted. For example, if the target word "speech" exists in every sub-target word library, it cannot reflect the difference between categories and is an invalid feature word; it is deleted from the sub-target word libraries of all categories, that is, identical synonyms are removed across the sub-target word libraries by horizontal matching. Removing the synonyms common to all sub-target word libraries makes the feature words more distinctive, reduces feature redundancy, and improves the value of the feature word library.
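The two synonym steps, merging within each library and then deleting what all libraries share, can be sketched as follows. The synonym table and the sample words are made up for illustration, and the function names are hypothetical; a real system would draw the mapping from a synonym library or manual review.

```python
def merge_synonyms(words, canonical):
    """Replace each word by its canonical form (unchanged if not listed)."""
    return [canonical.get(w, w) for w in words]


def drop_shared_words(libraries):
    """Delete words that occur in every sub-target word library."""
    if not libraries:
        return {}
    shared = set.intersection(*(set(words) for words in libraries.values()))
    return {
        category: [w for w in words if w not in shared]
        for category, words in libraries.items()
    }


# Illustrative synonym table: each variant maps to one fixed word.
canonical = {"lecture": "speech", "talk": "speech", "pretty": "beautiful"}
libraries = {
    "sports": ["goal", "talk", "coach"],
    "entertainment": ["film", "lecture", "pretty"],
    "finance": ["stock", "speech"],
}
merged = {c: merge_synonyms(ws, canonical) for c, ws in libraries.items()}
print(drop_shared_words(merged))
# {'sports': ['goal', 'coach'], 'entertainment': ['film', 'beautiful'], 'finance': ['stock']}
```

Merging has to come first: before merging, the three libraries share no word at all, so the common concept "speech" would survive in all of them.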
Therefore, the sub-target word library after the simplification processing is obtained, redundant words in the sub-target word library are removed, the dimension of the word library is reduced, and meanwhile, the words in the sub-target word library have better performance, so that the characteristic word library with higher value can be conveniently obtained through subsequent steps.
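The synonym processing described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: each sub-target word library is assumed to be a `Counter` mapping target words to word frequencies, the synonym table `canonical` is hypothetical, and summing the frequencies of merged synonyms is an assumption that the text does not state.

```python
from collections import Counter
from typing import Dict


def merge_synonyms(library: Counter, canonical: Dict[str, str]) -> Counter:
    """Map every word to its fixed representative word.

    `canonical` is a hypothetical synonym table (word -> representative).
    Frequencies of merged synonyms are summed, which is an assumption.
    """
    merged: Counter = Counter()
    for word, freq in library.items():
        merged[canonical.get(word, word)] += freq
    return merged


def drop_shared_words(libraries: Dict[str, Counter]) -> Dict[str, Counter]:
    """Delete the words that are present in every sub-target word library."""
    if not libraries:
        return {}
    shared = set.intersection(*(set(lib) for lib in libraries.values()))
    return {
        category: Counter({w: f for w, f in lib.items() if w not in shared})
        for category, lib in libraries.items()
    }


# Hypothetical data: two categories and a two-entry synonym table.
canonical = {"lecture": "speech", "pretty": "beautiful"}
libraries = {
    "education": Counter({"speech": 2, "lecture": 1, "exam": 3}),
    "travel": Counter({"speech": 1, "hotel": 4, "pretty": 1, "beautiful": 2}),
}
merged = {c: merge_synonyms(lib, canonical) for c, lib in libraries.items()}
simplified = drop_shared_words(merged)
```

Merging is done before the shared-word deletion so that a word and its synonym in two different categories are recognised as the same shared word.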
In step S203, the target words in each sub-target word library are sorted according to word frequency.
In step S204, according to the sorting result, the feature words are selected and determined from the sub-target word libraries to obtain a feature word library.
Step S203 and step S204 are the same as step S102 and step S103 in the first embodiment, and are not described again here.
In this embodiment, the sub-target word libraries are simplified before the target words in each sub-target word library are ranked according to word frequency. This reduces the dimension of the target word libraries and improves processing efficiency, and the feature words determined according to word frequency better reflect the characteristics of the text, which further improves the effectiveness and value of the feature words.
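Steps S203 and S204 can be illustrated with a short sketch. It assumes that each sub-target word library is a `Counter` of word frequencies and that the number of feature words per category is already known; the alphabetical tie-break and the function names are assumptions made only for this example.

```python
from collections import Counter
from typing import Dict, List


def rank_by_frequency(library: Counter) -> List[str]:
    """Sort target words from high to low word frequency.

    Ties are broken alphabetically, an assumption made so that the
    result is deterministic.
    """
    return [w for w, _ in sorted(library.items(), key=lambda kv: (-kv[1], kv[0]))]


def select_feature_words(
    libraries: Dict[str, Counter], preset_number: Dict[str, int]
) -> Dict[str, List[str]]:
    """Take the first `preset_number[category]` ranked words of each library."""
    return {
        category: rank_by_frequency(lib)[: preset_number[category]]
        for category, lib in libraries.items()
    }


# Hypothetical sub-target word libraries.
libraries = {
    "sports": Counter({"match": 5, "goal": 7, "coach": 2}),
    "finance": Counter({"stock": 4, "bond": 4, "rate": 1}),
}
feature_library = select_feature_words(libraries, {"sports": 2, "finance": 2})
```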
In addition, referring to fig. 3, the application further provides a text feature word extracting device, where the device includes:
a sub-target word library obtaining unit 300, configured to extract keywords from texts in sub-sample libraries of different types in a sample library, and use the keywords as target words, so as to obtain sub-target word libraries of the sub-sample libraries;
a ranking unit 310, configured to rank the target words in each sub-target word library according to word frequencies;
and a feature word stock obtaining unit 320, configured to determine feature words from each sub-target word stock according to the ranking result, so as to obtain a feature word stock.
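One possible way to picture the three units is a single class with one method per unit. The sketch below is only illustrative: the class name, the `extract_keywords` callable, and the fixed `top_n` are hypothetical, and word frequency is assumed to be the number of times a keyword is extracted across the texts of a sub-sample library.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical keyword extractor: text in, list of keywords out.
KeywordExtractor = Callable[[str], List[str]]


class FeatureWordExtractionDevice:
    def __init__(self, extract_keywords: KeywordExtractor, top_n: int) -> None:
        self.extract_keywords = extract_keywords
        self.top_n = top_n

    def obtain_sub_target_libraries(
        self, sample_library: Dict[str, List[str]]
    ) -> Dict[str, Counter]:
        """Unit 300: collect the keywords of each sub-sample library as target words."""
        return {
            category: Counter(
                word for text in texts for word in self.extract_keywords(text)
            )
            for category, texts in sample_library.items()
        }

    def rank(self, libraries: Dict[str, Counter]) -> Dict[str, List[str]]:
        """Unit 310: rank target words by word frequency, high to low."""
        return {
            category: [
                w for w, _ in sorted(lib.items(), key=lambda kv: (-kv[1], kv[0]))
            ]
            for category, lib in libraries.items()
        }

    def obtain_feature_library(
        self, ranked: Dict[str, List[str]]
    ) -> Dict[str, List[str]]:
        """Unit 320: keep the first `top_n` ranked words of each category."""
        return {category: words[: self.top_n] for category, words in ranked.items()}

    def run(self, sample_library: Dict[str, List[str]]) -> Dict[str, List[str]]:
        libraries = self.obtain_sub_target_libraries(sample_library)
        return self.obtain_feature_library(self.rank(libraries))
```

For a quick trial, whitespace splitting can stand in for the keyword extractor, for example `FeatureWordExtractionDevice(extract_keywords=str.split, top_n=1)`.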
In some possible implementations of the present application, the device may further include:
a rejecting unit, configured to reject the stop words from each sub-target word library.
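A rejecting unit of this kind might be as small as the following sketch, which assumes a `Counter`-based sub-target word library and a hypothetical stop word list supplied by the caller.

```python
from collections import Counter
from typing import Iterable


def reject_stop_words(library: Counter, stop_words: Iterable[str]) -> Counter:
    """Return a copy of the library without any word in `stop_words`."""
    stop = set(stop_words)
    return Counter({w: f for w, f in library.items() if w not in stop})


# Hypothetical library and stop word list.
library = Counter({"the": 12, "engine": 4, "of": 9, "turbine": 3})
cleaned = reject_stop_words(library, ["the", "of", "and"])
```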
In some possible implementations of the present application, the device may further include:
a synonym processing unit, configured to merge the synonyms in each sub-target word library and then delete the synonyms shared among all the sub-target word libraries.
In some possible implementations of the present application, in the feature word stock obtaining unit 320, the determining, according to the ranking result, feature words from each sub-target word stock to obtain a feature word stock includes:
selecting a preset number of target words from each sub-target word library as feature words, in descending order of word frequency in the ranking result, so as to obtain the feature word library;
wherein the preset number of feature words is determined as follows: the number of feature words selected from each sub-target word library is determined according to the preset proportional relation among the numbers of feature words corresponding to the sub-target word libraries of the categories, and a scale coefficient for the number of feature words.
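As a worked illustration of how the preset number could be computed, the sketch below multiplies each category's share in the proportional relation by the scale coefficient, which is how the claims of this application combine the two quantities. The ratio values and the coefficient are hypothetical.

```python
from typing import Dict


def feature_word_counts(ratio: Dict[str, int], scale: int) -> Dict[str, int]:
    """Number of feature words per category = ratio term * scale coefficient."""
    if scale <= 0:
        raise ValueError("scale coefficient must be positive")
    return {category: share * scale for category, share in ratio.items()}


# Hypothetical proportional relation 2 : 3 : 5 and scale coefficient 100.
counts = feature_word_counts({"sports": 2, "finance": 3, "tech": 5}, scale=100)
```

With these values the three categories contribute 200, 300 and 500 feature words respectively; a larger scale coefficient keeps the same proportions while enlarging the feature word library.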
In some possible implementations of the present application, the algorithm for extracting the keywords is the TextRank algorithm.
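For orientation, a compact and simplified TextRank-style keyword ranking is sketched below. It is not the implementation used in the application: the input is assumed to be already segmented into tokens (with any filtering done beforehand), the graph is unweighted and undirected, and the window size, damping factor and iteration count are illustrative defaults.

```python
from typing import Dict, List, Set


def textrank_keywords(
    tokens: List[str],
    top_k: int,
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link every pair of distinct words that fall within the same window.
    neighbors: Dict[str, Set[str]] = {word: set() for word in tokens}
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Each word repeatedly receives an equal share of its neighbors' scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical token sequence in which "engine" co-occurs with every other word.
top = textrank_keywords(["engine", "fuel", "engine", "valve", "engine", "pump"], top_k=1)
```

A word that co-occurs with many different words accumulates score from all of them, which is why "engine" ranks first in the example.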
In some possible implementations of the present application, in the sub-target word library obtaining unit 300, the method for determining the number of keywords includes:
when the text length of the text is smaller than a first threshold value, setting the number of keywords to be a first number value;
setting the number of keywords to a second number value when the text length of the text is greater than a second threshold value, wherein the second threshold value is greater than the first threshold value;
when the text length of the text is between the first threshold and the second threshold, the number of keywords is proportional to the text length.
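The three cases above amount to a piecewise rule, sketched here under stated assumptions: the parameter names, the proportionality factor `ratio`, and the rounding are hypothetical, since the text only says that the number of keywords is proportional to the text length between the two thresholds.

```python
def keyword_count(
    text_length: int,
    first_threshold: int,
    second_threshold: int,
    first_number: int,
    second_number: int,
    ratio: float,
) -> int:
    """Number of keywords to extract from a text of the given length."""
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must be greater than first threshold")
    if text_length < first_threshold:
        return first_number  # short text: fixed first number value
    if text_length > second_threshold:
        return second_number  # long text: fixed second number value
    # In between: proportional to the text length (rounded, at least 1).
    return max(1, round(ratio * text_length))


# Hypothetical settings: 3 keywords below 100 characters, 30 above 1000.
short = keyword_count(50, 100, 1000, 3, 30, ratio=0.03)
middle = keyword_count(500, 100, 1000, 3, 30, ratio=0.03)
long_ = keyword_count(5000, 100, 1000, 3, 30, ratio=0.03)
```

Choosing the ratio so that it yields the first number value at the first threshold and the second number value at the second threshold keeps the rule free of jumps, although the text does not require this.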
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, and when the instructions run on the terminal equipment, the terminal equipment is caused to execute the method for extracting the text feature words.
The embodiment of the application also provides a computer program product, which causes the terminal equipment to execute the text feature word extraction method when running on the terminal equipment.
It should be noted that the embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It is further noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing is merely a preferred embodiment of the present invention, and the present invention has been disclosed in the above description of the preferred embodiment, but is not limited thereto. Any person skilled in the art can make many possible variations and modifications to the technical solution of the present invention or modifications to equivalent embodiments using the methods and technical contents disclosed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.
Claims (8)
1. A method for extracting text feature words, wherein the extraction is performed based on a sample library composed of texts, the sample library comprising sub-sample libraries of different categories, the method comprising:
extracting keywords from the texts in the sub-sample libraries, and taking the keywords as target words to obtain sub-target word libraries of the sub-sample libraries;
sorting the target words in each sub-target word library according to word frequency;
determining characteristic words from each sub-target word library according to the sorting result to obtain a characteristic word library;
wherein determining feature words from each sub-target word library according to the sorting result to obtain a feature word library comprises:
selecting a preset number of target words from each sub-target word library as characteristic words according to the sequence of word frequency from high to low in the sequencing result to obtain a characteristic word library, wherein the characteristic word library is used as a data set during machine learning;
the method for determining the preset number of the characteristic words comprises the following steps: determining the number of the characteristic words selected from each sub-target word library according to the product of the proportional relation of the number of the characteristic words corresponding to the sub-target word library of each preset category and the scale coefficient of the number of the characteristic words, wherein the scale coefficient is determined according to the accuracy of an algorithm adopted in machine learning.
2. The method of claim 1, wherein after obtaining the sub-target word libraries, before sorting the target words in each of the sub-target word libraries according to word frequency, respectively, further comprising:
and eliminating stop words from each sub-target word library.
3. The method according to claim 1 or 2, wherein after obtaining the sub-target word libraries, before sorting the target words in each of the sub-target word libraries according to word frequency, respectively, further comprising:
and merging the synonyms in each sub-target word library, and deleting the synonyms shared among all the sub-target word libraries.
4. The method according to claim 1, wherein in the step of extracting keywords from the text in each sub-sample library, the method for determining the number of keywords includes:
when the text length of the text is smaller than a first threshold value, setting the number of keywords to be a first number value;
setting the number of keywords to a second number value when the text length of the text is greater than a second threshold value, wherein the second threshold value is greater than the first threshold value;
when the text length of the text is between the first threshold and the second threshold, the number of keywords is proportional to the text length.
5. A text feature word extraction device, the device comprising:
the sub-target word library acquisition unit is used for extracting keywords from texts in sub-sample libraries of different categories in the sample library, and taking the keywords as target words so as to obtain sub-target word libraries of the sub-sample libraries;
the sorting unit is used for sorting the target words in each sub-target word library according to word frequency;
the characteristic word stock obtaining unit is used for selecting a preset number of target words from the sub-target word stock as characteristic words according to the sequence of word frequency from high to low in the sequencing result so as to obtain a characteristic word stock, wherein the characteristic word stock is used as a data set during machine learning;
the determining process of the preset number of the characteristic words comprises the following steps: determining the number of the characteristic words selected from each sub-target word library according to the product of the proportional relation of the number of the characteristic words corresponding to the sub-target word library of each preset category and the scale coefficient of the number of the characteristic words, wherein the scale coefficient is determined according to the accuracy of an algorithm adopted in machine learning.
6. The apparatus as recited in claim 5, further comprising:
and the synonym processing unit is used for merging synonyms in each sub-target word library, and then deleting the synonyms shared among all the sub-target word libraries.
7. A computer readable storage medium, characterized in that instructions are stored in the computer readable storage medium, which instructions, when run on a terminal device, cause the terminal device to perform the method for extracting text feature words according to any one of claims 1-4.
8. A computer product, characterized in that it is a computer program which, when run on a terminal device, causes the terminal device to perform the method of extracting text feature words as claimed in any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811020415.XA CN109344397B (en) | 2018-09-03 | 2018-09-03 | Text feature word extraction method and device, storage medium and program product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811020415.XA CN109344397B (en) | 2018-09-03 | 2018-09-03 | Text feature word extraction method and device, storage medium and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344397A (en) | 2019-02-15 |
CN109344397B (en) | 2023-08-08 |
Family
ID=65296843
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811020415.XA Active CN109344397B (en) | 2018-09-03 | 2018-09-03 | Text feature word extraction method and device, storage medium and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344397B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026864B (en) * | 2019-04-24 | 2024-02-20 | 广东小天才科技有限公司 | Dictation content determining method and device |
CN111340580B (en) * | 2020-02-05 | 2021-05-25 | 深圳市道旅旅游科技股份有限公司 | Method and device for determining house type, computer equipment and storage medium |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779119A (en) * | 2012-06-21 | 2012-11-14 | 盘古文化传播有限公司 | Method and device for extracting keywords |
CN103473217A (en) * | 2012-06-08 | 2013-12-25 | 华为技术有限公司 | Method and device for extracting keywords from text |
CN103559310A (en) * | 2013-11-18 | 2014-02-05 | 广东利为网络科技有限公司 | Method for extracting key word from article |
CN103631858A (en) * | 2013-10-24 | 2014-03-12 | 杭州电子科技大学 | Science and technology project similarity calculation method |
CN104239300A (en) * | 2013-06-06 | 2014-12-24 | 富士通株式会社 | Method and device for excavating semantic keywords from text |
CN104572849A (en) * | 2014-12-17 | 2015-04-29 | 西安美林数据技术股份有限公司 | Automatic standardized filing method based on text semantic mining |
CN106156204A (en) * | 2015-04-23 | 2016-11-23 | 深圳市腾讯计算机系统有限公司 | The extracting method of text label and device |
CN106557508A (en) * | 2015-09-28 | 2017-04-05 | 北京神州泰岳软件股份有限公司 | A kind of text key word extracting method and device |
CN106682964A (en) * | 2016-12-29 | 2017-05-17 | 努比亚技术有限公司 | Method and apparatus for determining application label |
CN106897424A (en) * | 2017-02-24 | 2017-06-27 | 北京时间股份有限公司 | Information labeling system and method |
CN107038173A (en) * | 2016-02-04 | 2017-08-11 | 腾讯科技(深圳)有限公司 | Application query method and apparatus, similar application detection method and device |
CN107122352A (en) * | 2017-05-18 | 2017-09-01 | 成都四方伟业软件股份有限公司 | A kind of method of the extracting keywords based on K MEANS, WORD2VEC |
CN107967299A (en) * | 2017-11-03 | 2018-04-27 | 中国农业大学 | The hot word extraction method and system of a kind of facing agricultural public sentiment |
CN108182173A (en) * | 2017-12-27 | 2018-06-19 | 福建中金在线信息科技有限公司 | A kind of method, apparatus and electronic equipment for extracting keyword |
Also Published As
Publication number | Publication date |
---|---|
CN109344397A (en) | 2019-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
Ladani et al. | Stopword identification and removal techniques on tc and ir applications: A survey | |
CN109471933B (en) | Text abstract generation method, storage medium and server | |
US7461056B2 (en) | Text mining apparatus and associated methods | |
CN108052500B (en) | Text key information extraction method and device based on semantic analysis | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
JP5710581B2 (en) | Question answering apparatus, method, and program | |
CN112347778A (en) | Keyword extraction method and device, terminal equipment and storage medium | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN109472022B (en) | New word recognition method based on machine learning and terminal equipment | |
Zin et al. | The effects of pre-processing strategies in sentiment analysis of online movie reviews | |
CN107506472B (en) | Method for classifying browsed webpages of students | |
CN110019820B (en) | Method for detecting time consistency of complaints and symptoms of current medical history in medical records | |
CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium | |
CN104484380A (en) | Personalized search method and personalized search device | |
Shawon et al. | Website classification using word based multiple n-gram models and random search oriented feature parameters | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN113468339A (en) | Label extraction method, system, electronic device and medium based on knowledge graph | |
Rathod | Extractive text summarization of Marathi news articles | |
CN109344397B (en) | Text feature word extraction method and device, storage medium and program product | |
CN110019670A (en) | A kind of text searching method and device | |
CN110728135A (en) | Text theme indexing method and device, electronic equipment and computer storage medium | |
WO2022105178A1 (en) | Keyword extraction method and related device | |
CN113806483A (en) | Data processing method and device, electronic equipment and computer program product | |
CN113407584A (en) | Label extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||