CN106557508A - Text keyword extraction method and device - Google Patents

Text keyword extraction method and device

Info

Publication number
CN106557508A
Authority
CN
China
Prior art keywords
word
text
participle
weight
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510629350.9A
Other languages
Chinese (zh)
Inventor
李国洋
王庆磊
梁德兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201510629350.9A
Publication of CN106557508A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text keyword extraction method and device. The method comprises: segmenting the sentences of a text to be processed into words and setting an initial weight for each segmented word; adjusting the initial weight according to whether each segmented word is in a keyword library, to obtain the final weight of each word; and, starting from the word with the highest weight, extracting a predetermined number of words according to their final weights as the keywords of the text to be processed. The technical solution rests on the assumption that, in a large number of short texts, there exist short texts with the same topic and the word frequency of short-text keywords is greater than 1. By building the keyword library, segmenting the text to be processed, setting word weights twice, and sorting words by weight in descending order, it improves the accuracy of keyword identification.

Description

Text keyword extraction method and device
Technical field
The present invention relates to the field of Chinese natural language processing, and in particular to a text keyword extraction method and device.
Background
Chinese text keyword extraction is a technique for extracting the core words that express the central idea of an article. As a basic text-processing technique, keyword extraction has reached a very mature stage, and different disciplines have developed different approaches. The mainstream approaches include the TF-IDF (term frequency-inverse document frequency) algorithm, the TextRank algorithm, and classification-model recognition. There is also a non-mainstream approach, part-of-speech inference, which infers whether a word is a keyword from the part-of-speech combinations in which keywords commonly appear.
The main principle of keyword extraction with the TF-IDF algorithm is to score words by combining a positive word frequency with an inverse word frequency. The positive word frequency is the number of times a word appears in the article: the larger it is, the more likely the word is a keyword and the higher its score. The inverse word frequency, as the term is used here, is based on the number of times the word appears across all articles (usually called the document frequency): the larger it is, the more likely the word appears in every article and the less likely it is a keyword, so the score is lower. In other words, the score rises with how often the word appears in the article and falls with how often it appears across all articles. The advantages of TF-IDF are a simple formula and fast computation, and it is well suited to single-topic articles; its disadvantage is that extraction results are not good enough for complex cases such as multi-topic articles and short texts.
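The scoring idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula used in the patent: the function name `tf_idf_scores`, the smoothed logarithmic IDF, and the toy corpus are all hypothetical choices.

```python
import math
from collections import Counter

def tf_idf_scores(doc_tokens, corpus_tokens):
    """Score each word of one tokenized document against a tokenized corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of corpus documents that contain the word.
        df = sum(1 for doc in corpus_tokens if word in doc)
        # Smoothed inverse document frequency (an assumed variant);
        # the +1 terms avoid division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores

# Hypothetical, already-segmented corpus of three short texts.
corpus = [["network", "fault", "network"], ["billing", "fault"], ["billing", "query"]]
scores = tf_idf_scores(corpus[0], corpus)
```

In this toy corpus, "network" scores higher than "fault" in the first text because it occurs twice there and in no other text, while "fault" also occurs elsewhere.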
The TextRank algorithm was developed from the PageRank algorithm: PageRank is an algorithm for recommending web pages, whereas TextRank is an algorithm for recommending words. TextRank extracts keywords as follows: each word votes for its adjacent words; if an adjacent word casts 5 votes for word A, word A correspondingly casts 5 votes for that word. In the end, words with a high word frequency in the article receive more votes, and words near high-frequency words also receive more votes. The results of TextRank are close to those of TF-IDF, and its strengths and weaknesses in keyword extraction are essentially the same as those of TF-IDF.
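The voting scheme can be illustrated with a small sketch. It assumes an undirected, unweighted co-occurrence graph and a PageRank-style update; the window size, damping factor, iteration count, and the function name `textrank` are illustrative assumptions, not values taken from the patent.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Rank words by iterative voting over a co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_rank = {}
        for word in neighbors:
            # Each neighbor splits its current vote evenly among its own neighbors.
            incoming = sum(rank[n] / len(neighbors[n]) for n in neighbors[word])
            new_rank[word] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical segmented sentence.
tokens = ["network", "fault", "report", "network", "outage", "network", "repair"]
ranks = textrank(tokens)
```

Because "network" recurs and therefore has the most neighbors, it collects the most votes, which matches the observation that high-frequency words and the words around them rank highest.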
The classification-model extraction algorithm differs from the TF-IDF and TextRank algorithms in that it identifies keywords with a classification model rather than relying entirely on word-frequency statistics, so the keywords it identifies differ considerably from those of TF-IDF and TextRank. It generally uses manually extracted keywords as training data and defines features such as the position, part of speech, and word frequency of keywords; typically, words at the beginning of an article and nouns are more likely to be keywords. There are many classification algorithms, and the choice of algorithm strongly affects keyword extraction. The advantages of the classification-model approach are that verbs and adjectives can be extracted as keywords and that feature selection is flexible, so different features can be set for different forms of text. Its disadvantages are that training data must be prepared manually, which is costly, and that, because the model's computation is opaque, problems arising during development are hard to resolve.
At present, work on Chinese text keyword extraction mostly uses long single-topic articles as experimental data. Such long articles look complicated, but because they have a single topic, even a simple algorithm such as TF-IDF can extract the keywords of the central topic. Chat records are short texts, but the records from a given period often belong to one topic or to a few topics, so it suffices to merge multiple short texts before extracting keywords, and the keywords of the central topic can still be extracted well.
However, when a single short text describes several actions or things, keyword extraction becomes very difficult. An effective method for extracting keywords from short texts is therefore urgently needed.
Summary of the invention
In view of the above problems, the present invention proposes a text keyword extraction method and device that overcome the above problems or at least partially solve them. The technical solution of the present invention is implemented as follows:
In one aspect, the present invention proposes a text keyword extraction method, the method comprising:
segmenting the sentences of a text to be processed into words, and setting an initial weight for each segmented word;
adjusting the initial weight according to whether each segmented word is in a keyword library, to obtain the final weight of each segmented word;
starting from the word with the highest weight, extracting a predetermined number of words according to their final weights as the keywords of the text to be processed.
Preferably, the keyword library is built in advance as follows:
extracting keywords from each historical text in the historical text data, and calculating a score for each keyword;
selecting the keywords whose score exceeds a score threshold and whose recommendation count exceeds a recommendation-count threshold, to build the keyword library of the historical texts;
identifying the part of speech of each keyword in the keyword library, removing adjectives and modal particles from the keyword library, and setting a word weight for each keyword according to its part of speech; and adding or screening corresponding business words in the keyword library according to the application field of the text to be processed, and setting word weights for the business words.
Preferably, segmenting the sentences of the text to be processed and setting an initial weight for each segmented word comprises:
calculating a score for each segmented word, and calculating the initial weight of each word from its score.
Further preferably, adjusting the initial weight according to whether each segmented word is in the pre-built keyword library, to obtain the final weight of each segmented word, comprises:
determining whether each segmented word is in the keyword library; if a segmented word is in the keyword library, changing its initial weight to the sum of its initial weight and the word weight of that word in the keyword library, and setting the changed weight as the final weight of the word; if a segmented word is not in the keyword library, leaving its initial weight unchanged and setting the initial weight as the final weight of the word.
Preferably, extracting a predetermined number of words according to their final weights, starting from the word with the highest weight, as the keywords of the text to be processed comprises:
determining the number of keywords of the text to be processed according to the number of words obtained by segmenting it; when the number of words is less than a predetermined number, taking every segmented word as a keyword of the text to be processed;
when the number of words is greater than the predetermined number, rounding the quotient of the number of words and a positive integer K, and taking the rounded value as the number of keywords of the text to be processed; and arranging the segmented words in descending order of final weight, and extracting that number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
Further preferably, when the number of words is greater than the predetermined number, the method further comprises:
determining the proportion of nouns among the keywords of the text to be processed according to the text category, and, in accordance with that proportion, extracting the keyword number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
In another aspect, the present invention also proposes a text keyword extraction device, the device comprising:
a word segmentation unit, configured to segment the sentences of a text to be processed into words and set an initial weight for each segmented word;
a final weight calculation unit, configured to adjust the initial weight according to whether each segmented word is in a pre-built keyword library, to obtain the final weight of each segmented word;
a keyword extraction unit, configured to extract, starting from the word with the highest weight, a predetermined number of words according to their final weights as the keywords of the text to be processed.
Preferably, the device further comprises a keyword library construction unit;
the keyword library construction unit is configured to extract keywords from each historical text in the historical text data and calculate a score for each keyword; select the keywords whose score exceeds a score threshold and whose recommendation count exceeds a recommendation-count threshold, to build the keyword library of the historical texts; identify the part of speech of each keyword in the keyword library, remove adjectives and modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and add corresponding business words to the keyword library according to the application field of the text to be processed, and set word weights for the business words.
Preferably, the word segmentation unit is specifically configured to calculate a score for each segmented word and calculate the initial weight of each word from its score;
the final weight calculation unit is specifically configured to determine whether each segmented word is in the keyword library; if a segmented word is in the keyword library, change its initial weight to the sum of its initial weight and the word weight of that word in the keyword library, and set the changed weight as the final weight of the word; if a segmented word is not in the keyword library, leave its initial weight unchanged and set the initial weight as the final weight of the word.
Preferably, the keyword extraction unit is specifically configured to determine the number of keywords of the text to be processed according to the number of words obtained by segmenting it; when the number of words is less than a predetermined number, take every segmented word as a keyword of the text to be processed; when the number of words is greater than the predetermined number, round the quotient of the number of words and a positive integer K, and take the rounded value as the number of keywords of the text to be processed; and arrange the segmented words in descending order of final weight, and extract that number of words, starting from the word with the highest weight, as the keywords of the text to be processed;
when the number of words is greater than the predetermined number, the keyword extraction unit is further configured to determine the proportion of nouns among the keywords of the text to be processed according to the text category, and, in accordance with that proportion, extract the keyword number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
The beneficial effects of the embodiments of the present invention are as follows. Based on the assumption that, in a large number of short texts, there exist short texts with the same topic and the word frequency of short-text keywords is greater than 1, the present invention builds a keyword library from historical texts in advance and sets a corresponding weight for each keyword in the library according to its part of speech. When extracting the keywords of a text to be processed, it segments the sentences of the text and obtains the initial weight of each segmented word; adjusts the initial weight according to whether the word is in the keyword library, to obtain the final weight of each word; and extracts a determined number of keywords of the text to be processed according to the final weights, starting from the word with the highest weight. Compared with the prior art, the present invention significantly improves the accuracy of keyword extraction.
Brief description of the drawings
Fig. 1 is a flow chart of the text keyword extraction method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the text keyword extraction device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
First, the technical terms used in the embodiments are briefly explained:
Word frequency: the number of times a word appears in an article.
Part of speech: a basic concept in semantic analysis; words can generally be divided into nouns, verbs, adjectives, adverbs, and other parts of speech.
Weight: in search, weights are generally set to express different degrees of importance for different words. The more critical a word, the higher its weight; the less important a word, the lower its weight; a word whose weight is set to zero is simply ignored.
Word segmentation: processing a sentence in natural language processing requires splitting it into words; the process of splitting a sentence into words is word segmentation.
The overall design idea of the present invention is as follows. In short texts, word-frequency information is scarce: in most short texts each word occurs only once, so keywords extracted from word-frequency information rarely give good results, and keyword extraction with a classification model is also seriously inadequate. The present invention therefore assumes that, in a large number of short texts, there exist short texts with the same topic, and that the word frequency of short-text keywords is greater than 1. On this basis, a batch of keywords is extracted from a large amount of historical text data to build a keyword library; the keywords of the text to be processed are then selected using the keyword library together with word segmentation, weight calculation, and descending sorting of the text to be processed.
Embodiment 1:
In this embodiment, the keyword library is built before text keywords are extracted, and a corresponding weight is set for each keyword in the keyword library.
Specifically, the keyword library is built in advance as follows:
extracting keywords from each historical text in the historical text data, and calculating a score for each keyword;
selecting the keywords whose score exceeds the score threshold and whose recommendation count exceeds the recommendation-count threshold, to build the keyword library of the historical texts; for example, with a scoring rule of 0 to 100 points, the score threshold is set to 80 points and the recommendation-count threshold is set to 10;
identifying the part of speech of each keyword in the keyword library, removing adjectives and/or modal particles from the keyword library, and setting a word weight for each keyword according to its part of speech; and adding or screening corresponding business words in the keyword library according to the application field of the text to be processed, and setting word weights for the business words.
In practice, when this embodiment uses the TF-IDF and/or TextRank algorithm to extract keywords from each historical text, the algorithm scores every word. Assuming the algorithm scores each word from 0 to 100, after all historical texts have been scored in this way, this embodiment selects the words scoring above 80 and adds to the keyword library those recommended more than 10 times. The FNLP tool is then used to identify the part of speech of each keyword in the keyword library, and a weight is set according to the part of speech: for example, nouns are given weight 5, verbs weight 4, and other parts of speech weight 3. This completes the construction of the keyword library, except that keywords meeting the above conditions may still include some common adjectives and modal particles, which need to be removed from the keyword library.
In addition, a batch of business nouns can be set for different application fields; such words are usually very important, and their weight can be set to 6.
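The construction steps of this embodiment can be sketched as below. This is a minimal illustration, assuming per-text word scores have already been produced by TF-IDF or TextRank and that a plain dictionary `pos_of` stands in for the part-of-speech tags the embodiment obtains from FNLP. The function name, the tag names, and the reading of "recommendation count" as the number of historical texts in which a word scores above the threshold are assumptions made for this example.

```python
from collections import Counter

POS_WEIGHTS = {"noun": 5, "verb": 4}            # example weights from the embodiment
DEFAULT_WEIGHT = 3                               # all other parts of speech
EXCLUDED_POS = {"adjective", "modal_particle"}   # removed from the library

def build_keyword_library(scored_texts, pos_of, business_words=(),
                          score_threshold=80, count_threshold=10,
                          business_weight=6):
    """Return {keyword: word weight} built from per-text word scores."""
    # Count in how many historical texts each word scores above the threshold.
    recommended = Counter()
    for scores in scored_texts:
        for word, score in scores.items():
            if score > score_threshold:
                recommended[word] += 1
    library = {}
    for word, count in recommended.items():
        pos = pos_of.get(word, "other")
        if count > count_threshold and pos not in EXCLUDED_POS:
            library[word] = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
    # Business words for the application field get the highest weight.
    for word in business_words:
        library[word] = business_weight
    return library

# Hypothetical history: the same scored short text seen 11 times.
history = [{"network": 92, "slow": 85, "restart": 88, "the": 10}] * 11
pos_of = {"network": "noun", "slow": "adjective", "restart": "verb"}
library = build_keyword_library(history, pos_of, business_words=["broadband"])
```

Here "slow" is dropped as an adjective, "the" never passes the score threshold, and "broadband" enters the library only because it was supplied as a business word.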
After the keyword library has been built, keywords can be extracted from the text to be processed. As shown in Fig. 1, which is a flow chart of the text keyword extraction method provided by an embodiment of the present invention, the method in Fig. 1 includes:
S100: segment the sentences of the text to be processed into words, and set an initial weight for each segmented word.
This step can segment the sentences of the text to be processed using the prior art, for example using the IKanalyzer word segmentation tool.
Segmenting the sentences of the text to be processed and setting an initial weight for each segmented word includes:
calculating a score for each segmented word using the TF-IDF algorithm and/or the TextRank algorithm, and calculating the initial weight of each word from its score. For example, when the algorithm scores each word from 0 to 100, the score can be converted to a value between 0 and 5.
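The text does not say how the score is converted to the 0 to 5 range. One simple possibility, shown here purely as an assumption, is a linear rescaling; the function name `initial_weight` is hypothetical.

```python
def initial_weight(score, score_max=100.0, weight_max=5.0):
    """Linearly map a score in [0, score_max] to a weight in [0, weight_max]."""
    # Clamp out-of-range scores so the weight stays within bounds.
    clamped = min(max(score, 0.0), score_max)
    return clamped * weight_max / score_max

w = initial_weight(80)
```

With this mapping, a score of 80 becomes an initial weight of 4.0, which is comparable in magnitude to the library weights of 3 to 6 used in the embodiment.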
S200: adjust the initial weight according to whether each segmented word is in the pre-built keyword library, to obtain the final weight of each segmented word.
Adjusting the initial weight according to whether each segmented word is in the keyword library, to obtain the final weight of each segmented word, includes:
determining whether each segmented word is in the keyword library; if a segmented word is in the keyword library, changing its initial weight to the sum of its initial weight and the word weight of that word in the keyword library, and setting the changed weight as the final weight of the word; if a segmented word is not in the keyword library, leaving its initial weight unchanged and setting the initial weight as the final weight of the word.
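Step S200 amounts to a dictionary lookup with a default of zero. The sketch below assumes the keyword library is represented as a mapping from keyword to word weight; the function name `final_weights` and the sample values are illustrative.

```python
def final_weights(initial, library):
    """Add the library weight to words found in the library; keep others unchanged."""
    return {word: weight + library.get(word, 0)
            for word, weight in initial.items()}

initial = {"network": 4.0, "today": 2.5}   # initial weights from S100
library = {"network": 5}                   # keyword library: word -> word weight
final = final_weights(initial, library)
```

The word "network" is in the library, so its final weight is 4.0 + 5 = 9.0, while "today" keeps its initial weight of 2.5.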
S300: starting from the word with the highest weight, extract a predetermined number of words according to their final weights as the keywords of the text to be processed.
Extracting a predetermined number of words according to their final weights, starting from the word with the highest weight, as the keywords of the text to be processed includes:
determining the number of keywords of the text to be processed according to the number of words obtained by segmenting it; when the number of words is less than the predetermined number, taking every segmented word as a keyword of the text to be processed;
when the number of words is greater than the predetermined number, rounding the quotient of the number of words and a positive integer K, and taking the rounded value as the number of keywords of the text to be processed; and arranging the segmented words in descending order of final weight, and extracting that number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
Assume that at least 4 keywords are extracted by default. When segmenting the text to be processed yields fewer than 4 words, every word obtained is taken as a keyword of the text to be processed; when it yields more than 4 words, the number of keywords of the text to be processed can be calculated with the following formula:
Number of keywords of the text to be processed = [number of words / K]
where the number of words is the number of words obtained by segmenting the text to be processed, K is a preset positive integer, and [ ] is the rounding operator.
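The counting rule and the descending-order selection can be sketched as follows. Two points are assumptions, since the text leaves them open: the rounding operator is taken to be a floor, and a text with exactly the minimum number of words is treated like one with fewer. The function names and sample weights are hypothetical.

```python
def keyword_count(word_count, k, minimum=4):
    """Number of keywords to extract from a text with `word_count` words."""
    if word_count <= minimum:
        return word_count          # short text: every word is a keyword
    return word_count // k         # [word_count / K], assuming floor rounding

def extract_keywords(final, k=3, minimum=4):
    """Take the top-weighted words, in descending order of final weight."""
    ranked = sorted(final, key=final.get, reverse=True)
    return ranked[:keyword_count(len(ranked), k, minimum)]

# Hypothetical final weights for a text segmented into 9 words.
weights = {"network": 9.0, "fault": 8.5, "repair": 7.0, "line": 6.0,
           "call": 3.0, "today": 2.0, "help": 1.5, "please": 1.0, "again": 0.5}
keywords = extract_keywords(weights, k=3)
```

With 9 words and K = 3, the formula gives 3 keywords, so the three highest-weight words are returned.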
In practice, this embodiment also sets the proportion of nouns among the keywords of the text to be processed, to improve how well the keywords express the central topic of the text.
Specifically, in step S300, when the number of words is greater than the predetermined number, the method in Fig. 1 further includes:
determining the proportion of nouns among the keywords of the text to be processed according to the text category, and, in accordance with that proportion, extracting the keyword number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
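The text does not specify how the noun proportion is enforced. One possible reading, offered only as an assumption, is to reserve a quota of keyword slots for the highest-weight nouns and fill the rest with the highest-weight non-nouns; the function name, the `pos_of` mapping, and the fallback when one group runs short are all hypothetical.

```python
def extract_with_noun_ratio(final, pos_of, count, noun_ratio):
    """Pick `count` keywords, aiming for `noun_ratio` of them to be nouns."""
    ranked = sorted(final, key=final.get, reverse=True)
    nouns = [w for w in ranked if pos_of.get(w) == "noun"]
    others = [w for w in ranked if pos_of.get(w) != "noun"]
    # Reserve slots for nouns, limited by how many nouns are available.
    noun_quota = min(len(nouns), round(count * noun_ratio))
    chosen = set(nouns[:noun_quota] + others[:count - noun_quota])
    # If either group ran short, fill remaining slots by overall weight.
    for word in ranked:
        if len(chosen) >= count:
            break
        chosen.add(word)
    # Return the chosen words in descending order of final weight.
    return [w for w in ranked if w in chosen]

final = {"network": 9, "fault": 8, "line": 7, "router": 6, "restart": 5, "check": 4}
pos_of = {"network": "noun", "fault": "noun", "line": "noun", "router": "noun",
          "restart": "verb", "check": "verb"}
picked = extract_with_noun_ratio(final, pos_of, count=4, noun_ratio=0.5)
```

With a noun proportion of 0.5 and 4 keywords, the two highest-weight nouns and the two highest-weight verbs are kept, even though two other nouns outrank the verbs by weight.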
The present invention is particularly suitable for keyword extraction from short texts.
The embodiments of the present invention are based on the premise that, in a large number of short texts, there exist short texts with the same topic and the word frequency of short-text keywords is greater than 1. A keyword library is built from historical texts in advance, and a corresponding weight is set for each keyword in the library according to its part of speech. When extracting the keywords of a text to be processed, the sentences of the text are first segmented to obtain the initial weight of each segmented word; the initial weight of each segmented word is then adjusted according to whether the word is in the keyword library, to obtain the final weight of each word; finally, the keywords of the text to be processed are obtained by sorting.
Embodiment 2:
This embodiment uses the technical solution provided in Embodiment 1 to extract the keywords of a short text and compares the result with the prior art, to show the advantages of the technical solution of the present invention more intuitively.
This embodiment takes a given short text whose keywords were obtained manually in advance and uses them as the standard for comparison.
The method provided in Embodiment 1 and the prior-art TF-IDF, TextRank, and classification-model keyword extraction methods were each used to extract keywords from the above short text, and the keywords extracted by each method were evaluated. The evaluation results are shown in Table 1:
Table 1
Keyword extraction method                        Accuracy
TF-IDF                                           50.1%
TextRank                                         49.8%
Classification model                             48.0%
Technical solution of the present invention      64.0%
The data in Table 1 show that the accuracy of the keywords extracted by the technical solution of the present invention is far higher than that of the prior-art TF-IDF, TextRank, and classification-model keyword extraction methods. Although an accuracy of 64% is not very high, it is a considerable improvement over the roughly 50% accuracy of keyword extraction in the prior art for short texts, and applying the present invention in actual projects can be expected to yield more significant results.
Embodiment 3:
Based on the same technical concept as the above text keyword extraction method, Embodiment 3 of the present invention also provides a text keyword extraction device.
Fig. 2 is a schematic structural diagram of the text keyword extraction device provided by Embodiment 3 of the present invention. The device in Fig. 2 includes:
a word segmentation unit 21, configured to segment the sentences of the text to be processed into words and set an initial weight for each segmented word.
The word segmentation unit 21 is specifically configured to calculate a score for each segmented word using the TF-IDF and/or TextRank algorithm, and to calculate the initial weight of each word from its score.
a final weight calculation unit 22, configured to adjust the initial weight according to whether each segmented word is in the pre-built keyword library, to obtain the final weight of each segmented word.
The final weight calculation unit 22 is specifically configured to determine whether each segmented word is in the keyword library; if a segmented word is in the keyword library, change its initial weight to the sum of its initial weight and the word weight of that word in the keyword library, and set the changed weight as the final weight of the word; if a segmented word is not in the keyword library, leave its initial weight unchanged and set the initial weight as the final weight of the word.
a keyword extraction unit 23, configured to extract, starting from the word with the highest weight, a predetermined number of words according to their final weights as the keywords of the text to be processed.
The keyword extraction unit 23 is specifically configured to determine the number of keywords of the text to be processed according to the number of words obtained by segmenting it; when the number of words is less than the predetermined number, take every segmented word as a keyword of the text to be processed;
when the number of words is greater than the predetermined number, round the quotient of the number of words and a positive integer K, and take the rounded value as the number of keywords of the text to be processed; and arrange the segmented words in descending order of final weight, and extract that number of words, starting from the word with the highest weight, as the keywords of the text to be processed;
when the number of words is greater than the predetermined number, the keyword extraction unit is further configured to determine the proportion of nouns among the keywords of the text to be processed according to the text category, and, in accordance with that proportion, extract the keyword number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
The device in Fig. 2 also includes a keyword library construction unit, configured to extract keywords from each historical text in the historical text data using the TF-IDF and/or TextRank algorithm and calculate a score for each keyword; select the keywords whose score exceeds the score threshold and whose recommendation count exceeds the recommendation-count threshold, to build the keyword library of the historical texts; identify the part of speech of each keyword in the keyword library, remove adjectives and/or modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and add or screen corresponding business words in the keyword library according to the application field of the text to be processed, and set word weights for the business words.
In summary, the invention discloses a text keyword extraction method and device. The invention is based on the assumption that, among a large number of short texts, there are short texts on the same subject and the keywords of those short texts have a word frequency greater than 1. A keyword library is built in advance from history texts, and a corresponding weight is set for each keyword in the library according to its part of speech. When keywords are extracted from a text to be processed, the sentences of the text are segmented into words and an initial weight is obtained for each word; the initial weight of each word is changed according to whether the word is in the keyword library, giving the final weight of each word; and a determined number of keywords of the text to be processed are extracted according to the final weights, starting from the word with the highest weight. Compared with the prior art, the invention significantly improves the accuracy of keyword extraction.
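The weight adjustment at the core of the method can be shown with a short Python sketch. The rule itself (final weight equals initial weight plus the word's library weight when the word is in the keyword library, otherwise the unchanged initial weight) follows the description; the function names and the dictionary representation are assumptions, and the initial weights are taken as given because the text does not specify how they are derived from the word scores.

```python
from typing import Dict, List


def apply_keyword_library(initial_weights: Dict[str, float],
                          library: Dict[str, float]) -> Dict[str, float]:
    """Add the library weight to words found in the library; leave others unchanged."""
    return {word: weight + library.get(word, 0.0)
            for word, weight in initial_weights.items()}


def rank_by_final_weight(final_weights: Dict[str, float]) -> List[str]:
    """Words in descending order of final weight."""
    return sorted(final_weights, key=final_weights.get, reverse=True)


initial = {"router": 0.25, "slow": 0.75, "today": 0.5}
library = {"router": 1.0, "billing": 1.0}
final = apply_keyword_library(initial, library)
ranking = rank_by_final_weight(final)
```

In this example "router" starts with the lowest initial weight but moves to the top of the ranking because it appears in the library, which is the effect the method relies on for short texts.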
The foregoing is merely a preferred embodiment of the invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the invention falls within the scope of protection of the invention.

Claims (10)

1. A text keyword extraction method, characterized in that the method comprises:
segmenting the sentences of a text to be processed into words, and setting an initial weight for each word obtained after segmentation;
changing the initial weight according to whether each word obtained after segmentation is in a pre-built keyword library, to obtain the final weight of each word after the change;
extracting, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.
2. The text keyword extraction method according to claim 1, characterized in that the keyword library is built in advance as follows:
extracting keywords from each history text in the history text data, and calculating a score for each keyword;
selecting the keywords whose score is greater than a score threshold and which exceed a recommendation count threshold, and building the keyword library of the history texts;
identifying the part of speech of each keyword in the keyword library, removing the adjectives and modal particles from the keyword library, and setting a word weight for each keyword according to its part of speech; and adding or screening corresponding business words in the keyword library according to the application of the text to be processed, and setting word weights for the business words.
3. The text keyword extraction method according to claim 1, characterized in that segmenting the sentences of the text to be processed into words and setting an initial weight for each word obtained after segmentation comprises:
calculating a score for each word obtained after segmentation, and calculating the initial weight of each word from its score.
4. The text keyword extraction method according to claim 3, characterized in that changing the initial weight according to whether each word obtained after segmentation is in the pre-built keyword library, to obtain the final weight of each word after the change, comprises:
judging whether each word obtained after segmentation is in the keyword library; if a word is in the keyword library, changing the initial weight of the word to the sum of its initial weight and the word weight of the word in the keyword library, and setting the changed initial weight as the final weight of the word; if a word is not in the keyword library, leaving the initial weight of the word unchanged and setting the initial weight as the final weight of the word.
5. The text keyword extraction method according to claim 1, characterized in that extracting, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words comprises:
determining the number of keywords of the text to be processed according to the number of words obtained after segmentation of the text to be processed; when the number of words is less than the predetermined quantity, taking each word obtained after segmentation as a keyword of the text to be processed;
when the number of words is greater than the predetermined quantity, rounding the quotient of the number of words and a positive integer K to an integer, and taking the resulting value as the number of keywords of the text to be processed; and sorting the words obtained after segmentation in descending order of final weight, and extracting that number of words, starting from the word with the highest weight, as the keywords of the text to be processed.
6. The text keyword extraction method according to claim 5, characterized in that, when the number of words is greater than the predetermined quantity, the method further comprises:
determining, according to the text category, the proportion of nouns among the keywords of the text to be processed, and extracting, according to that proportion and starting from the word with the highest weight, that number of words as the keywords of the text to be processed.
7. A text keyword extraction device, characterized in that the device comprises:
a word segmentation unit, configured to segment the sentences of a text to be processed into words and to set an initial weight for each word obtained after segmentation;
a final weight calculation unit, configured to change the initial weight according to whether each word obtained after segmentation is in a pre-built keyword library, to obtain the final weight of each word after the change;
a keyword extraction unit, configured to extract, starting from the word with the highest weight, a predetermined number of words as the keywords of the text to be processed according to the final weights of the words.
8. The text keyword extraction device according to claim 7, characterized in that the device further comprises a keyword library construction unit;
the keyword library construction unit is configured to: extract keywords from each history text in the history text data, and calculate a score for each keyword; select the keywords whose score is greater than a score threshold and which exceed a recommendation count threshold, and build the keyword library of the history texts; identify the part of speech of each keyword in the keyword library, remove the adjectives and modal particles from the keyword library, and set a word weight for each keyword according to its part of speech; and add or screen corresponding business words in the keyword library according to the application of the text to be processed, and set word weights for the business words.
9. The text keyword extraction device according to claim 7, characterized in that:
the word segmentation unit is specifically configured to calculate a score for each word obtained after segmentation, and to calculate the initial weight of each word from its score;
the final weight calculation unit is specifically configured to judge whether each word obtained after segmentation is in the keyword library; if a word is in the keyword library, to change the initial weight of the word to the sum of its initial weight and the word weight of the word in the keyword library, and to set the changed initial weight as the final weight of the word; and if a word is not in the keyword library, to leave the initial weight of the word unchanged and to set the initial weight as the final weight of the word.
10. The text keyword extraction device according to claim 7, characterized in that the keyword extraction unit is specifically configured to determine the number of keywords of the text to be processed according to the number of words obtained after segmentation of the text to be processed; when the number of words is less than the predetermined quantity, each word obtained after segmentation is taken as a keyword of the text to be processed;
when the number of words is greater than the predetermined quantity, the quotient of the number of words and a positive integer K is rounded to an integer, and the resulting value is taken as the number of keywords of the text to be processed; and the words obtained after segmentation are sorted in descending order of final weight, and that number of words is extracted, starting from the word with the highest weight, as the keywords of the text to be processed;
when the number of words is greater than the predetermined quantity, the keyword extraction unit is further configured to determine, according to the text category, the proportion of nouns among the keywords of the text to be processed, and to extract, according to that proportion and starting from the word with the highest weight, that number of words as the keywords of the text to be processed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510629350.9A CN106557508A (en) 2015-09-28 2015-09-28 A kind of text key word extracting method and device

Publications (1)

Publication Number Publication Date
CN106557508A true CN106557508A (en) 2017-04-05

Family

ID=58416684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510629350.9A Pending CN106557508A (en) 2015-09-28 2015-09-28 A kind of text key word extracting method and device

Country Status (1)

Country Link
CN (1) CN106557508A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN102033919A (en) * 2010-12-07 2011-04-27 北京新媒传信科技有限公司 Method and system for extracting text key words
CN104573054A (en) * 2015-01-21 2015-04-29 杭州朗和科技有限公司 Information pushing method and equipment
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN104731797A (en) * 2013-12-19 2015-06-24 北京新媒传信科技有限公司 Keyword extracting method and keyword extracting device

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402960A (en) * 2017-06-15 2017-11-28 成都优易数据有限公司 A kind of inverted index optimized algorithm based on the weighting of the semantic tone
CN107402960B (en) * 2017-06-15 2020-11-10 成都优易数据有限公司 Reverse index optimization algorithm based on semantic mood weighting
WO2019165678A1 (en) * 2018-03-02 2019-09-06 广东技术师范学院 Keyword extraction method for mooc
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 A kind of keyword extracting method for admiring class
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
WO2019214149A1 (en) * 2018-05-11 2019-11-14 平安科技(深圳)有限公司 Text key information identification method, electronic device, and readable storage medium
CN108776657A (en) * 2018-06-13 2018-11-09 湖南正宇软件技术开发有限公司 CPPCC's motion focus extraction method
CN110705279A (en) * 2018-07-10 2020-01-17 株式会社理光 Vocabulary selection method and device and computer readable storage medium
CN109344397A (en) * 2018-09-03 2019-02-15 东软集团股份有限公司 The extracting method and device of text feature word, storage medium and program product
CN109344397B (en) * 2018-09-03 2023-08-08 东软集团股份有限公司 Text feature word extraction method and device, storage medium and program product
WO2020107864A1 (en) * 2018-11-30 2020-06-04 华为技术有限公司 Information processing method, device, service equipment and computer readable storage medium
CN109819128A (en) * 2019-01-23 2019-05-28 平安科技(深圳)有限公司 A kind of quality detecting method and device of telephonograph
CN111046169A (en) * 2019-12-24 2020-04-21 东软集团股份有限公司 Method, device and equipment for extracting subject term and storage medium
CN111046169B (en) * 2019-12-24 2024-03-26 东软集团股份有限公司 Method, device, equipment and storage medium for extracting subject term
CN112101005A (en) * 2020-04-02 2020-12-18 上海迷因网络科技有限公司 Method for generating and dynamically adjusting quick expressive force test questions
CN112101005B (en) * 2020-04-02 2022-08-30 上海迷因网络科技有限公司 Method for generating and dynamically adjusting quick expressive force test questions
CN112101017A (en) * 2020-04-02 2020-12-18 上海迷因网络科技有限公司 Method for generating questions for rapid expressive force test
CN112101017B (en) * 2020-04-02 2022-09-06 上海迷因网络科技有限公司 Method for generating questions for rapid expressive force test
CN111797214A (en) * 2020-06-24 2020-10-20 深圳壹账通智能科技有限公司 FAQ database-based problem screening method and device, computer equipment and medium
CN113010648A (en) * 2021-04-15 2021-06-22 联仁健康医疗大数据科技股份有限公司 Content search method, content search device, electronic equipment and storage medium
CN113360782A (en) * 2021-06-07 2021-09-07 武汉理工大学 Internet data-oriented food safety risk identification method and system
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN113641801A (en) * 2021-10-19 2021-11-12 成都中航信虹科技股份有限公司 Control method and system of voice scheduling system and electronic equipment
CN114722162B (en) * 2022-06-10 2022-08-26 南京英诺森软件科技有限公司 Feature type determination method and device, electronic equipment and storage medium
CN114722162A (en) * 2022-06-10 2022-07-08 南京英诺森软件科技有限公司 Feature type determining method and device, electronic equipment and storage medium
CN116936135A (en) * 2023-09-19 2023-10-24 北京珺安惠尔健康科技有限公司 Medical big health data acquisition and analysis method based on NLP technology
CN116936135B (en) * 2023-09-19 2023-11-24 北京珺安惠尔健康科技有限公司 Medical big health data acquisition and analysis method based on NLP technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170405