CN103631963A - Keyword optimization processing method and device based on big data - Google Patents

Keyword optimization processing method and device based on big data

Info

Publication number
CN103631963A
Authority
CN
China
Prior art keywords
text message
character
keyword
character string
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310696077.2A
Other languages
Chinese (zh)
Other versions
CN103631963B (en)
Inventor
裴向宇
田传钊
王汉生
李红波
常莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Learned Cube Of Beijing Science And Technology Ltd
Original Assignee
Learned Cube Of Beijing Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Learned Cube Of Beijing Science And Technology Ltd filed Critical Learned Cube Of Beijing Science And Technology Ltd
Priority to CN201310696077.2A priority Critical patent/CN103631963B/en
Publication of CN103631963A publication Critical patent/CN103631963A/en
Application granted granted Critical
Publication of CN103631963B publication Critical patent/CN103631963B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The invention discloses a keyword optimization processing method and a keyword optimization processing device based on big data. The method comprises the steps of arranging all pieces of text information to be processed in sequence, and splitting the text information into single words; removing the single words with set frequency according to the frequency of each single word, and combining the residual single words to form character strings; extracting core keywords from all the combined character strings. According to the keyword optimization processing method and the keyword optimization processing device based on big data, the problems of low correctness of determined keywords and high cost of determination of the keywords are solved; the technical effects of improving the correctness of the determined keywords and reducing the cost of determination of the keywords can be achieved.

Description

A keyword optimization processing method and device based on big data
Technical field
Embodiments of the present invention relate to computer data processing, and in particular to a keyword optimization processing method and device based on big data.
Background
Paid search advertising is the most important form of advertising on the internet today. If the total online advertising budget of all enterprises is taken as 100%, paid search advertising accounts for more than 50% of it. In China, the main placement platforms include Baidu's promotion platform.
Paid search advertising works as follows. The advertiser determines the keywords to place, together with the ad creative and the linked advertising page for each keyword, and buys those keywords from a paid search advertising provider. When a user enters a search query, the query is matched against the keywords, and the corresponding ad creative and advertising page link are found and shown for the user to view and click. The search engine system records data such as impressions and clicks, which are used for billing according to set rules.
Under this mechanism, a successful paid search advertisement requires the advertiser to complete the following important steps:
First, choose the right keywords. For example, an agency selling air tickets should buy keywords that match its business, such as "air ticket" and "e-ticket"; a keyword such as "baby milk", which has nothing to do with its industry, is unsuitable. Second, write concise and attractive ad creative for the purchased keywords, to attract customers' attention, raise the click-through rate, and in turn raise the keyword quality score. Third, set a reasonable bid, match type and so on for each keyword.
Choosing the right keywords is especially important, and the keywords placed can be revised and extended continually. In the prior art, keywords are updated and added manually, based on experience. This relies mainly on staff who know both the industry and paid advertising well, or on experienced consultants, who pick the industry's core keywords and expand them into further keywords, manually filter and group the expansion results, put them online, and then screen the keywords further by their performance. Specifically, a typical optimization process is as follows: first, the consultant selects core keywords based on personal experience and business knowledge and expands them; then, again relying on business knowledge, the consultant manually filters the expansion results, deleting keywords judged irrelevant; finally, the keywords are grouped and put online, and any keyword that generates a large amount of wasted spend is deleted.
However, this manual way of processing keywords has the following shortcomings:
First, because the method relies mainly on subjective human judgment, different consultants can easily disagree about the industry's core keywords and about how the expansion results should be filtered and grouped. The quality of the promotion is therefore severely limited by the consultant's professional skill and understanding of the industry; a consultant who does not understand the industry well can easily cause a large amount of wasted spend.
Second, selecting core keywords and filtering and grouping keywords by their meaning gives fairly accurate results, because it rests on genuine semantic analysis. But it consumes a great deal of time:
(1) The consultant has to extract the industry's core keywords from the account's existing keywords, relying on experience and an understanding of the industry, which takes a lot of the consultant's time.
(2) Expanding the core keywords generally produces many results, and analyzing, filtering and grouping them one by one takes a great deal of the consultant's valuable time.
(3) The promotion account of a large enterprise may contain on the order of 100,000 or 1,000,000 keywords. Once an account exceeds a certain size, selecting its core keywords is beyond what can be done by hand; and once the number of keywords to be added exceeds a certain amount, filtering and grouping them manually also becomes impractical.
Summary of the invention
Embodiments of the present invention provide a keyword optimization processing method and device based on big data, to improve the accuracy of the keywords determined and to reduce the cost of determining them.
In one aspect, an embodiment of the present invention provides a keyword optimization processing method based on big data, comprising:
arranging all pieces of text information to be processed in sequence, and splitting them into single characters;
removing the single characters of a set frequency according to the frequency of each single character, and merging the remaining single characters into character strings;
extracting core keywords from the merged character strings.
In another aspect, an embodiment of the present invention further provides a keyword optimization processing device based on big data, comprising:
a character splitting module, configured to arrange all pieces of text information to be processed in sequence and split them into single characters;
a string merging module, configured to remove the single characters of a set frequency according to the frequency of each single character, and merge the remaining single characters into character strings;
a keyword extraction module, configured to extract core keywords from the merged character strings.
By arranging all pieces of text information to be processed in sequence and splitting them into single characters, removing the single characters of a set frequency according to the frequency of each single character and merging the remaining single characters into character strings, and extracting core keywords from the merged character strings, the embodiments of the present invention solve the problems of low accuracy and high cost in determining keywords, and achieve the technical effect of improving the accuracy of the keywords determined while reducing the cost of determining them.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the keyword optimization processing method based on big data provided in Embodiment 1 of the present invention;
Fig. 2 is a schematic flowchart of the keyword optimization processing method based on big data provided in Embodiment 2 of the present invention;
Fig. 3 is a schematic flowchart of the keyword optimization processing method based on big data provided in Embodiment 3 of the present invention;
Fig. 4 is a schematic flowchart of the keyword optimization processing method based on big data provided in Embodiment 4 of the present invention;
Fig. 5 is a schematic structural diagram of the keyword optimization processing device based on big data provided in Embodiment 5 of the present invention;
Fig. 6 is a schematic structural diagram of the keyword optimization processing device based on big data provided in Embodiment 6 of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment 1
Fig. 1 is a schematic flowchart of the keyword optimization processing method based on big data provided in Embodiment 1 of the present invention. The method can be performed by a keyword optimization processing device based on big data and, as shown in Fig. 1, comprises the following steps:
Step S101: arrange all pieces of text information to be processed in sequence, and split them into single characters.
The text information to be processed may be a number of pieces of text drawn up when keywords are first placed, or, when keywords are added later, the keywords already placed in the account.
In this step the text information is first split into single characters to prepare for subsequent processing. The operation preferably comprises: arranging all pieces of text information to be processed in sequence, with a separator character placed between adjacent pieces; and splitting each piece of text information into single characters according to the separators.
Note that each piece of text information may be a character string containing any combination of letters, digits, Chinese characters and symbols. Specifically, a single character may be one letter, one digit, one Chinese character or one symbol.
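Step S101 can be illustrated with a minimal Python sketch. The function names, the choice of "!" as separator and the two sample entries ("Samsung mobile phone", "Samsung S4") are assumptions made for illustration; the source only requires that some separator be placed between pieces of text before splitting.

```python
SEPARATOR = "!"  # assumed separator; any character absent from the entries would do


def arrange(entries, sep=SEPARATOR):
    """Arrange the text entries in order with a separator between neighbours."""
    return sep.join(entries)


def split_characters(arranged):
    """Split the arranged text into single characters, separators included."""
    return list(arranged)


# Illustrative entries (not taken from the source): "Samsung mobile phone", "Samsung S4".
entries = ["三星手机", "三星S4"]
arranged = arrange(entries)
chars = split_characters(arranged)
print(arranged)  # 三星手机!三星S4
print(chars)
```

Keeping the separators in the character list preserves the boundaries between entries, so later steps never merge characters across two different pieces of text.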
Step S102: remove the single characters of a set frequency according to the frequency of each single character, and merge the remaining single characters into character strings.
This step removes characters by frequency. The frequency of a single character is its share of all single characters: for example, if 100 single characters occur in total and 10 of them are the same character, the frequency of that character is 10%. The step can remove characters whose frequency is too high or too low; the specific frequency setting can be chosen as required or from experience. The remaining characters are merged into strings according to a rule, which screens out characters that are too rare or redundant. In step S102, removing the single characters of a set frequency according to the frequency of each single character and merging the remaining single characters into character strings may specifically comprise:
First, the single characters of the set frequency are removed according to the frequency of each character, and each removed character is replaced with a separator. Then, among the remaining characters, each run of consecutive characters between separators is merged into one character string.
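The two sub-steps of S102 can be sketched as follows. This is only one possible reading: the removal rule used here (drop characters whose frequency is below the mean of all character frequencies) is the one Embodiment 2 uses, and the function names, the "!" separator and the sample text are assumptions for illustration. Exact fractions are used so that the comparison with the mean is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

SEP = "!"  # assumed separator character


def char_frequencies(text, sep=SEP):
    """Frequency of a character = its count / count of all non-separator characters."""
    counts = Counter(ch for ch in text if ch != sep)
    total = sum(counts.values())
    return {ch: Fraction(n, total) for ch, n in counts.items()}


def mask_below_mean(text, sep=SEP):
    """Replace every character whose frequency is below the mean frequency with sep."""
    freqs = char_frequencies(text, sep)
    if not freqs:
        return text
    mean = sum(freqs.values()) / len(freqs)
    return "".join(ch if ch == sep or freqs[ch] >= mean else sep for ch in text)


def merge_runs(masked, sep=SEP):
    """Merge each run of consecutive characters between separators into one string."""
    return [run for run in masked.split(sep) if run]


# Illustrative text: "Samsung mobile phone", "Samsung S4", "Apple mobile phone".
text = "三星手机!三星S4!苹果手机"
masked = mask_below_mean(text)
print(masked)              # 三星手机!三星!!!!!手机
print(merge_runs(masked))  # ['三星手机', '三星', '手机']
```

A different setting, such as also removing characters that are too frequent, would only change the condition inside `mask_below_mean`.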
Step S103: extract core keywords from the merged character strings.
Core keywords can be extracted according to a set rule. Because the remaining strings have already passed the character-level frequency filter, they are themselves strings that occur relatively often, and in special cases all of them can be taken as core keywords.
Preferably, extracting core keywords from the merged character strings specifically comprises:
Among the merged character strings, deleting those whose character count does not exceed a set threshold, so that only strings with more characters than the threshold are retained. The threshold can be a positive integer, for example 1, in which case strings with a character count of 1 are deleted; in effect, this deletes the strings consisting of a single character.
Among the remaining strings, extracting the one string with the highest frequency as a core keyword.
In the text information to be processed, replacing the core keyword with a separator, and repeating the above operations of splitting into single characters, merging into character strings and extracting a core keyword.
Note that it is also possible to extract several core keywords at once from the remaining strings, namely those meeting a set top-frequency criterion. With the iterative extraction described above, however, once a core keyword has been extracted the remaining text is no longer disturbed by it, and further core keywords can be extracted from what is left, which gives higher accuracy.
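The iterative extraction in step S103 can be sketched as a loop that takes the single most frequent string, blanks it out of the entries, and starts over. The sketch below is an illustration under stated assumptions, not the patented implementation: the below-mean removal rule, the stopping parameters `max_keywords` and `min_count`, the "!" separator and the sample entries are all choices made here.

```python
from collections import Counter
from fractions import Fraction

SEP = "!"  # assumed separator character


def candidate_strings(entries, min_len=2, sep=SEP):
    """Split, mask below-mean characters, merge runs, and drop one-character strings."""
    text = sep.join(entries)
    counts = Counter(ch for ch in text if ch != sep)
    if not counts:
        return []
    total = sum(counts.values())
    mean = Fraction(1, len(counts))  # mean of all character frequencies
    masked = "".join(
        ch if ch == sep or Fraction(counts[ch], total) >= mean else sep
        for ch in text
    )
    return [s for s in masked.split(sep) if len(s) >= min_len]


def extract_core_keywords(entries, max_keywords=3, min_count=2):
    """Repeatedly extract the most frequent string, then replace it with a separator."""
    entries = list(entries)
    keywords = []
    while len(keywords) < max_keywords:
        ranked = Counter(candidate_strings(entries)).most_common(1)
        if not ranked or ranked[0][1] < min_count:
            break  # nothing left that occurs often enough
        keyword = ranked[0][0]
        keywords.append(keyword)
        entries = [e.replace(keyword, SEP) for e in entries]
    return keywords


# Illustrative entries: Samsung / Apple / Huawei mobile phone, Samsung TV,
# Samsung tablet, smart mobile phone.
entries = ["三星手机", "苹果手机", "华为手机", "三星电视", "三星平板", "智能手机"]
print(extract_core_keywords(entries))  # ['手机', '三星']
```

In the first pass "手机" (mobile phone) is the most frequent string; only after it is blanked out does "三星" (Samsung) stand alone often enough to be picked, which is the effect the iterative approach is meant to achieve.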
With the keyword optimization processing method based on big data disclosed in Embodiment 1 of the present invention, the text information is filtered automatically through screening of single characters. Manual intervention is reduced or eliminated, the process completes automatically, and it is also suitable for processing massive amounts of text information. The method can improve the accuracy of determining keywords from large, complex data and reduce the cost of determining them.
Embodiment 2
Embodiment 2 of the present invention builds on the keyword optimization processing method based on big data disclosed in Embodiment 1 and provides a preferred implementation of it. As shown in Fig. 2, it comprises the following steps:
Step S201: suppose the text information consists of the following phrases. The pieces of text information are arranged in sequence, with the separator "!" placed between them, as follows:
Nokia mobile phone! Samsung mobile phone! Apple mobile phone! Nokia mobile phone how about it! Smart mobile phone! Samsung S4! Samsung S3! Samsung S2! Apple mobile phone how about it! Samsung mobile phone easy to use! Nokia mobile phone easy to use! Smart large-screen mobile phone
Step S202: split all phrases in the above text information into single characters. The split result is:
Nokia mobile phone! Samsung mobile phone! Apple mobile phone! Nokia mobile phone how about it! Smart mobile phone! Samsung S4! Samsung S3! Samsung S2! Apple mobile phone how about it! Samsung mobile phone easy to use! Nokia mobile phone easy to use! Smart large-screen mobile phone
Step S203: replace each character whose frequency is below the average with the separator "!", where the average is the mean of all single-character frequencies. The replacement result is:
! ! ! mobile phone! Samsung mobile phone! ! ! mobile phone! ! ! ! mobile phone! ! ! ! ! ! mobile phone! ! ! ! ! Samsung! ! ! Samsung! ! ! Samsung! ! ! ! ! mobile phone! ! ! ! Samsung mobile phone! ! ! ! ! ! ! mobile phone! ! ! ! ! ! ! ! mobile phone
Step S204: retain the strings with more than one character. The result is:
mobile phone! Samsung mobile phone! mobile phone! mobile phone! mobile phone! Samsung! Samsung! Samsung! mobile phone! Samsung mobile phone! mobile phone! mobile phone
Step S205: extract the string with the highest frequency of occurrence. Here the most frequent string is "mobile phone", which occurs 7 times, so the core keyword "mobile phone" is extracted.
Step S206: remove "mobile phone" from the original text information, replacing it with the separator. The result is:
Nokia! Samsung! Apple! Nokia! how about it! smart! Samsung S4! Samsung S3! Samsung S2! Apple! how about it! Samsung! easy to use! Nokia! easy to use! smart large-screen!
Step S207: repeat steps S202 to S206. The string with the highest frequency is "Samsung", which occurs 5 times, so the core keyword "Samsung" is extracted.
Step S208: remove "Samsung" from the original text information, replacing it with the separator. The result is:
Nokia! Apple! Nokia! how about it! smart! S4! S3! S2! Apple! how about it! easy to use! Nokia! easy to use! smart large-screen!
Step S209: the string with the highest frequency is now "Apple", which occurs 4 times, so the core keyword "Apple" is extracted.
These operations can be repeated until a set number of core keywords has been obtained, or until the highest frequency falls below a set threshold. In this example the extracted core keywords are: mobile phone, Samsung, Apple, Nokia.
The keyword optimization processing method based on big data provided in Embodiment 2 of the present invention can correctly extract keywords from phrases, improving the accuracy of the keywords determined and reducing the cost of determining them.
Embodiment 3
Fig. 3 is a flowchart of the keyword optimization processing method based on big data provided in Embodiment 3 of the present invention. Building on the previous embodiments, this embodiment provides an application scenario that follows core keyword extraction. In paid search advertising, the keywords placed can be updated according to advertising results: newly added text information is determined first, and keywords to place are then screened from it. In this embodiment the newly added keywords can be determined from the core keywords already placed in the account. As shown in Fig. 3, on the basis of the previous embodiments, the method further comprises the following steps after core keywords are extracted from the merged character strings:
Step S301: delete from the newly added text information the pieces that contain no core keyword.
Step S302: in each remaining piece of text information, determine the proportion of core keywords relative to non-core keywords, and delete the pieces whose proportion is below a preset value, to obtain the filtered text information.
For example, the newly added text information is as follows:
Samsung, Nokia mobile phone expensive, Samsung mobile phone OK, mobile phone number, Nokia mobile phone how about it, Samsung large-screen mobile phone.
Pieces containing no core keyword are to be deleted from the newly added text information. The core keywords determined in the previous example are mobile phone, Samsung, Apple and Nokia, and every piece here contains a core keyword. However, the proportion of core keywords in "mobile phone number" is visibly low; if it is below the preset value, that piece is filtered out. The result after filtering is: Samsung, Nokia mobile phone expensive, Samsung mobile phone OK, Nokia mobile phone how about it, Samsung large-screen mobile phone. The filtered result can serve as the basis for placing new keywords, or be placed directly as keywords.
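Steps S301 and S302 can be sketched as a two-stage filter. The source does not define the proportion precisely, so this sketch assumes it is the share of an entry's characters covered by core keywords; the function names, the 0.6 threshold and the Chinese sample entries (an illustrative rendering of the example above, plus one entry with no core keyword) are likewise assumptions.

```python
def core_coverage(entry, core_keywords):
    """Share of the entry's characters that belong to some core keyword (assumed metric)."""
    covered = [False] * len(entry)
    for kw in core_keywords:
        start = entry.find(kw)
        while start != -1:
            for i in range(start, start + len(kw)):
                covered[i] = True
            start = entry.find(kw, start + 1)
    return sum(covered) / len(entry) if entry else 0.0


def filter_new_entries(new_entries, core_keywords, min_ratio=0.6):
    """S301: drop entries with no core keyword. S302: drop entries with low coverage."""
    kept = []
    for entry in new_entries:
        if not any(kw in entry for kw in core_keywords):
            continue  # S301
        if core_coverage(entry, core_keywords) < min_ratio:
            continue  # S302
        kept.append(entry)
    return kept


core = ["手机", "三星", "苹果", "诺基亚"]  # mobile phone, Samsung, Apple, Nokia
new_entries = [
    "三星",              # Samsung
    "诺基亚手机贵",      # Nokia mobile phone expensive
    "三星手机好吗",      # Samsung mobile phone OK
    "手机号码",          # mobile phone number
    "诺基亚手机怎么样",  # Nokia mobile phone how about it
    "三星大屏手机",      # Samsung large-screen mobile phone
    "平板电脑",          # tablet computer (contains no core keyword)
]
print(filter_new_entries(new_entries, core))
```

With these assumptions, "手机号码" (mobile phone number) has a coverage of 0.5 and is removed, while "平板电脑" is removed earlier because it contains no core keyword at all.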
In the above scheme, preferably, after the filtered text information is obtained, the method further comprises:
Step S303: extract the core keywords in each piece of filtered text information and take them as the label of that piece.
Step S304: group the filtered pieces of text information by label.
Continuing the example above, the labels of the filtered text information are as follows:
Samsung---Samsung
Nokia mobile phone expensive---Nokia+mobile phone
Samsung mobile phone OK---Samsung+mobile phone
Nokia mobile phone how about it---Nokia+mobile phone
Samsung large-screen mobile phone---Samsung+mobile phone
There are three distinct labels: Samsung, Nokia+mobile phone, and Samsung+mobile phone, so the text information can be divided into three groups accordingly. Grouped keywords are easier to place group by group.
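Labelling and grouping (steps S303 and S304) amount to building a key from the core keywords found in each entry and bucketing entries by that key. In this sketch the label joins the keywords with "+" in their order of appearance, mirroring the listing above; the function names and the Chinese sample entries are illustrative assumptions.

```python
from collections import defaultdict


def label_of(entry, core_keywords):
    """Label = the core keywords found in the entry, joined in order of appearance."""
    found = [(entry.find(kw), kw) for kw in core_keywords if kw in entry]
    return "+".join(kw for _, kw in sorted(found))


def group_by_label(entries, core_keywords):
    """Group the filtered entries by their label."""
    groups = defaultdict(list)
    for entry in entries:
        groups[label_of(entry, core_keywords)].append(entry)
    return dict(groups)


core = ["手机", "三星", "苹果", "诺基亚"]  # mobile phone, Samsung, Apple, Nokia
filtered = ["三星", "诺基亚手机贵", "三星手机好吗", "诺基亚手机怎么样", "三星大屏手机"]
groups = group_by_label(filtered, core)
for label, members in groups.items():
    print(label, members)
```

The five entries fall into three groups, labelled "三星", "诺基亚+手机" and "三星+手机", matching the three labels in the example.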
The process of adding keywords can be carried out repeatedly. Once newly added keywords have been placed in the account, the next time keywords are to be added the core keywords can be extracted again from the keywords in the account, and the new keywords are then screened against those core keywords.
Embodiment 4
Fig. 4 is a flowchart of the keyword optimization processing method based on big data provided in Embodiment 4 of the present invention. Building on the previous embodiments, this embodiment provides another application scenario following core keyword extraction, in which the sensitivity between keywords and attributes can be identified. Before the pieces of text information to be processed are arranged in sequence and split into single characters, the method further comprises:
Step S401: classify the text information according to an attribute of the text information to be processed, forming at least two groups of text information to be processed.
The attribute can be set as required; it may be the technical field, region, period, person or event that the text information relates to. Classification is preferably based on the ad creative. In one example, the impressions, clicks and other data fed back by the advertising provider are used to rank the ad creatives by priority, or to classify them into better and worse creatives. The keywords corresponding to the classified creatives are then the text information to be processed for that attribute.
Step S402: arrange all pieces of text information to be processed in sequence, and split them into single characters.
Step S403: remove the single characters of a set frequency according to the frequency of each single character, and merge the remaining single characters into character strings.
Step S404: extract core keywords from the merged character strings.
Steps S402 to S404 can be carried out as in the previous embodiments, and are performed separately for each group of text information to be processed.
Step S405: compare the core keywords of the groups of text information to be processed to see whether they are the same, and take the core keywords that differ as the core keywords of the attribute corresponding to their group.
If the core keywords of the groups differ, the differing core keywords are the ones that best represent the difference between the two groups' attributes. For example, they may be the keywords to which the difference between ad creatives is most sensitive; weights can be set for these keywords to serve as a reference for placement.
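The comparison in step S405 can be sketched as a set difference across groups: a core keyword is attributed to a group when no other group shares it. The function name, the group names "better" and "worse" and the keyword lists are hypothetical values chosen for illustration; in practice each list would come from running steps S402 to S404 on one group.

```python
def attribute_keywords(core_by_group):
    """For each group, keep the core keywords that do not occur in any other group."""
    result = {}
    for name, keywords in core_by_group.items():
        others = set()
        for other_name, other_keywords in core_by_group.items():
            if other_name != name:
                others.update(other_keywords)
        result[name] = [kw for kw in keywords if kw not in others]
    return result


# Hypothetical core keywords extracted separately for two creative-quality groups.
core_by_group = {
    "better": ["mobile phone", "Samsung", "discount"],
    "worse": ["mobile phone", "Samsung", "number"],
}
print(attribute_keywords(core_by_group))
# {'better': ['discount'], 'worse': ['number']}
```

Keywords shared by every group ("mobile phone", "Samsung") drop out, leaving the ones that distinguish the attributes and to which a weight could then be assigned.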
With the keyword optimization processing method based on big data disclosed in this embodiment of the present invention, attribute-specific core keywords can be extracted automatically, at low cost and with high reliability.
Embodiment 5
Embodiment 5 of the present invention provides a keyword optimization processing device based on big data which, as shown in Fig. 5, comprises a character splitting module 51, a string merging module 52 and a keyword extraction module 53.
The character splitting module 51 is configured to arrange all pieces of text information to be processed in sequence and split them into single characters. The string merging module 52 is configured to remove the single characters of a set frequency according to the frequency of each single character, and merge the remaining single characters into character strings. The keyword extraction module 53 is configured to extract core keywords from the merged character strings.
In the above scheme, the character splitting module 51 may specifically comprise a separator setting unit 511 and a splitting unit 512. The separator setting unit 511 is configured to arrange all pieces of text information to be processed in sequence and place a separator between adjacent pieces; the splitting unit 512 is configured to split each piece of text information into single characters according to the separators.
The string merging module 52 may specifically comprise a separator replacement unit 521 and a merging unit 522. The separator replacement unit 521 is configured to remove the single characters of a set frequency according to the frequency of each single character, replacing each removed character with a separator; the merging unit 522 is configured to merge, among the remaining characters, each run of consecutive characters between separators into one character string.
The keyword extraction module 53 may specifically comprise a string deletion unit 531 and an extraction unit 532. The string deletion unit 531 is configured to delete, among the merged character strings, those whose character count is less than a set threshold; the extraction unit 532 is configured to extract, among the remaining strings, the one string with the highest frequency as a core keyword.
The device may further comprise a repetition module 533, configured to, after the one string with the highest frequency has been extracted as a core keyword, replace the core keyword with a separator in the text information to be processed and trigger repetition of the above operations of splitting into single characters, merging into character strings and extracting a core keyword.
The keyword optimization processing device based on big data disclosed in Embodiment 5 of the present invention can improve the accuracy of the keywords determined and reduce the cost of determining them.
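The module structure of Fig. 5 maps naturally onto a few small classes, one per module, with the units noted in comments. This is a structural sketch only: the class names, the `run` method, the below-mean removal rule, the minimum length of 2 and the sample entries are assumptions, and the repetition module 533 is left out for brevity.

```python
from collections import Counter
from fractions import Fraction


class CharacterSplitModule:
    """Module 51: separator setting unit 511 plus splitting unit 512."""

    def __init__(self, sep="!"):
        self.sep = sep

    def run(self, entries):
        arranged = self.sep.join(entries)  # unit 511: arrange with separators
        return list(arranged)              # unit 512: split into single characters


class StringMergeModule:
    """Module 52: separator replacement unit 521 plus merging unit 522."""

    def __init__(self, sep="!"):
        self.sep = sep

    def run(self, chars):
        counts = Counter(c for c in chars if c != self.sep)
        if not counts:
            return []
        total = sum(counts.values())
        mean = Fraction(1, len(counts))  # mean frequency (assumed removal rule)
        masked = "".join(                # unit 521: replace removed characters
            c if c == self.sep or Fraction(counts[c], total) >= mean else self.sep
            for c in chars
        )
        return [s for s in masked.split(self.sep) if s]  # unit 522: merge runs


class KeywordExtractionModule:
    """Module 53: string deletion unit 531 plus extraction unit 532."""

    def __init__(self, min_len=2):
        self.min_len = min_len

    def run(self, strings):
        kept = [s for s in strings if len(s) >= self.min_len]  # unit 531
        ranked = Counter(kept).most_common(1)                  # unit 532
        return ranked[0][0] if ranked else None


class KeywordOptimizer:
    """Chains modules 51, 52 and 53 to extract one core keyword."""

    def __init__(self, sep="!"):
        self.splitter = CharacterSplitModule(sep)
        self.merger = StringMergeModule(sep)
        self.extractor = KeywordExtractionModule()

    def extract_one(self, entries):
        return self.extractor.run(self.merger.run(self.splitter.run(entries)))


# Illustrative entries: Samsung / Apple / Huawei mobile phone, Samsung TV,
# Samsung tablet, smart mobile phone.
entries = ["三星手机", "苹果手机", "华为手机", "三星电视", "三星平板", "智能手机"]
print(KeywordOptimizer().extract_one(entries))  # 手机
```

Keeping each module behind the same `run` interface makes it straightforward to swap in a different removal rule or length threshold without touching the other modules.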
Embodiment 6
Embodiment 6 of the present invention provides a keyword optimization processing device based on big data which, as shown in Fig. 6, comprises a character splitting module 61, a string merging module 62 and a keyword extraction module 63, and further comprises: a text information deletion module 64, configured to, after core keywords are extracted from the merged character strings, delete from newly added text information the pieces that contain no core keyword;
a text information filtering module 65, configured to determine, in each remaining piece of text information, the proportion of core keywords relative to non-core keywords, and delete the pieces whose proportion is below a preset value, to obtain the filtered text information;
a label determination module 66, configured to, after the filtered text information is obtained, extract the core keywords in each piece of filtered text information and take them as the label of that piece;
a grouping module 67, configured to group the filtered pieces of text information by label.
The above device can implement the function of adding new keywords to place.
Alternatively, the device may further comprise a text information processing module, configured to, before the pieces of text information to be processed are arranged in sequence and split into single characters, classify the text information according to an attribute of the text information to be processed, forming at least two groups of text information to be processed;
and a core keyword determination module, configured to, after core keywords are extracted from the merged character strings, compare the core keywords of the groups of text information to be processed to see whether they are the same, and take the core keywords that differ as the core keywords of the attribute corresponding to their group.
The keyword optimization processing device based on big data provided in this embodiment can correctly extract keywords from newly added text information and add them to the original keyword groups.
The above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.
The technical scheme of the embodiment of the present invention, in conjunction with statistical knowledge and text mining knowledge, the individual character of selected certain frequency restores in original text information, has simplified and has considered that word forms the complexity of mechanism etc., for the realization of the method provides the foundation; In word discovery procedure, taken into account frequency and the simple positional information of text, the Exact Travelling of finding for word provides assurance; In word selection course, only get the word that the frequency of occurrences is the highest at every turn, take that circulative metabolism is continual to be chosen, the interference of uncontrollable factor is dropped to minimum, improved the accuracy that word is found.
Compared with the existing manual approach to keyword processing, the solution of the embodiments of the present invention has the following advantages and benefits:
First, the standards for extracting kernel keywords and for filtering and grouping keywords are unified, so results do not vary from person to person. The algorithm can analyze the text messages relevant to each promotion account, so the extracted kernel keywords are closely related to that account, which largely reduces the deviation caused by unfamiliarity with the promoted industry and the like. The unified filtering and grouping also greatly facilitates subsequent optimization of the promotion account;
Second, in keyword processing, the extraction, filtering and grouping of kernel keywords, which are time-consuming and laborious for people and sometimes cannot be completed manually at all, are completed automatically by the algorithm, saving consultants' valuable time.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A keyword optimization processing method based on big data, characterized by comprising:
arranging the pending text messages in sequence and splitting them into individual characters;
removing individual characters of a set frequency according to the frequency of each individual character, and merging the remaining individual characters into character strings;
extracting a kernel keyword from the merged character strings.
2. The method according to claim 1, characterized in that arranging the pending text messages in sequence and splitting them into individual characters specifically comprises:
arranging the pending text messages in sequence, with a blank character set between adjacent text messages;
splitting each text message into individual characters according to the blank characters;
and in that removing individual characters of a set frequency according to the frequency of each individual character and merging the remaining individual characters into character strings specifically comprises:
removing individual characters of a set frequency according to the frequency of each individual character, each removed individual character being replaced with a blank character;
among the remaining individual characters, merging the consecutive individual characters between blank characters into one character string.
3. The method according to claim 2, characterized in that extracting a kernel keyword from the merged character strings specifically comprises:
deleting, from the merged character strings, the character strings whose number of characters is less than a set threshold;
extracting, from the remaining character strings, the character string with the highest frequency as a kernel keyword;
and in that, after the character string with the highest frequency is extracted as a kernel keyword, the method further comprises:
replacing the kernel keyword with a blank character in each pending text message, and repeating the above operations of splitting into individual characters, merging character strings and extracting a kernel keyword.
4. The method according to any one of claims 1-3, characterized in that, after the kernel keyword is extracted from the merged character strings, the method further comprises:
deleting, from newly added text messages, the text messages that do not contain a kernel keyword;
determining, in each remaining text message, the occurrence ratio of non-kernel keywords to kernel keywords, and deleting the text messages whose ratio is lower than a preset ratio value, to obtain the filtered text messages.
5. The method according to claim 4, characterized in that, after the filtered text messages are obtained, the method further comprises:
extracting the kernel keyword in each filtered text message and determining it as the label of that text message;
grouping the filtered text messages according to their labels.
6. The method according to claim 1, characterized in that, before the pending text messages are arranged in sequence and split into individual characters, the method further comprises:
classifying the text messages according to an attribute of the pending text messages, to form at least two groups of pending text messages;
and in that, after the kernel keyword is extracted from the merged character strings, the method further comprises:
comparing whether the kernel keywords of the respective groups of pending text messages are identical, and determining the kernel keywords that differ as the kernel keywords of the attribute corresponding to that group of pending text messages.
7. A keyword optimization processing device based on big data, characterized by comprising:
a single-character splitting module, configured to arrange the pending text messages in sequence and split them into individual characters;
a character string merging module, configured to remove individual characters of a set frequency according to the frequency of each individual character, and to merge the remaining individual characters into character strings;
a keyword extraction module, configured to extract a kernel keyword from the merged character strings.
8. The device according to claim 7, characterized in that the single-character splitting module comprises:
a blank character setting unit, configured to arrange the pending text messages in sequence, with a blank character set between adjacent text messages;
a splitting unit, configured to split each text message into individual characters according to the blank characters;
the character string merging module comprises:
a blank character replacement unit, configured to remove individual characters of a set frequency according to the frequency of each individual character, each removed individual character being replaced with a blank character;
a merging unit, configured to merge, among the remaining individual characters, the consecutive individual characters between blank characters into one character string;
the keyword extraction module comprises:
a character string deletion unit, configured to delete, from the merged character strings, the character strings whose number of characters is less than a set threshold;
an extraction unit, configured to extract, from the remaining character strings, the character string with the highest frequency as a kernel keyword;
the device further comprises: a repetition module, configured to, after the character string with the highest frequency is extracted as a kernel keyword, replace the kernel keyword with a blank character in each pending text message and trigger repetition of the above operations of splitting into individual characters, merging character strings and extracting a kernel keyword;
the device further comprises: a text message deletion module, configured to, after the kernel keyword is extracted from the merged character strings, delete from newly added text messages the text messages that do not contain a kernel keyword;
a text message filtering module, configured to determine, in each remaining text message, the occurrence ratio of non-kernel keywords to kernel keywords, and to delete the text messages whose ratio is lower than a preset ratio value, to obtain the filtered text messages.
9. The device according to claim 8, characterized by further comprising:
a label determination module, configured to, after the filtered text messages are obtained, extract the kernel keyword in each filtered text message and determine it as the label of that text message;
a grouping module, configured to group the filtered text messages according to their labels.
10. The device according to claim 7, characterized by further comprising:
a text information processing module, configured to classify the text messages according to an attribute of the pending text messages, before the pending text messages are arranged in sequence and split into individual characters, to form at least two groups of pending text messages;
a kernel keyword determination module, configured to, after the kernel keyword is extracted from the merged character strings, compare whether the kernel keywords of the respective groups of pending text messages are identical, and to determine the kernel keywords that differ as the kernel keywords of the attribute corresponding to that group of pending text messages.
CN201310696077.2A 2013-12-18 2013-12-18 A kind of keyword optimized treatment method and device based on big data Expired - Fee Related CN103631963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310696077.2A CN103631963B (en) 2013-12-18 2013-12-18 A kind of keyword optimized treatment method and device based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310696077.2A CN103631963B (en) 2013-12-18 2013-12-18 A kind of keyword optimized treatment method and device based on big data

Publications (2)

Publication Number Publication Date
CN103631963A true CN103631963A (en) 2014-03-12
CN103631963B CN103631963B (en) 2017-10-17

Family

ID=50213004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310696077.2A Expired - Fee Related CN103631963B (en) 2013-12-18 2013-12-18 A kind of keyword optimized treatment method and device based on big data

Country Status (1)

Country Link
CN (1) CN103631963B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105706088A (en) * 2016-01-31 2016-06-22 深圳市博信诺达经贸咨询有限公司 Big data application method and system
CN106033403A (en) * 2015-03-20 2016-10-19 广州金山移动科技有限公司 Text transition method and device
CN104063370B (en) * 2014-07-01 2017-09-22 北京博雅立方科技有限公司 A kind of intelligent packet method and device based on keyword
CN108538300A (en) * 2018-02-27 2018-09-14 科大讯飞股份有限公司 Sound control method and device, storage medium, electronic equipment
CN109949806A (en) * 2019-03-12 2019-06-28 百度国际科技(深圳)有限公司 Information interacting method and device
CN110069676A (en) * 2017-09-28 2019-07-30 北京国双科技有限公司 Keyword recommendation method and device
CN112000794A (en) * 2020-07-30 2020-11-27 北京百度网讯科技有限公司 Text corpus screening method and device, electronic equipment and storage medium
CN113538062A (en) * 2021-07-28 2021-10-22 福州果集信息科技有限公司 Method for reversely deducing bid words purchased by commodity promotion notes

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079027A (en) * 2007-06-27 2007-11-28 腾讯科技(深圳)有限公司 Chinese character word distinguishing method and system
CN101122900A (en) * 2007-09-25 2008-02-13 中兴通讯股份有限公司 Words partition system and method
CN101477566A (en) * 2009-01-19 2009-07-08 腾讯科技(深圳)有限公司 Method and apparatus used for putting candidate key words advertisement
CN101625683A (en) * 2008-07-09 2010-01-13 精实万维软件(北京)有限公司 Method for selecting bidding advertisement keyword during release of search engine bidding advertisement
CN102156721A (en) * 2011-03-29 2011-08-17 张栋 Method for accurately delivering Internet video advertisement based on label
CN102169496A (en) * 2011-04-12 2011-08-31 清华大学 Anchor text analysis-based automatic domain term generating method
JP2012256268A (en) * 2011-06-10 2012-12-27 Ad Space Co Ltd Advertisement distribution device and advertisement distribution program
CN103092956A (en) * 2013-01-17 2013-05-08 上海交通大学 Method and system for topic keyword self-adaptive expansion on social network platform

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063370B (en) * 2014-07-01 2017-09-22 北京博雅立方科技有限公司 A kind of intelligent packet method and device based on keyword
CN106033403A (en) * 2015-03-20 2016-10-19 广州金山移动科技有限公司 Text transition method and device
CN106033403B (en) * 2015-03-20 2019-05-31 广州金山移动科技有限公司 A kind of text conversion method and device
CN105706088A (en) * 2016-01-31 2016-06-22 深圳市博信诺达经贸咨询有限公司 Big data application method and system
CN110069676A (en) * 2017-09-28 2019-07-30 北京国双科技有限公司 Keyword recommendation method and device
CN108538300A (en) * 2018-02-27 2018-09-14 科大讯飞股份有限公司 Sound control method and device, storage medium, electronic equipment
CN108538300B (en) * 2018-02-27 2021-01-29 科大讯飞股份有限公司 Voice control method and device, storage medium and electronic equipment
CN109949806A (en) * 2019-03-12 2019-06-28 百度国际科技(深圳)有限公司 Information interacting method and device
CN112000794A (en) * 2020-07-30 2020-11-27 北京百度网讯科技有限公司 Text corpus screening method and device, electronic equipment and storage medium
CN112000794B (en) * 2020-07-30 2023-08-22 北京百度网讯科技有限公司 Text corpus screening method and device, electronic equipment and storage medium
CN113538062A (en) * 2021-07-28 2021-10-22 福州果集信息科技有限公司 Method for reversely deducing bid words purchased by commodity promotion notes

Also Published As

Publication number Publication date
CN103631963B (en) 2017-10-17

Similar Documents

Publication Publication Date Title
CN103631963A (en) Keyword optimization processing method and device based on big data
US10546005B2 (en) Perspective data analysis and management
CN106682169B (en) Application label mining method and device, application searching method and server
CN103336766B (en) Short text garbage identification and modeling method and device
CN103123624B (en) Determine method and device, searching method and the device of centre word
JP6394388B2 (en) Synonym relation determination device, synonym relation determination method, and program thereof
CN104504150A (en) News public opinion monitoring system
CN103488635A (en) Method and device for acquiring product information
CN101950309A (en) Subject area-oriented method for recognizing new specialized vocabulary
CN102591475A (en) Content input method and system for online editor
CN103092943A (en) Method of advertisement dispatch and advertisement dispatch server
CN110334268B (en) Block chain project hot word generation method and device
CN103150331A (en) Method and device for providing search engine tags
CN104036004A (en) Search error correction method and search error correction device
CN105488206B (en) A kind of Android application evolution recommended method based on crowdsourcing
KR101803150B1 (en) Important precedents extraction and sorting method using Big Data
CN110990587A (en) Enterprise relation discovery method and system based on topic model
CN102591897A (en) Apparatus and method for searching document
US10042913B2 (en) Perspective data analysis and management
Khemani et al. A review on reddit news headlines with nltk tool
CN102103604B (en) Method and device for determining core weight of term
CN110727850B (en) Network information filtering method, computer readable storage medium and mobile terminal
CN102902737A (en) Automatic collecting and screening method for network images
KR102041915B1 (en) Database module using artificial intelligence, economic data providing system and method using the same
CN103678400A (en) web page classification method and device based on groupization behaviors

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171017

Termination date: 20211218