CN113032683B - Method for quickly segmenting words in network popularization - Google Patents

Method for quickly segmenting words in network popularization

Info

Publication number
CN113032683B
Authority
CN
China
Prior art keywords
words
word segmentation
keywords
effective
roots
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110469657.2A
Other languages
Chinese (zh)
Other versions
CN113032683A (en)
Inventor
李勤义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Maize Society Shenzhen Network Technology Co ltd
Original Assignee
Maize Society Shenzhen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Maize Society Shenzhen Network Technology Co ltd
Priority to CN202110469657.2A
Publication of CN113032683A
Application granted
Publication of CN113032683B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a method for quickly segmenting words in network promotion. A user inputs a keyword; based on that keyword, a word segmentation system automatically mines all long-tail words containing it and stores them as a txt file. The system then reads all long-tail words from the txt file, segments them, breaks them up, extracts the keywords that occur most frequently, summarizes the high-frequency roots, and returns them to the user. The user retains the effective words according to the high-frequency roots, and effective roots are screened out from the retained effective words. The system then performs word segmentation according to the screened effective roots and exports an xls word segmentation table. When keywords of the same type are exported, the system automatically groups them by character length, which supports better-targeted promotion, and the segmentation result is exported to a local xls file with one click. This addresses slow word segmentation and missed keywords, shortens the time enterprises spend screening effective words out of a large number of keywords and classifying and integrating them, and improves working efficiency and results.

Description

Method for quickly segmenting words in network popularization
Technical Field
The invention relates to the technical field of computers, in particular to a method for quickly segmenting words in network popularization.
Background
As more enterprises move to internet marketing, the methods they use in network promotion and the keywords they buy in paid promotion need to be more accurate and effective. How to screen effective keywords out of tens of thousands, hundreds of thousands, or millions of keywords is the first problem an enterprise must consider in network promotion, and how to classify and combine the keywords once they are screened is a further pain point. If effective words cannot be screened out and grouped according to the attributes of different words, the enterprise wastes a great deal during promotion.
At present, word segmentation is basically done manually. Traditional manual word segmentation first has to find which common words, such as manufacturer words, price words, model words, and scene words, occur among all the long-tail words. This requires considerable familiarity with each industry in order to know which roots exist in the long-tail words to be segmented. The approach is tedious, time-consuming, and labor-intensive, and it easily misses keywords, so a more convenient method that speeds up word segmentation is needed.
Traditional manual word segmentation classifies word by word. If a core keyword has hundreds of thousands of long-tail words, searching and classifying them one by one consumes a great deal of time, and keywords are easily missed. For example, to group the manufacturer words, every word containing "manufacturer" has to be found one by one among the hundreds of thousands of long-tail words and then grouped together, so each category requires one manual pass over the entire list, and grouping all the categories requires many manual passes. With the word segmentation system, by contrast, entering the root "manufacturer" lets the system automatically extract the long-tail words containing it and classify them by whether the root appears at the head, in the middle, or at the tail. Traditional manual word segmentation therefore has many disadvantages in operation.
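The head/middle/tail classification mentioned above can be illustrated with a short sketch. The function name `classify_by_root_position` and the sample phrases are hypothetical; the patent does not describe how the position test is implemented, so a plain substring check is assumed here.

```python
def classify_by_root_position(words, root):
    """Group the words that contain `root` by where the root occurs."""
    groups = {"head": [], "middle": [], "tail": []}
    for word in words:
        if root not in word:
            continue  # this word is not a long-tail word for the root
        if word.startswith(root):
            groups["head"].append(word)
        elif word.endswith(root):
            groups["tail"].append(word)
        else:
            groups["middle"].append(word)
    return groups


# Hypothetical sample data; 厂家 is the "manufacturer" root.
sample = ["厂家ffu报价", "ffu厂家", "苏州ffu厂家直销", "ffu价格"]
position_groups = classify_by_root_position(sample, "厂家")
```

A word that consists of the root alone is counted as "head" in this sketch, which is an arbitrary choice.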
Disclosure of Invention
Aiming at the technical problems in the related art, the invention provides a method for quickly segmenting words in network popularization, which can overcome the defects of the prior art.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
A method for quickly segmenting words in network popularization comprises the following steps:
S1, the user inputs a keyword, and the word segmentation system automatically mines all long-tail words containing the keyword and stores them as a txt file;
S2, the word segmentation system reads all long-tail words from the txt file, performs Chinese word segmentation, breaks up all long-tail words, extracts the keywords with high occurrence frequency, extracts the high-frequency roots, and returns them to the user;
S3, the user retains the effective words according to the high-frequency roots extracted by the word segmentation system;
S4, effective roots are screened out from the remaining effective words;
S5, the word segmentation system performs word segmentation according to the screened effective roots and exports an xls word segmentation table.
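Step S1 and the hand-off to step S2 through a txt file might look roughly like the sketch below. The patent does not say where the long-tail words are mined from, so the candidate pool, the case-insensitive substring match, and all function names here are assumptions made for illustration only.

```python
import os
import tempfile


def mine_long_tail_words(candidates, keyword):
    """Keep every candidate phrase that contains the keyword (case-insensitive)."""
    key = keyword.lower()
    return [c for c in candidates if key in c.lower()]


def save_words(path, words):
    """Store one long-tail word per line in a UTF-8 txt file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(words))


def load_words(path):
    """Read the long-tail words back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


# Hypothetical candidate pool standing in for the real mining source.
candidate_pool = ["FFU厂家", "ffu价格", "空气净化器", "苏州ffu"]
mined = mine_long_tail_words(candidate_pool, "ffu")

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "long_tail.txt")
    save_words(txt_path, mined)
    reloaded = load_words(txt_path)
```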
Further, the effective words are retained by extracting high-frequency roots with the word segmentation system, removing the invalid words, and repeating the operation until all invalid words have been removed.
Further, the effective roots are screened out from the high-frequency roots in the remaining effective words according to the principle of similar part of speech and identical structure, until no further effective word can be extracted from the remaining effective words.
Further, in the word segmentation stage, after the effective roots have been selected, the word segmentation system takes all effective roots selected by the user, extracts keywords of the same type from all long-tail words in root order, and classifies them; within each type, keywords of the same character length are placed in one column; and finally an xls word segmentation table is generated.
Further, the keywords are divided into columns as follows: words sharing the same type of effective root are first placed in one column; each column is then subdivided by keyword character length so that words of equal length share a column; regional words are then extracted from each column into a column of their own; and this is repeated for the content of every column.
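The column-building order described above (by effective root, then by character length, then regional words split out) can be sketched as follows. This is only one possible reading: the rule that the first matching root wins, the region list, and the tuple used as a column key are all assumptions that the patent does not specify.

```python
from collections import defaultdict


def build_columns(words, effective_roots, regions):
    """Return {(root, character_length, kind): [words]} with kind 'general' or 'region'."""
    # Pass 1: one group per effective root; the first matching root wins (assumption).
    by_root = defaultdict(list)
    for word in words:
        for root in effective_roots:
            if root in word:
                by_root[root].append(word)
                break

    # Pass 2: split each group by character length.
    # Pass 3: within each length, move regional words into their own column.
    columns = defaultdict(list)
    for root, members in by_root.items():
        for word in members:
            kind = "region" if any(r in word for r in regions) else "general"
            columns[(root, len(word), kind)].append(word)
    return dict(columns)


# Hypothetical sample data: 厂家 = manufacturer, 价格 = price, 苏州/深圳 = regions.
sample_words = ["ffu厂家", "ffu价格", "ffu厂家直销", "苏州ffu厂家", "深圳ffu价格"]
columns = build_columns(sample_words, ["厂家", "价格"], ("苏州", "深圳"))
```

Using `len(word)` counts Latin letters and Chinese characters alike, which may or may not match how the original system measures character length.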
The invention has the following beneficial effects: when keywords of the same type are exported by the system, they are automatically grouped by character length, which supports better-targeted promotion; the word segmentation result is exported to a local xls file with one click; the problems of slow word segmentation and missed keywords are addressed; the time enterprises spend screening effective words out of a large number of keywords and classifying and integrating them is reduced; and working efficiency and results are improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of keyword processing in the word segmentation system of the method for quickly segmenting words in network popularization according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating an implementation of the method for fast segmenting words in network popularization according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of an implementation process of the method for quickly segmenting words in network popularization according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in figs. 1-2, in the method for quickly segmenting words in network promotion according to the embodiment of the present invention, a user inputs a keyword, and the word segmentation system automatically mines all long-tail words containing that keyword and stores them as a txt long-tail word file. The system then reads all long-tail words from the txt file, breaks them up using the Chinese word segmentation root technique, and aggregates and counts the roots, directly analyzing the common roots and how often each occurs. In other words, it extracts the keywords with high occurrence frequency, extracts and summarizes the high-frequency roots, and returns them to the user for analysis and use.
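The break-up-and-count step can be sketched as below. The patent does not name the Chinese segmentation algorithm it uses, so forward maximum matching against a small lexicon is swapped in here purely as an illustration; the lexicon, the `min_count` threshold, and the function names are hypothetical.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def high_frequency_roots(words, lexicon, core_word, min_count=2):
    """Count multi-character roots other than the core word; keep those seen min_count times or more."""
    counts = Counter()
    for word in words:
        for tok in forward_max_match(word, lexicon):
            if tok != core_word and len(tok) > 1:
                counts[tok] += 1
    return [(root, n) for root, n in counts.most_common() if n >= min_count]


# Hypothetical lexicon and long-tail words for the core word "ffu".
lexicon = {"ffu", "厂家", "价格", "苏州", "检漏"}
long_tail = ["ffu厂家", "苏州ffu厂家", "ffu价格", "ffu厂家价格", "ffu检漏"]
top_roots = high_frequency_roots(long_tail, lexicon, core_word="ffu")
```

In practice a dedicated Chinese segmentation library would likely replace `forward_max_match`; the counting step would stay the same.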
The user removes invalid words according to the high-frequency roots automatically extracted by the word segmentation system, repeats the operation until all invalid words have been removed, and carries the remaining effective words into the next step.
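The repeat-until-clean loop described here might be sketched as follows. The user's judgment is modeled by a callback `is_invalid_root`, and segmentation by a `tokenize` function; both, along with `top_n` and the English sample phrases, are assumptions for illustration rather than details from the patent.

```python
from collections import Counter


def retain_valid_words(words, tokenize, is_invalid_root, top_n=5):
    """Repeatedly show the top roots, drop words containing rejected roots, stop when none are rejected."""
    remaining = list(words)
    while True:
        counts = Counter(tok for w in remaining for tok in tokenize(w))
        top_roots = [root for root, _ in counts.most_common(top_n)]
        rejected = {r for r in top_roots if is_invalid_root(r)}
        if not rejected:
            return remaining
        # Every rejected root occurs in at least one remaining word, so the list shrinks.
        remaining = [w for w in remaining
                     if not any(r in tokenize(w) for r in rejected)]


# Hypothetical data: space-separated English glosses so str.split can stand in for the segmenter.
phrases = ["ffu factory", "ffu price", "ffu recruitment", "ffu stock", "ffu factory price"]
invalid = {"recruitment", "stock"}  # illustrative choice of invalid roots
valid_words = retain_valid_words(phrases, str.split, lambda root: root in invalid)
```

Because only the `top_n` most frequent roots are reviewed per round, a rare invalid root can survive if enough valid roots outrank it; raising `top_n` trades review effort for coverage.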
Next, effective roots are screened out from the high-frequency roots of the remaining effective words according to the principle of similar part of speech and identical structure, until no further keyword can be extracted from the remaining keywords.
The word segmentation system then takes all effective roots selected by the user and extracts keywords of the same type from all long-tail words in root order. Within each type, keywords of the same character length are placed in one column; each column is further subdivided by keyword character length so that words of equal length share a column; regional words are then extracted from each column into a column of their own. The content of every column is processed in this way in turn, and finally an xls word segmentation table is generated.
In order to facilitate understanding of the above-described technical aspects of the present invention, the above-described technical aspects of the present invention will be described in detail below in terms of specific usage.
As shown in fig. 3, a core word is first input, here FFU. The system automatically mines long-tail words for FFU and finds 13649 of them. The system breaks up all the keywords according to the Chinese word segmentation principle and screens out 220 roots with high occurrence frequency, including: Shandong, energy conservation, format, leak detection, interlayer, after sale, cleanliness, resistance, achievement, application, encyclopedia, introduction, Germany, recommendation, Zhengzhou, materials, evaluation, problem, ceiling, Futai, ranking, typing, Kunshan, titer, air-out, comparison, laboratory, clean shop, spot inspection, distance, formaldehyde removal, schematic, professional, lifting, cause, pipeline, retrofit, positive pressure, efficiency, times, Wuhan, technology, switch, download, comparison, hundred thousand, DC motor, fiberglass, cleaning, instructions, recovery, Guangdong, treatment, air change ……
Effective roots and invalid roots are then identified from the roots listed above (only part of the list is shown). After the invalid roots are filtered out, the system automatically groups and sorts the remaining roots, preferably by character length, and then automatically screens out the regional words and sorts them separately, for example: energy savings, formats, leak detection, interlayers, after sales, resistance, reach, application, encyclopedia, introduction, recommendations, materials, evaluation, questions, ceiling, futai, ranking, typing, titer, air out, comparison, spot inspection, distance, specialty, hoisting, cause, piping, retrofit, positive pressure, efficiency, times, technology, switch, download, comparison, hundred thousand, cleaning, recovery, processing, ventilation, instructions, removal of formaldehyde, schematic, cleanliness, laboratories, clean shops, dc motors, fiberglass ……
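The filtering and ordering of roots in this example can be approximated with a short sketch. Which roots count as invalid or regional is decided by the user or by a list the patent does not give, so the sets below are hypothetical, and placing the regional roots after the general ones is only one possible reading of the text.

```python
def order_roots(roots, invalid_roots, regions):
    """Drop invalid roots, sort the rest by character length, and list regional roots separately."""
    kept = [r for r in roots if r not in invalid_roots]
    # sorted() is stable, so roots of equal length keep their original relative order.
    general = sorted((r for r in kept if r not in regions), key=len)
    regional = sorted((r for r in kept if r in regions), key=len)
    return general + regional


# Hypothetical sample: 山东/武汉 are regions, 招聘 (recruitment) is treated as invalid.
roots = ["山东", "节能", "检漏", "招聘", "除甲醛", "直流电机", "武汉"]
ordered = order_roots(roots, invalid_roots={"招聘"}, regions={"山东", "武汉"})
```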
Finally, the xls table is exported from the grouped keywords, as in the word list of Table 1.
ffu + characteristic, 9 characters | ffu + application, 9 characters | ffu + application, 9 characters | ffu + characteristic, 8 characters | ffu + region, 8 characters
ffu Automation | ffu food plants | ffu laminar flow table | ffu mute | ffu Chongqing
ffu side air intake | ffu clean room | ffu mute cover | ffu New air | ffu Shenzhen
ffu technical grade | ffu air shower | ffu clean room | | ffu Suzhou
ffu entrance/exit | ffu super clean bench | ffu clean room | |
Table 1. Word segmentation table
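Exporting columns of unequal length into a table like Table 1 can be sketched as below. The Python standard library has no xls writer, so CSV output is swapped in here as a stand-in; a real implementation would presumably use a spreadsheet library. The function names and the two sample columns are hypothetical.

```python
import csv
import io
from itertools import zip_longest


def columns_to_rows(columns):
    """Turn {header: [words]} into table rows, padding short columns with empty cells."""
    rows = [list(columns)]  # first row holds the column headers
    for row in zip_longest(*columns.values(), fillvalue=""):
        rows.append(list(row))
    return rows


def write_table(columns, handle):
    """Write the table as CSV to an open text handle."""
    csv.writer(handle).writerows(columns_to_rows(columns))


# Two columns borrowed from Table 1 (English renderings of the entries).
sample_columns = {
    "ffu + characteristic": ["ffu automation", "ffu side air intake"],
    "ffu + region": ["ffu Chongqing", "ffu Shenzhen", "ffu Suzhou"],
}
table_rows = columns_to_rows(sample_columns)
buffer = io.StringIO(newline="")
write_table(sample_columns, buffer)
```

Padding with empty cells keeps each keyword in its own column even when the columns differ in length, which matches the ragged layout of Table 1.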
In summary, with the technical scheme of the invention, keywords of the same type are automatically grouped by character length when the system exports them, which supports better-targeted promotion; the word segmentation result is exported to a local xls file with one click; the problems of slow word segmentation and missed keywords are addressed; the time enterprises spend screening effective words out of a large number of keywords and classifying and integrating them is reduced; and working efficiency and results are improved.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. A method for quickly segmenting words in network popularization is characterized by comprising the following steps:
S1, a user first inputs a keyword, and the word segmentation system automatically mines all long-tail words containing the keyword according to the keyword input by the user and stores them as a txt file;
S2, the word segmentation system reads all long-tail words from the txt file, performs word segmentation according to the Chinese word segmentation root technique, breaks up all long-tail words, extracts the keywords with high occurrence frequency, extracts the high-frequency roots, and returns them to the user;
S3, the user screens out invalid roots according to the high-frequency roots extracted by the word segmentation system and retains the remaining effective words;
S4, effective roots are screened out from the remaining effective words;
S5, the word segmentation system performs word segmentation according to the screened effective roots and exports an xls word segmentation table.
2. The method for quickly segmenting words in network popularization according to claim 1, wherein the effective words are retained by extracting high-frequency roots with the word segmentation system, removing invalid words, and repeating the operation until all invalid words have been removed.
3. The method for quickly segmenting words in network popularization according to claim 1, wherein the effective roots are screened out from the high-frequency roots in the remaining effective words according to the principle of similar part of speech and identical structure, until no extractable effective word remains among the remaining effective words.
4. The method for quickly segmenting words in network popularization according to claim 1, wherein in the word segmentation stage, after the effective roots have been selected, the word segmentation system extracts keywords of the same type from all long-tail words in root order and classifies them, and within each type places keywords of the same character length in one column, finally generating an xls word segmentation table.
5. The method for quickly segmenting words in network popularization according to claim 4, wherein the keywords are divided into columns by first placing words sharing the same type of effective root in one column, then subdividing each column by keyword character length so that words of equal length share a column, then extracting the regional words from each column into a column of their own, and repeating this for the content of every column.
CN202110469657.2A 2021-04-28 2021-04-28 Method for quickly segmenting words in network popularization Expired - Fee Related CN113032683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110469657.2A CN113032683B (en) 2021-04-28 2021-04-28 Method for quickly segmenting words in network popularization

Publications (2)

Publication Number Publication Date
CN113032683A CN113032683A (en) 2021-06-25
CN113032683B true CN113032683B (en) 2021-12-24

Family

ID=76454838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110469657.2A Expired - Fee Related CN113032683B (en) 2021-04-28 2021-04-28 Method for quickly segmenting words in network popularization

Country Status (1)

Country Link
CN (1) CN113032683B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101008864A (en) * 2006-01-28 2007-08-01 北京优耐数码科技有限公司 Multifunctional and multilingual input system for numeric keyboard and method thereof
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN103942347A (en) * 2014-05-19 2014-07-23 焦点科技股份有限公司 Word separating method based on multi-dimensional comprehensive lexicon
CN104408173A (en) * 2014-12-11 2015-03-11 焦点科技股份有限公司 Method for automatically extracting kernel keyword based on B2B platform
CN205878370U (en) * 2016-06-06 2017-01-11 深圳市亿鼎达科技有限公司 Household air purifier
CN108304377A (en) * 2017-12-28 2018-07-20 东软集团股份有限公司 A kind of extracting method and relevant apparatus of long-tail word
CN110717104A (en) * 2019-10-11 2020-01-21 广州市丰申网络科技有限公司 Keyword advertisement putting automatic negative keyword method and device
CN111831786A (en) * 2020-07-24 2020-10-27 刘秀萍 Full-text database accurate and efficient retrieval method for perfecting subject term
CN112257439A (en) * 2020-10-30 2021-01-22 上海明略人工智能(集团)有限公司 Method and device for mining hot root through public sentiment data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396870B2 (en) * 2009-06-25 2013-03-12 University Of Tennessee Research Foundation Method and apparatus for predicting object properties and events using similarity-based information retrieval and modeling
CN106445921B (en) * 2016-09-29 2019-05-07 北京理工大学 Utilize the Chinese text terminology extraction method of quadratic mutual information
CN107958078A (en) * 2017-12-13 2018-04-24 北京百度网讯科技有限公司 Information generating method and device
CN110032722A (en) * 2018-01-12 2019-07-19 北京京东尚科信息技术有限公司 Text error correction method and device
CN112148886A (en) * 2020-09-04 2020-12-29 上海晏鼠计算机技术股份有限公司 Method and system for constructing content knowledge graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings;Ahmed El-Kishky 等;《2019 IEEE International Conference on Big Data》;20200224;64-73 *
Zipfian frequency distributions facilitate word segmentation in context;Chigusa Kurumada 等;《Cognition》;20130615;第127卷(第3期);439-453 *
基于Hadoop自动文本分类的研究与实现;张勇勇;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20140115(第01期);I138-2329 *
统计机器翻译中的中文分词策略研究;奚宁;《中国优秀博硕士学位论文全文数据库(信息科技辑)》;20160315(第03期);I138-208 *

Also Published As

Publication number Publication date
CN113032683A (en) 2021-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211224