CN107589967A - A text-level big data statistical analysis method - Google Patents
- Publication number
- CN107589967A (application number CN201710879947.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text-level big data statistical analysis method. The method comprises: (1) with reference to the source code of the native Linux core tools, splitting the original large file using xargs and split; (2) filtering key fields using grep and sed; (3) performing analysis and calculation on static fields using cut and awk. For log files of tens or even hundreds of GB, or data files containing several hundred million records, the invention can carry out statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.
Description
Technical field:
The present invention relates to a text-level big data statistical analysis method and belongs to the field of Internet technology.
Background art:
In recent years, with the spread of computer information technology and the rapid development of Internet technology, computer users have gradually changed from viewers of information into producers of information, and the scale of text data has grown sharply. Typical text data includes the text content of large numbers of web pages, product descriptions and user comments on shopping websites, news reports on news websites, short text messages on social media, emails and chat records, and office documents produced at work. These text data increasingly show the typical characteristics of big data: large volume, fast updates, complex and varied formats, and uneven quality. On the one hand, these data contain enormous value, and the demand to mine and exploit text big data keeps growing; on the other hand, the emergence of massive text data is accompanied by an increasingly severe information overload problem. The analysis and application of text big data therefore face brand-new opportunities and challenges.
Text analysis technology aims to use computer technology to represent, understand, and extract the lexical, grammatical, and semantic information contained in unstructured text strings, to mine and analyze the facts and the implicit stances, opinions, and values present in them, and thereby to infer the intent and purpose of the author of the text. Text analysis is a typical natural language processing task and a fundamental research problem in text mining and information retrieval. Its key subtasks include word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, semantic role labeling, text classification, text clustering, automatic summarization, sentiment analysis, information extraction, and entity matching and disambiguation. Traditional text analysis technology has been widely applied in fields and systems such as automatic question answering, search engines, and recognition of users' commercial intent.
In coming to understand big data, people have summarized its "4V" characteristics, namely large volume, variety, high velocity of production, and low value density, and a large number of technologies and tools have been produced in response, advancing the field of big data. To make good use of big data, how to effectively extract useful features from it is an important aspect.
Summary of the invention:
The purpose of the present invention is to address the above problems by providing a text-level big data statistical analysis method: for log files of tens or even hundreds of GB, or data files containing several hundred million records, a simple and efficient tool is designed to perform fast statistical analysis such as record counts, field sums, field averages, and field maxima and minima.
The above purpose is achieved by the following technical solution:
A text-level big data statistical analysis method, the method comprising:
(1) with reference to the source code of the native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) performing analysis and calculation on static fields using cut and awk.
In the text-level big data statistical analysis method described above, the native Linux core tools referred to in step (1) include sed, awk, grep, split, xargs, and cut.
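The three steps can be sketched as a short shell script. This is an illustrative sketch only, not the implementation disclosed in the application: the file name sample.log, the space-delimited layout (timestamp, host, level, response time in ms), the chunk size of 2 lines, and the ERROR pattern are all assumptions chosen for the example.

```shell
#!/bin/sh
# Illustrative sketch; file names, field layout and patterns are assumed.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample log: timestamp, host, level, response time in ms.
cat > sample.log <<'EOF'
2017-09-26T10:00:01 web01 INFO 120
2017-09-26T10:00:02 web02 ERROR 340
2017-09-26T10:00:03 web01 INFO 95
2017-09-26T10:00:04 web03 ERROR 410
2017-09-26T10:00:05 web02 INFO 130
EOF

# Step 1: split the file into chunks of 2 lines each (chunk_aa, chunk_ab, ...).
split -l 2 sample.log chunk_

# Step 2: keep only the ERROR records (grep) and drop the date part of
# the timestamp (sed), leaving just the time of day.
grep -h ' ERROR ' chunk_* | sed 's/^[0-9-]*T//' > errors.txt

# Step 3: cut out the numeric field and let awk total it.
error_ms_total=$(cut -d' ' -f4 errors.txt | awk '{ s += $1 } END { print s }')
chunk_count=$(ls chunk_* | wc -l | tr -d ' ')

echo "chunks=$chunk_count total_error_ms=$error_ms_total"
```

On a real log the chunk size would be far larger (for example millions of lines per chunk), so that each piece is small enough to process comfortably while the number of pieces stays manageable.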
Beneficial effects:
For log files of tens or even hundreds of GB, or data files containing several hundred million records, the present invention can carry out statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.
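The statistics named here (record count, field sum, field average, field maximum and minimum) can all be gathered in a single pass with cut and awk. The sketch below assumes a comma-separated file whose third column is numeric; the temporary file, the delimiter, and the column position are hypothetical and are not taken from the application.

```shell
#!/bin/sh
# Illustrative sketch; the data layout is assumed.
set -eu

data=$(mktemp)
# Hypothetical records: id,user,amount
printf '%s\n' '1,alice,20' '2,bob,5' '3,carol,35' '4,dave,40' > "$data"

# One pass over column 3: count, sum, mean, min, max.
stats=$(cut -d, -f3 "$data" | awk '
    NR == 1 { min = $1 + 0; max = $1 + 0 }   # seed min and max from the first record
    {
        v = $1 + 0                           # force numeric interpretation
        sum += v
        if (v < min) min = v
        if (v > max) max = v
    }
    END {
        if (NR > 0)
            printf "count=%d sum=%d avg=%.2f min=%d max=%d\n", NR, sum, sum / NR, min, max
    }
')

echo "$stats"
rm -f "$data"
```

Because awk keeps only a handful of running values, memory use does not grow with the size of the input, which is what makes this approach practical for very large files.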
Detailed description:
Embodiment 1:
A text-level big data statistical analysis method, the method comprising:
(1) with reference to the source code of the native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) performing analysis and calculation on static fields using cut and awk.
In the text-level big data statistical analysis method described above, the native Linux core tools referred to in step (1) include sed, awk, grep, split, xargs, and cut.
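One plausible reading of the role of xargs in step (1) is to fan the chunks produced by split out to several worker processes and then merge the partial results. The application does not spell this out, so the following is a sketch under that assumption: the generated input file, the 250-line chunk size, the worker count of 4, and the ".stat" suffix are all hypothetical. Note that the -P option of xargs is a GNU/BSD extension and is not part of POSIX.

```shell
#!/bin/sh
# Illustrative sketch; parallel fan-out with xargs -P is an assumed interpretation.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: 1000 records, numeric value in field 2.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "rec" i, i }' > big.txt

# Split into 250-line chunks: part_aa ... part_ad.
split -l 250 big.txt part_

# Process each chunk independently, up to 4 awk processes at a time.
# Each worker writes "count sum" for its chunk to its own .stat file,
# so outputs from different workers never interleave.
printf '%s\n' part_* | xargs -n 1 -P 4 sh -c \
    'awk "{ n++; s += \$2 } END { print n, s }" "$1" > "$1.stat"' sh

# Merge the partial results into the final count, sum and average.
result=$(cat part_*.stat | awk '
    { n += $1; s += $2 }
    END { printf "count=%d sum=%d avg=%.1f\n", n, s, s / n }
')
echo "$result"
```

Counts and sums can be merged by simple addition, and maxima and minima by taking the extreme of the per-chunk values; an average must be derived from the merged sum and count rather than by averaging the per-chunk averages.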
The technical means disclosed in this solution are not limited to those disclosed above, but also include technical solutions formed by equivalent substitution of the above technical features. Matters not described in the present invention belong to the common knowledge of those skilled in the art.
Claims (2)
1. A text-level big data statistical analysis method, characterized in that the method comprises:
(1) with reference to the source code of the native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) performing analysis and calculation on static fields using cut and awk.
2. The text-level big data statistical analysis method according to claim 1, characterized in that the native Linux core tools referred to in step (1) include sed, awk, grep, split, xargs, and cut.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710879947.8A CN107589967A (en) | 2017-09-26 | 2017-09-26 | A text-level big data statistical analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710879947.8A CN107589967A (en) | 2017-09-26 | 2017-09-26 | A text-level big data statistical analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107589967A (en) | 2018-01-16 |
Family
ID=61047558
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710879947.8A (Pending, published as CN107589967A) | 2017-09-26 | 2017-09-26 | A text-level big data statistical analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107589967A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106254096A (en) * | 2016-07-21 | 2016-12-21 | 柳州龙辉科技有限公司 | A Linux log processing device |
Non-Patent Citations (3)
Title |
---|
MYLITTLEGARDEN: "The cut/tr/join/split/xargs commands in Linux", Blog, HTTPS://WWW.CNBLOGS.COM/SUNADA2005/P/3412801.HTML * |
QIAODELI111: "Usage scenarios for cut, awk, and sed", Forum, HTTP://F.DATAGURU.CN/LINUX-744108-1-1.HTML * |
新美好时代: "An introduction to using grep, sed, and awk in Linux", Blog, HTTPS://WWW.CNBLOGS.COM/NICETIME/P/6684229.HTML * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111210824A (en) * | 2018-11-21 | 2020-05-29 | 深圳绿米联创科技有限公司 | Voice information processing method and device, electronic equipment and storage medium |
CN111210824B (en) * | 2018-11-21 | 2023-04-07 | 深圳绿米联创科技有限公司 | Voice information processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210342404A1 (en) | System and method for indexing electronic discovery data | |
Koto et al. | Inset lexicon: Evaluation of a word list for Indonesian sentiment analysis in microblogs | |
CN102207948B (en) | Method for generating incident statement sentence material base | |
CN101539904B (en) | Automatic indexing method of quotations | |
CN104281653B (en) | A kind of opining mining method for millions scale microblogging text | |
US20060200341A1 (en) | Method and apparatus for processing sentiment-bearing text | |
US20060200342A1 (en) | System for processing sentiment-bearing text | |
EP2923282B1 (en) | Segmented graphical review system and method | |
Loza et al. | Building a Dataset for Summarization and Keyword Extraction from Emails. | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
CN111460162A (en) | Text classification method and device, terminal equipment and computer readable storage medium | |
Lejeune et al. | A new proposal for evaluating web page cleaning tools | |
CN107315799A (en) | A kind of internet duplicate message screening technique and system | |
Philemon et al. | A machine learning approach to multi-scale sentiment analysis of amharic online posts | |
CN112200674B (en) | Stock market emotion index intelligent calculation information system | |
Negara et al. | Topic modeling using latent dirichlet allocation (LDA) on twitter data with Indonesia keyword | |
CN107589967A (en) | A text-level big data statistical analysis method | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN107451215B (en) | Feature text extraction method and device | |
CN111026940A (en) | Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment | |
CN102207947B (en) | Direct speech material library generation method | |
CN102622405B (en) | Method for computing text distance between short texts based on language content unit number evaluation | |
CN113449063B (en) | Method and device for constructing document structure information retrieval library | |
Makinist et al. | Preparation of improved Turkish dataset for sentiment analysis in social media | |
CN112488593B (en) | Auxiliary bid evaluation system and method for bidding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180116 |