CN107589967A

CN107589967A - A text-level big data statistical analysis method

Info

Publication number: CN107589967A
Application number: CN201710879947.8A
Authority: CN
Inventors: 黄礼成; 张蓉; 邓钢
Original assignee: Nanjing Harlu Mdt Infotech Ltd
Current assignee: Nanjing Harlu Mdt Infotech Ltd
Priority date: 2017-09-26
Filing date: 2017-09-26
Publication date: 2018-01-16

Abstract

The present invention provides a text-level big data statistical analysis method. The method comprises: (1) with reference to the source code of native Linux kernel tools, splitting the original large file using xargs and split; (2) filtering key fields using grep and sed; (3) performing analysis and calculation on static fields using cut and awk. For log files of tens or even hundreds of GB, or data files of several hundred million records, the invention can perform statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.

Description

A text-level big data statistical analysis method

Technical field:

The present invention relates to a text-level big data statistical analysis method and belongs to the field of Internet technology.

Background art:

In recent years, with the popularization of computer information technology and the rapid development of Internet technology, computer users have gradually changed from viewers of information into producers of information, and the scale of text data has grown sharply. Typical text data includes the text content of the vast number of web pages, product descriptions on shopping websites, news reports and user comments on news websites, short text messages on social media, e-mails and chat records, and office documents produced at work. This text data increasingly shows typical big data characteristics: large volume, fast updates, complex and varied formats, and uneven quality. On the one hand, these data contain enormous value, and the demand to mine and exploit text big data is growing ever stronger; at the same time, the increasingly severe problem of information overload has led to the emergence of massive text big data. The analysis and application of text big data face brand-new opportunities and challenges.

Text analysis technology aims to use computer technology to represent, understand, and extract the lexical, grammatical, semantic, and other information contained in unstructured text strings, to mine and analyze the facts present in the text together with its implicit positions, viewpoints, and values, and thereby to infer the intention and purpose of the text's author. Text analysis is a typical natural language processing task and a fundamental research problem in text mining and information retrieval. Its key subtasks mainly include word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, semantic role labeling, text classification, text clustering, automatic summarization, sentiment analysis, information extraction, and entity matching and disambiguation. Traditional text analysis techniques have been widely applied in fields and systems such as automatic question answering, search engines, and user commercial intent recognition.

In understanding big data, people have summarized its 4V characteristics, namely large volume, variety, fast generation speed, and low value density, and a large number of technologies and tools have been produced in response, promoting the development of the big data field. To make good use of big data, how to effectively extract useful features from it is also an important aspect.

Summary of the invention:

The purpose of the present invention is to address the above problems by providing a text-level big data statistical analysis method: a simple and efficient tool designed for fast statistical analysis, such as record counts, field sums, field averages, and field maxima and minima, of log files of tens or even hundreds of GB or data files of several hundred million records.

The above purpose is achieved by the following technical solution:

A text-level big data statistical analysis method, comprising:

(1) with reference to the source code of native Linux kernel tools, splitting the original large file using xargs and split;

(2) filtering key fields using grep and sed;

(3) performing analysis and calculation on static fields using cut and awk.

In the text-level big data statistical analysis method described above, the native Linux kernel tools referred to in step (1) include sed, awk, grep, split, xargs, and cut.
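Step (1) can be illustrated with a minimal shell sketch. The patent does not give concrete commands, so the file names, the chunk size of 4 lines, the degree of parallelism, and the sample data below are all assumptions chosen for illustration; a real run would use far larger chunks. The `-P` option is available in GNU and BSD xargs but is not required by POSIX.

```shell
#!/bin/sh
# Sketch of step (1): split a large file, then process the chunks in parallel.
# File names, chunk size and sample data are illustrative assumptions.
set -e

workdir=$(mktemp -d)

# Generate a small stand-in for the "large" file: 10 records of "id,value".
i=1
while [ "$i" -le 10 ]; do
    printf '%d,%d\n' "$i" $((i * 10))
    i=$((i + 1))
done > "$workdir/big.log"

# split: cut the file into chunks of 4 lines each (chunk_aa, chunk_ab, ...).
split -l 4 "$workdir/big.log" "$workdir/chunk_"

# Count the chunks that split produced.
set -- "$workdir"/chunk_*
chunk_count=$#

# xargs: run one `wc -l` per chunk, up to 4 processes at a time, then add
# up the per-chunk line counts with awk.
total_lines=$(printf '%s\n' "$workdir"/chunk_* \
    | xargs -n 1 -P 4 wc -l \
    | awk '{ n += $1 } END { print n }')

echo "chunks=$chunk_count lines=$total_lines"

rm -rf "$workdir"
```

Because each chunk is handled by its own process, the per-chunk command (here `wc -l`) can be swapped for any filter or aggregation, as long as the partial results can be merged afterwards.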

Beneficial effects:

For log files of tens or even hundreds of GB, or data files of several hundred million records, the present invention can perform statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.
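The statistics named above (count, sum, average, maximum, minimum) can all be computed in a single pass with cut and awk. The following is only a sketch under stated assumptions: the comma-separated record layout, the choice of the third column, and the sample values are hypothetical and do not come from the patent.

```shell
#!/bin/sh
# Sketch: count, sum, average, maximum and minimum of one field.
# The "timestamp,status,latency_ms" layout is a hypothetical example.
set -e

data='2017-09-26T10:00:00,200,12
2017-09-26T10:00:01,200,30
2017-09-26T10:00:02,500,7
2017-09-26T10:00:03,200,51'

# cut selects the third field; awk accumulates all statistics in one pass,
# so the input never has to be held in memory.
stats=$(printf '%s\n' "$data" \
    | cut -d ',' -f 3 \
    | awk '
        {
            v = $1 + 0                      # force numeric interpretation
            if (NR == 1 || v < min) min = v
            if (NR == 1 || v > max) max = v
            sum += v
        }
        END {
            if (NR > 0)
                printf "count=%d sum=%d avg=%.2f max=%d min=%d\n", NR, sum, sum / NR, max, min
        }')

echo "$stats"
```

The `NR > 0` guard avoids a division by zero when the input is empty.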

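Embodiments:

Embodiment 1:

(1) with reference to the source code of native Linux kernel tools, splitting the original large file using xargs and split;

(2) filtering key fields using grep and sed;

(3) performing analysis and calculation on static fields using cut and awk.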
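The embodiment does not specify how the three steps are chained, so the following shell sketch shows one possible arrangement rather than the patented implementation. The log layout (`LEVEL service=NAME cost=N`), the helper script name `chunk_stats.sh`, and the chunk size are hypothetical. Each chunk yields a partial "count sum" pair, and the pairs are merged at the end, because per-chunk averages cannot simply be averaged.

```shell
#!/bin/sh
# Sketch of steps (1)-(3) chained together. The log layout and all file
# names are hypothetical.
set -e

workdir=$(mktemp -d)

cat > "$workdir/app.log" <<'LOG'
INFO service=pay cost=10
ERROR service=pay cost=40
INFO service=auth cost=5
ERROR service=auth cost=20
ERROR service=pay cost=60
INFO service=pay cost=15
LOG

# Step (1): split the file into chunks of 2 lines each.
split -l 2 "$workdir/app.log" "$workdir/part_"

# Steps (2) and (3) for one chunk, kept in a helper script so that xargs
# can start it as a separate process per chunk.
cat > "$workdir/chunk_stats.sh" <<'SCRIPT'
#!/bin/sh
# grep keeps ERROR lines, sed reduces each line to "service cost",
# cut keeps the cost column, awk emits a partial "count sum" pair.
grep 'ERROR' "$1" \
    | sed -e 's/^ERROR service=\([^ ]*\) cost=\([0-9]*\)$/\1 \2/' \
    | cut -d ' ' -f 2 \
    | awk '{ n++; s += $1 } END { print n + 0, s + 0 }'
SCRIPT

# Run the helper on every chunk (2 at a time) and merge the partial pairs.
merged=$(printf '%s\n' "$workdir"/part_* \
    | xargs -n 1 -P 2 sh "$workdir/chunk_stats.sh" \
    | awk '{ n += $1; s += $2 }
           END { if (n > 0) printf "count=%d sum=%d avg=%.2f\n", n, s, s / n }')

echo "$merged"

rm -rf "$workdir"
```

Counts and sums are order-independent, so it does not matter in which order the parallel processes finish. Maxima and minima could be merged the same way by emitting them per chunk and taking the overall extreme.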

The technical means disclosed in the solution of the present invention are not limited to those disclosed above, and also include technical solutions formed by equivalent substitution of the above technical features. Matters not described in detail in the present invention are common knowledge of those skilled in the art.

Claims

1. A text-level big data statistical analysis method, characterized in that the method comprises:

(1) with reference to the source code of native Linux kernel tools, splitting the original large file using xargs and split;

(2) filtering key fields using grep and sed;

(3) performing analysis and calculation on static fields using cut and awk.

2. The text-level big data statistical analysis method according to claim 1, characterized in that the native Linux kernel tools referred to in step (1) include sed, awk, grep, split, xargs, and cut.