CN107589967A - A text-level big data statistical analysis method - Google Patents

Publication number: CN107589967A
Authority: CN (China)
Prior art keywords: statistical analysis, big data, field, split, text
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710879947.8A
Other languages: Chinese (zh)
Inventors: 黄礼成, 张蓉, 邓钢
Current assignee: Nanjing Harlu Mdt Infotech Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original assignee: Nanjing Harlu Mdt Infotech Ltd
Priority date: 2017-09-26 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2017-09-26
Publication date: 2018-01-16
Application filed by Nanjing Harlu Mdt Infotech Ltd
Priority to CN201710879947.8A
Publication of CN107589967A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text-level big data statistical analysis method. The method comprises: (1) building on the source code of native Linux core tools, splitting the original large file using xargs and split; (2) filtering key fields using grep and sed; (3) analyzing and computing static fields using cut and awk. For log files of tens or even hundreds of GB, or data files with hundreds of millions of records, the invention can perform statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.

Description

A text-level big data statistical analysis method
Technical field:
The present invention relates to a text-level big data statistical analysis method and belongs to the field of Internet technology.
Background art:
In recent years, with the spread of computer information technology and the rapid development of Internet technology, computer users have gradually changed from viewers of information into producers of information, and the volume of text data has grown sharply. Typical text data includes the text content of large-scale web pages and news reports, product descriptions and user comments on shopping websites, short text messages on news websites and social media, e-mail and chat records, and office documents produced at work. These text data increasingly show typical big data characteristics: large volume, fast updates, complex and varied formats, and uneven quality. On the one hand, these data contain great value, and the demand to mine and exploit text big data is growing ever stronger; at the same time, the increasingly severe problem of information overload has accompanied the emergence of massive text big data. The analysis and application of text big data therefore face brand-new opportunities and challenges.
Text analysis technology aims to use computer technology to represent, understand, and extract the lexical, grammatical, semantic, and other information contained in unstructured text strings, to mine and analyze the facts they contain as well as implicit positions, viewpoints, and values, and thereby to infer the intent and purpose of the text's author. Text analysis is a typical natural language processing task and a basic research problem in text mining and information retrieval. Its key subtasks include word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, semantic role labeling, text classification, text clustering, automatic summarization, sentiment analysis, information extraction, and entity matching and disambiguation. Traditional text analysis techniques are widely used in fields and systems such as automatic question answering, search engines, and user commercial intent recognition.
In understanding big data, people have summarized its 4V characteristics, namely large volume, variety, high velocity of production, and low value density, and a large number of technologies and tools have been created in response, promoting the development of the big data field. To make good use of big data, how to extract useful features from it effectively is also an important aspect.
Summary of the invention:
In view of the above problems, the purpose of the present invention is to provide a text-level big data statistical analysis method: a simple and efficient tool designed for fast statistical analysis, such as record counts, field sums, field averages, and field maxima and minima, of log files of tens or even hundreds of GB or data files with hundreds of millions of records.
The above purpose is achieved by the following technical solution:
A text-level big data statistical analysis method, comprising:
(1) building on the source code of native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) analyzing and computing static fields using cut and awk.
In the described text-level big data statistical analysis method, the native Linux core tools of step (1) include sed, awk, grep, split, xargs, and cut.
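Steps (1) and (2) above could be sketched in shell roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_and_filter`, the `chunk_` file prefix, the default chunk size of 1,000,000 lines, the four parallel jobs, and the whitespace-trimming sed expression are all hypothetical choices, and `xargs -P` is a GNU/BSD extension rather than POSIX.

```shell
# Sketch only: names, chunk size, and job count are assumptions for illustration.
# Usage: split_and_filter BIG_FILE PATTERN WORK_DIR [LINES_PER_CHUNK]
split_and_filter() {
  local big_file=$1 pattern=$2 work_dir=$3 lines=${4:-1000000}
  mkdir -p "$work_dir"

  # Step (1): split the large file into fixed-size chunks
  # (chunk_aaaa, chunk_aaab, ...); -a 4 allows many chunks.
  split -a 4 -l "$lines" "$big_file" "$work_dir/chunk_"

  # Step (1), continued: xargs feeds each chunk to a worker, up to 4 at a time.
  # Step (2): grep keeps lines matching the key pattern; sed trims trailing
  # whitespace. Each worker writes CHUNK.filtered next to its chunk.
  find "$work_dir" -name 'chunk_*' ! -name '*.filtered' -print0 |
    xargs -0 -P 4 -I{} sh -c \
      'grep -- "$1" "$2" | sed "s/[[:space:]]*$//" > "$2.filtered"' _ "$pattern" {}
}
```

Under these assumptions, `split_and_filter app.log ERROR work` would leave one `.filtered` file per chunk in `work`, and `cat work/chunk_*.filtered` would read them back in the original line order, since the chunk suffixes sort in the order the chunks were written.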
Beneficial effects:
For log files of tens or even hundreds of GB, or data files with hundreds of millions of records, the present invention can perform statistical analysis such as record counts, field sums, field averages, and field maxima and minima, simply and efficiently.
Detailed description of the embodiments:
Embodiment 1:
A text-level big data statistical analysis method, comprising:
(1) building on the source code of native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) analyzing and computing static fields using cut and awk.
In the described text-level big data statistical analysis method, the native Linux core tools of step (1) include sed, awk, grep, split, xargs, and cut.
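Step (3) above could be sketched in shell as follows. The patent does not give a concrete command line, so the function name `field_stats`, the comma default delimiter, and the output format are assumptions for illustration; the chosen field is assumed to hold one numeric value per record.

```shell
# Sketch only: function name, delimiter default, and output format are assumptions.
# Usage: field_stats FILE FIELD_NUMBER [DELIMITER]
field_stats() {
  local file=$1 field=$2 delim=${3:-,}

  # cut extracts the chosen column; awk accumulates the statistics in one pass.
  cut -d "$delim" -f "$field" "$file" |
    awk '
      NR == 1 { min = max = $1 }          # seed min/max from the first record
      {
        sum += $1                         # running total
        if ($1 < min) min = $1
        if ($1 > max) max = $1
      }
      END {
        if (NR > 0)
          printf "count=%d sum=%s avg=%s max=%s min=%s\n", NR, sum, sum / NR, max, min
      }
    '
}
```

For example, `field_stats orders.csv 2` (with a hypothetical `orders.csv`) would report the record count, sum, average, maximum, and minimum of the second comma-separated column.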
The technical means disclosed in the solution of the present invention are not limited to those disclosed above, but also include technical solutions formed by equivalent substitution of the above technical features. Matters not addressed in the present invention belong to the common knowledge of those skilled in the art.

Claims (2)

1. A text-level big data statistical analysis method, characterized in that the method comprises:
(1) building on the source code of native Linux core tools, splitting the original large file using xargs and split;
(2) filtering key fields using grep and sed;
(3) analyzing and computing static fields using cut and awk.
2. The text-level big data statistical analysis method according to claim 1, characterized in that the native Linux core tools of step (1) include sed, awk, grep, split, xargs, and cut.
CN201710879947.8A 2017-09-26 2017-09-26 A text-level big data statistical analysis method Pending CN107589967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710879947.8A CN107589967A (en) 2017-09-26 2017-09-26 A text-level big data statistical analysis method

Publications (1)

Publication Number Publication Date
CN107589967A (en) 2018-01-16

Family

ID=61047558

Country Status (1)

Country Link
CN (1) CN107589967A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254096A (en) * 2016-07-21 2016-12-21 柳州龙辉科技有限公司 A kind of processing means of Linux daily record

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MYLITTLEGARDEN: "The cut/tr/join/split/xargs commands in Linux", Blog, HTTPS://WWW.CNBLOGS.COM/SUNADA2005/P/3412801.HTML *
QIAODELI111: "Usage scenarios for cut, awk, and sed", Forum, HTTP://F.DATAGURU.CN/LINUX-744108-1-1.HTML *
新美好时代: "Introduction to using grep, sed, and awk in Linux", Blog, HTTPS://WWW.CNBLOGS.COM/NICETIME/P/6684229.HTML *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210824A (en) * 2018-11-21 2020-05-29 深圳绿米联创科技有限公司 Voice information processing method and device, electronic equipment and storage medium
CN111210824B (en) * 2018-11-21 2023-04-07 深圳绿米联创科技有限公司 Voice information processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180116