CN106227885A - Big data processing method, device and terminal - Google Patents

Big data processing method, device and terminal

Info

Publication number
CN106227885A
CN106227885A
Authority
CN
China
Prior art keywords
data
word
structuring
structural
internet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610643603.2A
Other languages
Chinese (zh)
Inventor
杨志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinghe Group Co Ltd
Original Assignee
Xinghe Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinghe Group Co Ltd
Priority to CN201610643603.2A
Publication of CN106227885A
Legal status: Pending

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 - Integrating or interfacing systems involving database management systems
    • G06F16/258 - Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a big data processing method, device and terminal. The method includes: collecting big data from the Internet; structuring the collected big data; and putting the resulting structured data to use. The solution of the invention overcomes the prior-art defects of high processing difficulty, large storage footprint and low utilisation, achieving the benefits of low processing difficulty, a small storage footprint and high utilisation.

Description

Big data processing method, device and terminal
Technical field
The invention belongs to the technical field of data processing, and specifically relates to a big data processing method, device and terminal; in particular, it relates to a method for the extraction and modelling of named entities in unstructured Internet data, to a device corresponding to that method, and to a terminal equipped with that device.
Background technology
At present, Internet technology is developing at high speed, and the data deposited on the Internet is growing explosively and exponentially. Since the start of the 21st century, thanks to the rapid development of network hardware facilities and ever-cheaper storage media, the volume of data stored on the Internet has become unprecedentedly large; almost everyone in the world contributes an endless stream of data resources to it.
Big data has been one of the hot topics of the IT industry in recent years, and its application is gradually spreading across every industry. Big data, also known as massive data, refers to data whose scale is so huge that it cannot, within a reasonable time, be captured, managed, processed and organised by human effort or even by mainstream software tools into information of great value for enterprise management decisions.
Against this background, fields such as technology, commerce, management and finance are quietly undergoing enormous change, and people's thinking has begun a new round of transformation, greeting the arrival of the "big data" era and experiencing and adapting to the great changes it brings to ways of living and even ways of thinking.
However, such vast quantities of data mostly drift through the wide network world in unstructured, diverse, discrete forms. Without scientific methods and techniques to "mine" the knowledge they contain, this huge wealth of data has no scope to be exploited.
The prior art therefore suffers from defects such as high processing difficulty, a large storage footprint and low utilisation.
Summary of the invention
The object of the invention is to address the above defects by providing a big data processing method, device and terminal, so as to solve the prior-art problem that a large volume of data is stored on the Internet without being put to practical use, and thereby to improve utilisation.
The invention provides a big data processing method, including: collecting big data from the Internet; structuring the collected big data; and putting the resulting structured data to use.
Optionally, collecting big data from the Internet includes: obtaining data, mainly unstructured Internet text, by web-crawler technology; and building an unstructured information base from the obtained data.
Optionally, structuring the big data includes: creating a data model from the collected data and a preset structured-data target; extracting the unstructured data through the data model and applying preliminary formatting to it; and performing data cleansing and unified encoding on the preliminarily structured data to obtain the required structured data.
Optionally, putting the structured data to use includes: performing information-gain attribute selection on the structured data according to a preset gain threshold, and splitting on the attributes whose information-gain measure exceeds that threshold; performing multi-dimensional aggregation on the structured data whose granularity after splitting is finer than a preset fineness; and extracting the structured data that meets the preset dimensions.
Optionally, putting the structured data to use further includes: performing at least one of multi-dimensional packaging and presentation on the extracted structured data.
Matching the above method, another aspect of the invention provides a big data processing device, including: a collecting unit for collecting big data from the Internet; a structuring unit for structuring the big data; and a configuration unit for putting the resulting structured data to use.
Optionally, the collecting unit includes: an acquisition module for obtaining data, mainly unstructured Internet text, by web-crawler technology; and a storage module for building an unstructured information base from the obtained data.
Optionally, the structuring unit includes: a creation module for creating a data model from the collected data and the preset structured-data target; a formatting module for extracting the unstructured data through the data model and applying preliminary formatting to it; and a cleansing-and-encoding module for performing data cleansing and unified encoding on the preliminarily structured data to obtain the required structured data.
Optionally, the configuration unit includes at least one of a splitting module, an extraction module and an application module. The splitting module performs information-gain attribute selection on the structured data according to the preset gain threshold and splits on the attributes whose information-gain measure exceeds that threshold; the extraction module performs multi-dimensional aggregation on the structured data whose granularity after splitting is finer than the preset fineness and extracts the structured data that meets the preset dimensions; the application module performs at least one of multi-dimensional packaging and presentation on the extracted structured data.
Matching the above device, a further aspect of the invention provides a terminal including the big data processing device described above.
The solution of the invention, through the extraction and modelling of named entities in unstructured Internet data, can process big data and put it to use, with high resource utilisation and good environmental friendliness.
Further, by applying the named-entity extraction and modelling method designed here and its application system, the solution can structure big data, thereby reducing storage space and saving storage resources.
Further, the solution applies named-entity extraction and modelling based on natural-language processing of unstructured Internet text (for example: hidden Markov models, word co-occurrence graphs and the ID3 algorithm), combined with the distributed technology of Hadoop (a distributed system framework developed by the Apache Foundation), so it can better adapt to the huge, discrete character of Internet data while remaining flexibly extensible.
Thus, by collecting data from the Internet and structuring the collected data, the solution solves the prior-art problem that a large volume of data is stored on the Internet without practical use; it thereby overcomes the prior-art defects of high processing difficulty, large storage footprint and low utilisation, and achieves the benefits of low processing difficulty, a small storage footprint and high utilisation.
Other features and advantages of the invention will be set out in the description that follows, and will in part be apparent from the description or be understood by implementing the invention.
The technical solution of the invention is described in further detail below through the drawings and embodiments.
Accompanying drawing explanation
Fig. 1 is a flow diagram of an embodiment of the big data processing method of the invention;
Fig. 2 is a flow diagram of an embodiment of the acquisition process in the device of the invention;
Fig. 3 is a flow diagram of an embodiment of the structuring process in the device of the invention;
Fig. 4 is a flow diagram of an embodiment of the use process in the device of the invention;
Fig. 5 is a structural diagram of an embodiment of the big data processing device of the invention;
Fig. 6 is a structural diagram of the Scrapy crawler framework in an embodiment of the terminal of the invention;
Fig. 7 is a schematic diagram of the general principles of an embodiment of the terminal of the invention.
With reference to the drawings, the reference numerals in the embodiments of the invention are as follows:
102 - collecting unit; 1022 - acquisition module; 1024 - storage module; 104 - structuring unit; 1042 - creation module; 1044 - formatting module; 1046 - cleansing-and-encoding module; 106 - configuration unit; 1062 - splitting module; 1064 - extraction module; 1066 - application module.
Detailed description of the invention
To make the objects, technical solutions and advantages of the invention clearer, the technical solution of the invention is described clearly and completely below with reference to specific embodiments of the invention and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
According to an embodiment of the invention, a big data processing method is provided; Fig. 1 shows a flow chart of an embodiment of the method of the invention. The big data processing method may include:
Step S110: collect big data from the Internet.
For example: the massive, discrete text data of every shape and form on the Internet.
Optionally, the collection of big data from the Internet in step S110 can be explained in further detail with reference to the flow diagram of an embodiment of the acquisition process in the device of the invention shown in Fig. 2.
Step S210: obtain data, mainly unstructured Internet text, from the Internet by web-crawler technology.
For example: the acquisition of massive data, mainly unstructured Internet text, can use currently mature crawler technology. An Internet crawler obtains large amounts of the required information continuously by traversing the network formed by the links between websites; current mature web-crawler theory allows an efficient web crawler to be constructed.
Step S220: build an unstructured information base from the obtained data.
For example: to form the unstructured information base, after determining the data sources and the data scope, crawl the network information and build the unstructured information base.
For example: crawl named entities and related explanatory text from the data sources, and persist them to a mongodb unstructured database.
In one example, the network information is crawled and the unstructured information base is built. A crawler program written with Python tools crawls named entities and related explanatory text from the data sources and persists them to the mongodb unstructured database; the system's crawler is implemented with Python's scrapy tool. It must first simulate login in a non-specific mode, switch the user agent automatically when appropriate, and even switch IP proxies automatically when necessary. To guarantee the availability and reliability of the system, for some serious abnormal conditions the system of the invention can automatically send mail notifications, and the run of every program is preserved in the form of logs. Scrapy (a fast, high-level screen-scraping and web-crawling framework developed in Python, used to capture web sites and extract structured data from pages) uses the twisted asynchronous networking library to handle network communication; its overall architecture is shown in Fig. 6.
Scrapy mainly includes the following components:
(1) Engine (Scrapy Engine): handles the data-stream processing of the whole system and triggers transactions.
(2) Scheduler (Scheduler): accepts requests sent by the engine, pushes them into a queue, and returns them when the engine requests them again.
(3) Downloader (Downloader): downloads web-page content and returns it to the spiders.
(4) Spider (Spider): defines the parsing rules for a particular domain or page.
(5) Item Pipeline (Item Pipeline): processes the items the spiders extract from pages; its main tasks are cleaning, validating and storing data. After a page has been parsed by a spider, it is sent to the item pipeline and its data passes through several specific processing stages in order.
(6) Downloader middlewares (Downloader Middlewares): a hook framework between the Scrapy engine and the downloader; mainly processes the requests and responses passing between them.
(7) Spider middlewares (Spider Middlewares): a hook framework between the Scrapy engine and the spiders; its main job is processing the spiders' response input and request output.
(8) Scheduler middlewares (Scheduler Middlewares): middleware between the Scrapy engine and the scheduler; handles the requests and responses sent from the Scrapy engine to the scheduler.
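As a concrete illustration of the Item Pipeline stage (component 5), the minimal sketch below cleans, validates and stores extracted items. It is an assumption-laden stand-in, not the patent's actual code: the field names are invented, an in-memory list stands in for the mongodb collection, and only the shape of Scrapy's process_item(item, spider) hook is kept.

```python
# Sketch of an Item Pipeline stage: clean, validate and store items.
# Field names and the in-memory "store" are assumptions for illustration.
class EntityPipeline:
    def __init__(self):
        self.store = []  # stand-in for the mongodb collection

    def process_item(self, item, spider=None):
        name = (item.get("name") or "").strip()
        if not name:  # validation: reject items with no entity name
            raise ValueError("item has no name")
        item["name"] = name
        # cleaning: collapse whitespace left over from crawled HTML
        item["description"] = " ".join((item.get("description") or "").split())
        self.store.append(item)  # persistence step
        return item
```

In a real Scrapy project this class would be registered in the project's ITEM_PIPELINES setting and receive items yielded by the spiders.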
Thus, crawling and storing big data provides a precise and reliable basis for processing it, helping to make the processing of big data efficient and convenient.
Step S120: structure the big data.
For example: build an iterative distributed extraction framework.
Optionally, the structuring of the big data in step S120 can be explained in further detail with reference to the flow diagram of an embodiment of the structuring process in the device of the invention shown in Fig. 3.
Step S310: create a data model from the collected data and the preset structured-data target.
For example: data analysis: combine the intended use of the data with the relevant attributes of the professional or industry field to set the structuring target.
For example: data modelling: combine the crawled unstructured data with the set structured-data target and create the data model with the PowerDesigner tool.
Step S320: extract the unstructured data through the data model and apply preliminary formatting to it.
For example: preliminary data formatting: as shown in Fig. 7, an ETL tool extracts the unstructured text data from mongodb and, with the relevant formatting methods and timed tasks, stores it into the STAGE layer of a data-centre warehouse built on mysql 5.7. At this point the data is preliminarily formatted unstructured data.
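The preliminary-formatting step can be sketched as a small ETL routine. This is a hypothetical illustration, not the patent's implementation: in-memory lists stand in for mongodb and the mysql STAGE tables, and all field names are assumptions.

```python
# Sketch of the preliminary-formatting ETL step: pull raw documents from the
# unstructured store and flatten them into STAGE-layer rows.
def format_to_stage(raw_docs):
    stage_rows = []
    for doc in raw_docs:
        stage_rows.append({
            "entity": (doc.get("entity") or "").strip(),
            # collapse runs of whitespace left over from crawled pages
            "text": " ".join((doc.get("text") or "").split()),
            "source": doc.get("source", "unknown"),
        })
    return stage_rows
```

In the pipeline described above, such a routine would run as a timed task, with the returned rows loaded into the STAGE tables.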
Step S330: perform data cleansing and unified encoding on the preliminarily structured data to obtain the required structured data.
In one example, structured data: as shown in Fig. 7, following an approach based on natural-language processing, stored procedures are written to perform data cleansing and unified encoding on the formatted (unstructured) text; in coordination with the ETL tool and crontab timed tasks, the data is automatically stored in the ODS layer of the data warehouse, at which point it possesses structured characteristics. To reach this target, this stage mainly applies a hidden Markov model and the word co-occurrence graph method to the massive unstructured data.
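The patent performs cleansing and unified encoding with database stored procedures; as an assumed illustration of the same idea in Python, the sketch below normalises Unicode forms into one uniform encoding, strips control characters and collapses whitespace.

```python
import unicodedata

# Sketch of cleansing plus unified coding: NFKC normalisation folds
# full-width forms to half-width, control characters are dropped, and
# whitespace runs are collapsed, so every record reaches the ODS layer
# in one consistent form.
def clean_and_encode(text):
    text = unicodedata.normalize("NFKC", text)  # unified coding
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("C")
                   or ch in "\n\t ")
    return " ".join(text.split())  # collapse whitespace runs
```

For example, full-width "Ａ" and the ideographic space both fold to their ASCII counterparts, which matters when crawled Chinese pages mix encodings.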
For example: using a hidden Markov model and the word co-occurrence graph method can effectively improve the efficiency of named-entity recognition; combined with the ID3 algorithm, the most valuable, comparatively standardised structured data, minimised in per-record size, is extracted from it, providing convenient, flexible, accurate and manageable valid-data support for subsequent applications.
For example: the hidden Markov model is applied in natural-language processing to Chinese word segmentation and part-of-speech tagging, mainly for unordered unstructured text data. In part-of-speech tagging, the tag sequence is hidden before annotation and is the target to be solved, while the given word string is the sequence of observable symbols, known before annotation. If the part-of-speech tagging problem is modelled as an HMM, the set of part-of-speech tags is fixed (so the number of HMM states is fixed) and the words corresponding to each tag are fixed: in the dictionary, each word has one or several definite part-of-speech labels. Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given the word (observation) sequence W = (w_1, w_2, w_3, w_4, ..., w_m), find the part-of-speech (state) sequence T = (t_1, t_2, t_3, t_4, ..., t_m) of greatest possibility, i.e. the one that maximises the conditional probability P(T|W). P(T|W) is at present difficult to estimate directly, so a Bayesian transformation is generally applied, namely:
P(T|W) = P(T) · P(W|T) / P(W)
In part-of-speech tagging W is given, and P(W) does not depend on T, so P(W) need not be considered when maximising P(T|W). Applying the joint-probability formula P(A, B) = P(A) · P(B|A) gives:
P(T|W) = P(T) · P(W|T)
Applying the probability multiplication formula further to the above gives:
P(T|W) = P(w_{1..m}, t_{1..m}) = ∏_{i=1..m} P(w_i, t_i | w_{1..i-1}, t_{1..i-1}) = ∏_{i=1..m} P(w_i | w_{1..i-1}, t_{1..i-1}) · P(t_i | w_{1..i-1}, t_{1..i-1})
where w_{1..i} = w_1, w_2, ..., w_i, t_{1..i} = t_1, t_2, ..., t_i, and 1 ≤ i ≤ m.
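Maximising P(T)·P(W|T) over tag sequences is conventionally done with bigram transition and emission estimates and the Viterbi algorithm. The patent gives no code, so the decoder below, with its invented two-tag toy model and smoothing constant, is only an illustrative sketch of that standard approach.

```python
# Toy Viterbi decoder for the bigram HMM approximation: argmax over tag
# sequences of start[t_1]*emit[t_1][w_1] * prod trans[t_{i-1}][t_i]*emit[t_i][w_i].
# Tags, probabilities and the 1e-8 smoothing for unseen words are invented.
def viterbi(words, tags, start, trans, emit):
    V = [{t: start[t] * emit[t].get(words[0], 1e-8) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][prev] * trans[prev][t] * emit[t].get(w, 1e-8)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for ptr in reversed(back):  # follow back-pointers to recover the sequence
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With transition and emission tables estimated from a tagged corpus, the same decoder yields the most probable tag sequence for a segmented Chinese sentence.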
Word co-occurrence graph method: when two words occur adjacently, each is called a co-occurring word of the other. Suppose all the distinct word forms in a large corpus form a dictionary D = [W_0, W_1, W_2, ..., W_n]. Each word W_i is represented as a vector F(W) = [F_0, F_1, F_2, ..., F_n], where F_i indicates whether word W_i is a co-occurring word of W. Any two words that appear in similar contexts are considered semantically related; the relatedness of any two words W_i, W_j is the number of their common co-occurring words, i.e. the vector product F(W_i)·F(W_j). For convenience, the graph this method builds is called a One-Hot co-occurrence graph. Considering only the number of identical co-occurring words cannot compare semantic relatedness effectively, because this method does not take the degree of co-occurrence between words into account: the more often two words co-occur, the more likely they are a common collocation, and the semantics of common collocations matter greatly for judging whether two words are related. For example, many verbs collocate before "time", such as "establish", "exit", "enter" and "invest"; but "invest" has an extremely rich context and should not carry the same weight as "establish". In Chinese, word use is very flexible and the contexts of most words are widely distributed, so a counting-only method produces a great deal of noise. Therefore, treating words by analogy with documents, a method similar to Tf-Idf is used to compute the edge weights; the graph so constructed is called a Tf-Idf co-occurrence graph. The formulas are as follows:
Let the dictionary be D = [W_0, W_1, W_2, ..., W_n], and let FC = [FC_0, FC_1, FC_2, ..., FC_n] denote the number of co-occurring words of each word.
Let TF(W) = [TF_0, TF_1, TF_2, ..., TF_n], where TF_i denotes the number of times each word co-occurs with W.
Let TfIdf(W) = [TfIdf_0, TfIdf_1, TfIdf_2, ..., TfIdf_n], where TfIdf_i denotes the degree of closeness between each word and W. By the TfIdf computing formula, TfIdf_i = TF_i · log((N+1)/FC_i).
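Under the definitions above, a Tf-Idf co-occurrence graph can be sketched as follows: adjacency counts as co-occurrence, and the edge from w to a co-occurring word c is weighted TF(w, c) · log((N+1)/FC(c)), where FC(c) is the number of distinct neighbours of c and N is the vocabulary size. The function name and the symmetric treatment of adjacency are assumptions.

```python
import math
from collections import defaultdict

# Build a Tf-Idf co-occurrence graph from tokenised sentences.
def cooccurrence_graph(sentences):
    tf = defaultdict(int)          # (w, c) -> co-occurrence count TF
    neighbours = defaultdict(set)  # w -> distinct co-occurring words (FC)
    vocab = set()
    for sent in sentences:
        vocab.update(sent)
        for a, b in zip(sent, sent[1:]):   # adjacent words co-occur
            for w, c in ((a, b), (b, a)):
                tf[(w, c)] += 1
                neighbours[w].add(c)
    n = len(vocab)
    return {(w, c): count * math.log((n + 1) / len(neighbours[c]))
            for (w, c), count in tf.items()}
```

The weighting behaves as the text argues: an edge to a promiscuous word (one with many distinct neighbours, like "invest" above) is discounted relative to an edge to a selective one.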
Thus, by processing unstructured data into structured data, a reliable form is provided for big-data-based storage and use, and the result is user-friendly.
Step S130: put the resulting structured data to use.
In one example, these unstructured-data resources are automatically parsed and managed in full, realising the extraction of unstructured named entities and mapping them, through modelling techniques, onto a relation-model network of structured named entities.
In one example, deep data parsing: as shown in Fig. 7, the ID3 algorithm is introduced to further refine valuable data and store it in the DW layer of the data warehouse. In order to refine concise, most-valuable data, the invention selects the ID3 algorithm for deep analysis and extraction of the data that has acquired structured characteristics, as follows:
ID3 algorithm: from information theory, the less information the user requires, the larger the information gain and hence the higher the purity. The core idea of ID3 is therefore to measure attribute selection by information gain, splitting on the attribute with the largest information gain after the split. Several concepts to be used are defined first.
Let D be the partition of the training tuples by class, and let p_i denote the probability that the i-th class appears in the whole training set, estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information needed for the class label of a tuple in D:
Info(D) = -∑_{i=1..m} p_i · log2(p_i)
Now suppose the training tuples D are partitioned by attribute A into subsets D_1, ..., D_v. The expected information needed after partitioning D by A is:
Info_A(D) = ∑_{j=1..v} (|D_j| / |D|) · Info(D_j)
and the information gain is the difference between the two:
Gain(A) = Info(D) - Info_A(D).
At every split, the ID3 algorithm computes the information gain of each attribute and then splits on the attribute with the largest gain.
The use of the ID3 algorithm to construct a decision tree is illustrated below with the example of detecting business-registration information in an investment institution. For simplicity, assume the registration information comprises 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is real. The information gain of each attribute is then computed as:
Info(D) = -0.7·log2(0.7) - 0.3·log2(0.3) = 0.7×0.51 + 0.3×1.74 = 0.879;
Gain(L) = 0.879 - 0.603 = 0.276.
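The Info and Gain computations of the worked example can be sketched directly. Since the 10-element registration table itself does not survive in the text, the tiny records in the checks below are invented stand-ins, not the patent's data.

```python
import math
from collections import Counter, defaultdict

# Sketch of ID3 attribute selection using the Info/Gain formulas above.
def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, label):
    by_value = defaultdict(list)   # attribute value -> class labels in D_j
    for r in rows:
        by_value[r[attr]].append(r[label])
    n = len(rows)
    expected = sum(len(part) / n * info(part)       # Info_A(D)
                   for part in by_value.values())
    return info([r[label] for r in rows]) - expected  # Gain(A)
```

Picking the split attribute is then max(attrs, key=lambda a: gain(rows, a, label)), applied recursively to each partition.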
Data valorisation: as shown in Fig. 7, combining the value demands of the professional field, the comparatively fine-grained structured data is aggregated across multiple dimensions and stored in the DM/EDW layer of the data warehouse.
A custom API is built with Java tools on the data-warehouse DM/EDW layer shown in Fig. 2.
A reporting system is developed with the bootstrap front-end framework, the echarts chart library and the jquery javascript library; it packages and presents, across multiple dimensions, the valuable data extracted and assembled by the system, finally serving the purpose of providing an important basis for decision-making.
The detailed process by which step S130 puts the resulting structured data to use is explained further below with reference to the flow diagram of an embodiment of the use process in the device of the invention shown in Fig. 4.
Step S410: perform information-gain attribute selection on the structured data according to the preset gain threshold, and split on the attributes whose information-gain measure exceeds that threshold.
Step S420: perform multi-dimensional aggregation on the structured data whose granularity after splitting is finer than the preset fineness, and extract the structured data that meets the preset dimensions.
Thus, through the splitting and extraction of structured data, the structured data becomes small data that is easy to store, easy to transmit and easy to use; it occupies little space and helps save resources.
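The multi-dimensional aggregation of step S420 can be sketched as a roll-up from fine-grained rows to a chosen set of dimensions; the dimension and measure names below are assumptions for illustration.

```python
from collections import defaultdict

# Roll fine-grained structured rows up to a chosen dimension set,
# summing a numeric measure, before loading a DM/EDW-style layer.
def aggregate(rows, dims, measure):
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)
```

Choosing a coarser dims list yields the smaller, coarser-grained data the step describes; the full dims list reproduces the fine grain.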
Optionally, the detailed process by which step S130 puts the resulting structured data to use may also include: Step S430: perform at least one of multi-dimensional packaging and presentation on the extracted structured data.
In one example, (1) the general hierarchical structure of the system is shown in Fig. 7.
(2) Data structure: the invention selects the PowerDesigner tool for data-structure design.
(3) The hidden Markov model is selected, with stored procedures as the implementation method; for unordered unstructured text data, this model and implementation are used to parse named entities in two respects, Chinese word segmentation and part-of-speech tagging.
Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given the word (observation) sequence W = (w_1, w_2, w_3, w_4, ..., w_m), find the part-of-speech (state) sequence T = (t_1, t_2, t_3, t_4, ..., t_m) of greatest possibility, i.e. the one that maximises the conditional probability P(T|W). P(T|W) is at present difficult to estimate directly, so a Bayesian transformation is generally applied, namely:
P(T|W) = P(T) · P(W|T) / P(W)
In part-of-speech tagging W is given, and P(W) does not depend on T, so P(W) need not be considered when maximising P(T|W). Applying the joint-probability formula P(A, B) = P(A) · P(B|A) gives:
P(T|W) = P(T) · P(W|T)
Applying the probability multiplication formula further to the above gives:
P(T|W) = P(w_{1..m}, t_{1..m}) = ∏_{i=1..m} P(w_i, t_i | w_{1..i-1}, t_{1..i-1}) = ∏_{i=1..m} P(w_i | w_{1..i-1}, t_{1..i-1}) · P(t_i | w_{1..i-1}, t_{1..i-1})
where w_{1..i} = w_1, w_2, ..., w_i, t_{1..i} = t_1, t_2, ..., t_i, and 1 ≤ i ≤ m.
(4) select word cartographical representation altogether, resolved by the text data realized in name entity that arranges of weight, Implementation method is as follows:
Word altogether cartographical representation, when two words are adjacent occur time, claim the co-occurrence word of the two word the other side each other.Assume In a large amount of language materials, all different morphologies become dictionary, D=[W0, W1, W2..., Wn].Each word Wi is expressed as a vectorial F (W) =[F0, F1, F2..., Fn].Fi represents that whether word Wi is the co-occurrence word of W.Any two word is if there is similar In context, then it is assumed that they are semantic relevant.Any two word Wi, WjThe number that degree of association is its common co-occurrence word, I.e. vector F (Wi)·F(Wj).For convenience's sake, the figure that the method is constituted becomes One-Hot co-occurrence figure.Only consider identical Co-occurrence word number can not effectively carry out the comparison of semantic dependency, and this method does not accounts between each word Co-occurrence degree.The number of times of two word co-occurrences is the most, and two words are more likely to be commonly used collocation.Semanteme in commonly used collocation is right In comparing whether two words are correlated with extremely important.Such as " time " above with a lot of verb collocation, such as " set up ", " exiting ", " typing ", " investment " etc., but " investment " has extremely abundant context, should not have identical power with " establishment " Weight.In Chinese, word uses very flexible, and the context of major part word is widely distributed, and the method only by counting can produce A large amount of noises.Therefore, word analogy document, used the method for similar Tf-Idf to calculate the weight on limit, claimed the party's Vinculum iuris structure The figure made is Tf-Idf co-occurrence figure.Formula is as follows:
Let the dictionary be D = [W0, W1, W2, …, Wn], and let FC = [FC0, FC1, FC2, …, FCn] denote the number of co-occurrence words of each word.
Let TF(W) = [TF0, TF1, TF2, …, TFn], where TFi denotes the number of times each word co-occurs with W.
Let TfIdf(W) = [TfIdf0, TfIdf1, TfIdf2, …, TfIdfn], where TfIdfi denotes the degree of closeness between each word and W. From the TfIdf computing formula, TfIdfi = TFi * log((N+1)/FCi).
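As a minimal sketch of this weighting (assuming adjacency-based co-occurrence and the TFi * log((N+1)/FCi) form given above; the corpus, words and counts below are illustrative, not from the invention), the edge weights of a Tf-Idf co-occurrence graph might be computed as:

```python
import math
from collections import defaultdict

def tfidf_cooccurrence(sentences):
    """Build Tf-Idf co-occurrence edge weights from tokenized sentences.

    Adjacent words are treated as co-occurrence words of each other;
    the weight of edge (w, v) is TF(w, v) * log((N + 1) / FC(v)),
    where TF is the co-occurrence count, FC(v) the number of distinct
    co-occurrence words of v, and N the dictionary size.
    """
    tf = defaultdict(int)          # (w, v) -> co-occurrence count
    neighbors = defaultdict(set)   # w -> set of its co-occurrence words
    vocab = set()
    for tokens in sentences:
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            tf[(a, b)] += 1
            tf[(b, a)] += 1
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(vocab)
    return {
        (w, v): count * math.log((n + 1) / len(neighbors[v]))
        for (w, v), count in tf.items()
    }

# Toy corpus echoing the "time"/"invest" example from the text.
corpus = [["time", "establish"], ["time", "exit"], ["time", "invest"],
          ["invest", "money"], ["invest", "stock"]]
weights = tfidf_cooccurrence(corpus)
```

Because "invest" has more distinct co-occurrence words than "establish", the edge from "time" to "invest" gets a lower weight than the edge to "establish", matching the motivation above.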
(5) The ID3 algorithm is selected to achieve the management and valuation of data, for ease of application. From information theory, the smaller the expected information, the larger the information gain, and thus the higher the purity. The core idea of the ID3 algorithm is therefore to measure attribute selection by information gain, selecting the attribute with the maximum information gain after splitting for the split. First, several concepts to be used are defined. Let D be the partition of the training tuples by class, where pi denotes the probability that the i-th class appears in the full set of training tuples, which can be estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information required for the class label of a tuple in D. Now suppose the training tuples D are partitioned by attribute A; then info_A(D) is the expected information of partitioning D by A, and the information gain is the difference between the two:
Gain(A) = info(D) − info_A(D).
The ID3 algorithm, at each required split, computes the information gain of each attribute and then selects the attribute with the maximum gain for the split.
Below, the example of industrial and commercial information detection in an investment institution illustrates how to construct a decision tree with the ID3 algorithm. For simplicity, assume the industrial and commercial information contains 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is real; then the information gain of each attribute is computed as:
Info(D) = −0.7·log₂0.7 − 0.3·log₂0.3 = 0.7×0.51 + 0.3×1.74 ≈ 0.879
Gain(L) = 0.879 − 0.603 = 0.276.
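A minimal sketch of this computation, assuming the 7/3 class split implied by Info(D) above; since the 10-element table itself is not reproduced here, the per-branch class distributions for attribute L are hypothetical stand-ins chosen only to illustrate the formula:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def info_gain(parent_probs, partitions):
    """Information gain Gain(A) = info(D) - info_A(D).

    partitions: one (weight, class_probabilities) pair per branch of
    the split, where weight is the fraction of tuples in that branch.
    """
    info_d = entropy(parent_probs)
    info_a = sum(w * entropy(probs) for w, probs in partitions)
    return info_d - info_a

# Class split taken from the text: 7 positive vs 3 negative tuples.
info_d = entropy([0.7, 0.3])          # info(D), ≈ 0.881 unrounded
# Hypothetical split by log density L into s / m / l branches.
gain_l = info_gain([0.7, 0.3],
                   [(0.3, [1/3, 2/3]),
                    (0.4, [3/4, 1/4]),
                    (0.3, [1.0])])
```

With exact logarithms info(D) ≈ 0.881 (the 0.879 in the text comes from the rounded values 0.51 and 1.74); ID3 would repeat this computation for F, H and R and split on the attribute with the largest gain.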
Thus, through the application of the structured data obtained after splitting and extraction, the utilization rate of big data can be promoted; the scheme is environmentally friendly and makes high use of resources.
A large number of experiments verify that, using the technical scheme of this embodiment, through the extraction and modeling technique for named entities in Internet unstructured data, the processing of big data can be realized and the big data can then be utilized, with high resource utilization and good environmental friendliness.
According to an embodiment of the invention, a processing device for big data, corresponding to the processing method for big data, is further provided. Figure 5 shows a schematic structural diagram of one embodiment of the device of the invention. The processing device for big data may include: a collecting unit 102, a structuring unit 104 and a dispensing unit 106.
In one embodiment, the collecting unit 102 may be used for collecting big data from the Internet. For the concrete function and processing of the collecting unit 102, see step S110.
For example: Internet text data that is massive, discrete and comes in every shape.
Optionally, the collecting unit 102 may include: an acquisition module 1022 and a memory module 1024.
In one example, the acquisition module 1022 may be used for obtaining, through web crawler technology, data from the Internet consisting mainly of unstructured Internet text. For the concrete function and processing of the acquisition module 1022, see step S210.
For example: data acquisition based on massive unstructured Internet text can make use of currently mature crawler technology. An Internet crawler obtains large amounts of required information by extensively traversing the network formed by the links within websites; mature web crawler theory can be used to construct an efficient web crawler.
In one example, the memory module 1024 may be used for building an unstructured database based on the obtained data. For the concrete function and processing of the memory module 1024, see step S220.
For example: an unstructured information repository is formed; after the data source and data scope are determined, network information crawling and unstructured repository construction are carried out.
For example: named entities and the relevant text interpretations are crawled from the data source and persistently stored into the mongodb unstructured database.
In one example, network information crawling and unstructured repository construction are carried out. A Python tool is used to write the crawler program, which crawls named entities and relevant text interpretations from the data source and persistently stores them into the mongodb unstructured database. The system crawler is implemented with the python scrapy tool. It must first simulate login under a nonspecific pattern, switch the user agent automatically when appropriate, and switch the IP proxy automatically when necessary. To guarantee the availability and reliability of the system, for serious abnormal conditions the system of the present invention possesses the function of automatically sending mail notification, and the whole program run is preserved in the form of logs. Scrapy (a fast, high-level screen-scraping and web-crawling framework developed in Python, used to capture web sites and extract structured data from pages) employs the twisted asynchronous networking library to process network communication; its overall architecture is shown in Figure 1.
Scrapy mainly includes the following components:
(1) Engine (Scrapy Engine): processes the data stream of the whole system and triggers transactions.
(2) Scheduler: accepts the requests sent by the engine, presses them into a queue, and returns them when the engine requests them again.
(3) Downloader: downloads web page content and returns it to the spider framework.
(4) Spider: formulates the parsing rules for a certain domain name or web page.
(5) Item Pipeline: responsible for processing the items the spider extracts from web pages; its main task is to clean, verify and store the data. After a page is parsed by the spider, it is sent to the item pipeline and the data is processed through several specific steps in order.
(6) Downloader Middlewares: a hook framework between the Scrapy engine and the downloader, mainly processing the requests and responses between the Scrapy engine and the downloader.
(7) Spider Middlewares: a hook framework between the Scrapy engine and the spider, whose main work is processing the spider's response input and request output.
(8) Scheduler Middlewares: middleware between the Scrapy engine and the scheduler, through which requests and responses are sent from the Scrapy engine to the scheduler.
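The request/response flow among these components can be sketched as a toy in-memory analogue (plain Python with hypothetical page fixtures, not the real Scrapy API): the engine drives the data stream, the scheduler queues requests, the downloader fetches pages, the spider applies parsing rules, and the pipeline stores items.

```python
from collections import deque

# Hypothetical page fixtures standing in for downloaded web content.
PAGES = {
    "http://example.com/a": {"entity": "Acme Corp",
                             "links": ["http://example.com/b"]},
    "http://example.com/b": {"entity": "Beta Ltd", "links": []},
}

def downloader(request):
    return PAGES[request]                 # download web page content

def spider(response):
    item = {"entity": response["entity"]}  # parsing rule: extract entity
    return item, response["links"]         # item plus follow-up requests

def pipeline(item, store):
    store.append(item)                     # clean/verify/store stage

def engine(start_url):
    scheduler, store, seen = deque([start_url]), [], set()
    while scheduler:                       # engine processes the data stream
        request = scheduler.popleft()
        if request in seen:
            continue
        seen.add(request)
        item, links = spider(downloader(request))
        pipeline(item, store)
        scheduler.extend(links)            # new requests pressed into queue
    return store

items = engine("http://example.com/a")
```

In real Scrapy the same loop is asynchronous (via twisted) and the middlewares hook in on either side of the downloader and spider; this sketch only makes the component responsibilities concrete.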
Thus, by crawling and storing big data, precise and reliable grounds can be provided for the processing of big data, which benefits the efficiency and convenience of that processing.
In one embodiment, the structuring unit 104 may be used for carrying out structuring processing on the big data. For the concrete function and processing of the structuring unit 104, see step S120.
For example: building an iterative distributed extraction framework.
Optionally, the structuring unit 104 may include: a creation module 1042, a formatting module 1044, and a cleaning and coding module 1046.
In one example, the creation module 1042 may be used for creating a data model according to the collected data and a preset structured-data target. For the concrete function and processing of the creation module 1042, see step S310.
For example: data analysis, in which the structured-data target is formulated by combining the application purpose of the data with the relevant professional attributes of the industry field.
For example: data modeling, in which a data model is created with the PowerDesigner tool by combining the crawled unstructured data with the set structured-data target.
In one example, the formatting module 1044 may be used for extracting the unstructured data from the data through the data model and carrying out preliminary formatting of the unstructured data. For the concrete function and processing of the formatting module 1044, see step S320.
For example: preliminary data formatting. As shown in Fig. 7, an ETL tool is used to extract the unstructured text data in mongodb, and it is stored, with the relevant formatting methods and timed tasks, into the STAGE layer of the data center warehouse built with mysql5.7. The data at this point is unstructured data that has undergone preliminary formatting.
In one example, the cleaning and coding module 1046 may be used for carrying out data cleansing and unified-coding processing on the unstructured data after the preliminary structuring, obtaining the required structured data. For the concrete function and processing of the cleaning and coding module 1046, see step S330.
In one example, data structuring. As shown in Fig. 7, adopting the thinking of natural language processing, stored procedures are written to carry out data cleansing and unified coding on the formatted (unstructured) text; in coordination with the ETL tool and crontab timed tasks, it is automatically stored into the ODS layer of the data warehouse. The data now possesses structured characteristics. To reach this target, this stage mainly uses the hidden Markov model and the co-occurrence word graph representation to process the massive unstructured data.
For example: the hidden Markov model and the co-occurrence word graph representation can be used to effectively improve the recognition efficiency of named entities; combined with the ID3 algorithm, the most valuable, most standardized structured data with the smallest per-record volume is extracted, providing convenient, flexible, accurate and manageable valid-data support for subsequent applications.
For example: the hidden Markov model. In natural language processing, it is applied to Chinese word segmentation and part-of-speech tagging, mainly for unordered unstructured text data. In part-of-speech tagging, the part-of-speech sequence is hidden before tagging and is the target to be solved, while the given word string is the sequence of observable symbols, a condition known before tagging. If the part-of-speech tagging problem is modeled as an HMM, then the set of part-of-speech tags is determined (so the number of HMM states is determined), and the words corresponding to each part of speech are determined: in the dictionary, each word has one or several determined part-of-speech labels. Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given a sequence of words (observations) W = (w1, w2, w3, w4, …, wm), find the most probable part-of-speech (state) sequence T = (t1, t2, t3, t4, …, tm), i.e. the one that maximizes the conditional probability P(T|W). P(T|W) is difficult to estimate directly at present, so a Bayes transformation is generally adopted, namely:
P(T|W) = P(T)·P(W|T) / P(W)
In part-of-speech tagging, W is given and P(W) does not depend on T, so when computing P(T|W), P(W) need not be considered; applying the joint probability formula P(A, B) = P(A)·P(B|A) at the same time gives:
P(T|W) ∝ P(T)·P(W|T)
Applying the probability multiplication formula further to the above gives:
P(T)·P(W|T) = P(w1…m, t1…m) = ∏i=1…m P(wi, ti | w1…i−1, t1…i−1) = ∏i=1…m P(wi | w1…i−1, t1…i−1) · P(ti | w1…i−1, t1…i−1)
where w1…i = w1, w2, …, wi; t1…i = t1, t2, …, ti; and 1 ≤ i ≤ m.
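A minimal sketch of maximizing P(T)·P(W|T) under the usual first-order (bigram) simplification of the product above, using Viterbi decoding over a tiny hypothetical tag set; the two tags and all probabilities below are toy assumptions, not the invention's trained model:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Find the tag sequence T maximizing P(T) * P(W|T)
    under first-order Markov assumptions."""
    # v[tag] = (best probability of any path ending in tag, that path)
    v = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        v = {
            t: max(
                ((p * trans_p[prev][t] * emit_p[t].get(w, 0.0), path + [t])
                 for prev, (p, path) in v.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(v.values(), key=lambda x: x[0])[1]

# Hypothetical toy model: two part-of-speech states, three words.
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"time": 0.6, "flies": 0.1, "fast": 0.3},
          "V": {"time": 0.1, "flies": 0.6, "fast": 0.3}}
best = viterbi(["time", "flies", "fast"], tags, start_p, trans_p, emit_p)
```

Each step keeps only the best-scoring predecessor per state, so the decoding runs in O(m·|tags|²) rather than enumerating all tag sequences.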
In the co-occurrence word graph representation, when two words appear adjacent to each other, each is called a co-occurrence word of the other. Suppose all the distinct word forms in a large corpus form a dictionary D = [W0, W1, W2, …, Wn]. Each word Wi is represented as a vector F(W) = [F0, F1, F2, …, Fn], where Fi indicates whether word Wi is a co-occurrence word of W. If any two words appear in similar contexts, they are considered semantically related. The degree of association between any two words Wi, Wj is the number of co-occurrence words they have in common, i.e. the vector product F(Wi)·F(Wj). For convenience, the graph constructed by this method is called a One-Hot co-occurrence graph. Considering only the number of shared co-occurrence words cannot effectively compare semantic relatedness, because this method does not account for the degree of co-occurrence between words. The more often two words co-occur, the more likely they form a common collocation, and the semantics of common collocations are extremely important for judging whether two words are related. For example, "time" collocates with many verbs, such as "establish", "exit", "enter" and "invest"; but "invest" has an extremely rich context and should not carry the same weight as "establish". Word usage in Chinese is very flexible and the contexts of most words are widely distributed, so a method based on counting alone produces a great deal of noise. Therefore, treating words by analogy with documents, a Tf-Idf-like method is used to compute the edge weights, and the graph constructed in this way is called a Tf-Idf co-occurrence graph. The formulas are as follows:
Let the dictionary be D = [W0, W1, W2, …, Wn], and let FC = [FC0, FC1, FC2, …, FCn] denote the number of co-occurrence words of each word.
Let TF(W) = [TF0, TF1, TF2, …, TFn], where TFi denotes the number of times each word co-occurs with W.
Let TfIdf(W) = [TfIdf0, TfIdf1, TfIdf2, …, TfIdfn], where TfIdfi denotes the degree of closeness between each word and W. From the TfIdf computing formula, TfIdfi = TFi * log((N+1)/FCi).
Thus, by processing unstructured data into structured data, a reliable form can be provided for storage and use based on big data, and the result is user-friendly.
In one embodiment, the dispensing unit 106 may be used for making use of the structured data obtained by the structuring processing. For the concrete function and processing of the dispensing unit 106, see step S130.
In one example, complete automatic parsing and management of these unstructured-data-based resources is carried out, thereby realizing the extraction of unstructured named entities and, through modeling technique, their correspondence to a relational model network of structured named entities.
In one example, deep data parsing. As shown in Fig. 7, the ID3 algorithm is introduced to refine the valuable data further and store it into the DW layer of the data warehouse. In order to refine the most concise and valuable data, the present invention selects the ID3 algorithm to carry out deep analysis and extraction of the data that now possesses structured features, as follows:
The ID3 algorithm comes from information theory: the smaller the expected information, the larger the information gain, and thus the higher the purity. The core idea of the ID3 algorithm is therefore to measure attribute selection by information gain, selecting the attribute with the maximum information gain after splitting for the split. First, several concepts to be used are defined.
Let D be the partition of the training tuples by class, where pi denotes the probability that the i-th class appears in the full set of training tuples, which can be estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information required for the class label of a tuple in D. Now suppose the training tuples D are partitioned by attribute A; then info_A(D) is the expected information of partitioning D by A, and the information gain is the difference between the two:
Gain(A) = info(D) − info_A(D).
The ID3 algorithm, at each required split, computes the information gain of each attribute and then selects the attribute with the maximum gain for the split.
Below, the example of industrial and commercial information detection in an investment institution illustrates how to construct a decision tree with the ID3 algorithm. For simplicity, assume the industrial and commercial information contains 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is real; then the information gain of each attribute is computed as:
Info(D) = −0.7·log₂0.7 − 0.3·log₂0.3 = 0.7×0.51 + 0.3×1.74 ≈ 0.879
Gain(L) = 0.879 − 0.603 = 0.276.
Data valuation. As shown in Fig. 7, in combination with the value demands of the professional field, the relatively fine-grained structured data undergoes multi-dimensional aggregation processing and is stored into the DM/EDW layer of the data warehouse.
Based on the data warehouse DM/EDW layer shown in Fig. 2, a Java tool is used to customize the API interface.
The bootstrap front-end development framework, the echarts chart library and the jquery javascript library are used to develop a reporting system that carries out multi-dimensional encapsulation and presentation of the valuable data extracted and collected above, finally realizing the purpose of providing important grounds for decision-making.
Optionally, the dispensing unit 106 may include at least one of: a division module 1062, an extraction module 1064 and an application module 1066.
In one example, the division module 1062 may be used for carrying out information-gain attribute selection on the structured data according to a predetermined gain-ratio step, splitting on the attributes whose information-gain measure exceeds the gain ratio. For the concrete function and processing of the division module 1062, see step S410.
In one example, the extraction module 1064 may be used for carrying out multi-dimensional aggregation processing on the structured data whose granularity after splitting is finer than a preset fineness, and extracting the structured data that meets a preset dimension. For the concrete function and processing of the extraction module 1064, see step S420.
Thus, through the splitting and extraction processing of structured data, the structured data becomes small data that is easy to store, transmit and use, occupies little space, and helps economize resources.
In one example, the application module 1066 may be used for carrying out at least one of multi-dimensional encapsulation and presentation operations on the structured data obtained by the extraction. For the concrete function and processing of the application module 1066, see step S430.
In one example, (1) the general hierarchical structure of the system is shown in Fig. 7.
(2) Data structure: the present invention selects the powerDesigner tool to carry out the data structure design.
(3) The hidden Markov model is selected, with stored procedures as the implementation method. For unordered unstructured text data, this model and implementation method are used to parse named entities in the two aspects of Chinese word segmentation and part-of-speech tagging.
Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given a sequence of words (observations) W = (w1, w2, w3, w4, …, wm), find the most probable part-of-speech (state) sequence T = (t1, t2, t3, t4, …, tm), i.e. the one that maximizes the conditional probability P(T|W). P(T|W) is difficult to estimate directly at present, so a Bayes transformation is generally adopted, namely:
P(T|W) = P(T)·P(W|T) / P(W)
In part-of-speech tagging, W is given and P(W) does not depend on T, so when computing P(T|W), P(W) need not be considered; applying the joint probability formula P(A, B) = P(A)·P(B|A) at the same time gives:
P(T|W) ∝ P(T)·P(W|T)
Applying the probability multiplication formula further to the above gives:
P(T)·P(W|T) = P(w1…m, t1…m) = ∏i=1…m P(wi, ti | w1…i−1, t1…i−1) = ∏i=1…m P(wi | w1…i−1, t1…i−1) · P(ti | w1…i−1, t1…i−1)
where w1…i = w1, w2, …, wi; t1…i = t1, t2, …, ti; and 1 ≤ i ≤ m.
(4) The co-occurrence word graph representation is selected; the text data in named entities is parsed through the arrangement of weights. The implementation method is as follows:
In the co-occurrence word graph representation, when two words appear adjacent to each other, each is called a co-occurrence word of the other. Suppose all the distinct word forms in a large corpus form a dictionary D = [W0, W1, W2, …, Wn]. Each word Wi is represented as a vector F(W) = [F0, F1, F2, …, Fn], where Fi indicates whether word Wi is a co-occurrence word of W. If any two words appear in similar contexts, they are considered semantically related. The degree of association between any two words Wi, Wj is the number of co-occurrence words they have in common, i.e. the vector product F(Wi)·F(Wj). For convenience, the graph constructed by this method is called a One-Hot co-occurrence graph. Considering only the number of shared co-occurrence words cannot effectively compare semantic relatedness, because this method does not account for the degree of co-occurrence between words. The more often two words co-occur, the more likely they form a common collocation, and the semantics of common collocations are extremely important for judging whether two words are related. For example, "time" collocates with many verbs, such as "establish", "exit", "enter" and "invest"; but "invest" has an extremely rich context and should not carry the same weight as "establish". Word usage in Chinese is very flexible and the contexts of most words are widely distributed, so a method based on counting alone produces a great deal of noise. Therefore, treating words by analogy with documents, a Tf-Idf-like method is used to compute the edge weights, and the graph constructed in this way is called a Tf-Idf co-occurrence graph. The formulas are as follows:
Let the dictionary be D = [W0, W1, W2, …, Wn], and let FC = [FC0, FC1, FC2, …, FCn] denote the number of co-occurrence words of each word.
Let TF(W) = [TF0, TF1, TF2, …, TFn], where TFi denotes the number of times each word co-occurs with W.
Let TfIdf(W) = [TfIdf0, TfIdf1, TfIdf2, …, TfIdfn], where TfIdfi denotes the degree of closeness between each word and W. From the TfIdf computing formula, TfIdfi = TFi * log((N+1)/FCi).
(5) The ID3 algorithm is selected to achieve the management and valuation of data, for ease of application. From information theory, the smaller the expected information, the larger the information gain, and thus the higher the purity. The core idea of the ID3 algorithm is therefore to measure attribute selection by information gain, selecting the attribute with the maximum information gain after splitting for the split. First, several concepts to be used are defined. Let D be the partition of the training tuples by class, where pi denotes the probability that the i-th class appears in the full set of training tuples, which can be estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information required for the class label of a tuple in D. Now suppose the training tuples D are partitioned by attribute A; then info_A(D) is the expected information of partitioning D by A, and the information gain is the difference between the two:
Gain(A) = info(D) − info_A(D).
The ID3 algorithm, at each required split, computes the information gain of each attribute and then selects the attribute with the maximum gain for the split.
Below, the example of industrial and commercial information detection in an investment institution illustrates how to construct a decision tree with the ID3 algorithm. For simplicity, assume the industrial and commercial information contains 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is real; then the information gain of each attribute is computed as:
Info(D) = −0.7·log₂0.7 − 0.3·log₂0.3 = 0.7×0.51 + 0.3×1.74 ≈ 0.879
Gain(L) = 0.879 − 0.603 = 0.276.
Thus, through the application of the structured data obtained after splitting and extraction, the utilization rate of big data can be promoted; the scheme is environmentally friendly and makes high use of resources.
Since the processing and functions realized by the device of this embodiment essentially correspond to the method embodiments, principles and examples shown in Figs. 1 to 4 above, for the parts not described in detail in this embodiment, reference may be made to the corresponding explanations in the previous embodiments, which are not repeated here.
A large number of experiments verify that, using the technical scheme of the present invention, through the extraction and modeling method for named entities in Internet unstructured data and its application system, big data can be structured, thereby reducing the storage space and saving storage resources.
According to an embodiment of the invention, a terminal corresponding to the processing device for big data is further provided. The terminal may include: the processing device for big data described above.
In one embodiment, the big-data processing procedure of this terminal can be the extraction and modeling method for named entities in Internet unstructured data: for Internet text data that is massive, discrete and comes in every shape, an iterative distributed extraction framework is built, and complete automatic parsing and management of these unstructured-data-based resources is carried out, thereby realizing the extraction of unstructured named entities and, through modeling technique, their correspondence to a relational model network of structured named entities, so that, finally through the application system, data previously left unused can bring enterprises huge decision value in structured form.
Web-based Entity Relation Extraction is increasingly becoming a research direction of great potential in today's high-speed development of Internet technology. Seeking the relations hidden between different named entities in a huge corpus, and storing and utilizing them in structured form, is a challenging and highly significant piece of research. It has wide applications in many fields of Natural Language Processing, such as Information Retrieval, Question Answering, Semantic Search and Textual Mining. Named entity disambiguation, such as the elimination of same-name ambiguity between entities, is an essential step in making relation extraction accurate and semantics-oriented; it makes relation extraction evolve from the previous literal level toward attention to the meaning represented by the entities themselves, so that the relations between entities are more solid and credible.
The hidden Markov model and the co-occurrence word graph representation have many applications in natural language processing; particularly for unordered, unstructured text data, using this model and method will effectively improve the recognition efficiency of named entities. Combined with the ID3 algorithm, the most valuable, most standardized structured data with the smallest per-record volume is extracted, providing convenient, flexible, accurate and manageable valid-data support for subsequent applications.
The rapid development of distributed computing theory and technology has promoted research on massive data. The most epoch-making among these theories and technologies are the MapReduce computation model and the Hadoop framework; using them, a flexible, highly scalable distributed computing framework can be built. The present invention uses the MapReduce computation model and the Hadoop framework to construct a distributed entity-relation extraction framework capable of stable operation.
In one example, data acquisition based on unstructured Internet text can make use of currently mature crawler technology. An Internet crawler obtains large amounts of required information by extensively traversing the network formed by the links within websites; mature web crawler theory can be used to construct an efficient web crawler.
The development of Chinese text processing technology has established a solid foundation for carrying out this research. For example, the hidden Markov model, the co-occurrence word graph representation and the ID3 algorithm all have relatively mature solutions in academia and industry at present; the grasp and application of these theories and technologies is the basis on which the present invention proceeds smoothly.
See the example shown in Fig. 6 and Fig. 7, in the present invention, the extracting method of unstructured data name entity, including Following steps:
Form destructuring information bank, after determining data source and scope of data, carry out the network information and crawl and non-structural Change information bank to build.Use python instrument to write crawlers, from data source, crawl name entity and relevant text Explain, and by the storage of its persistence to mongodb unstructured data storehouse, system crawlers uses python Scrapy instrument realizes, it is necessary first to be simulated logging under nonspecific pattern, and the automatic switchover that is in due course Useragent, the most when being necessary automatic switchover IP agency, for availability and the reliability of safeguards system, for one The most serious unusual condition, the system contained by the present invention possesses the function automatically sending mail notification, and all of program was run Cheng Junhui is preserved with the form of daily record.Scrapy employs storehouse, twisted asynchronous networking to process network communication, integrated stand Structure is as shown in Figure 1.
Data analysis, for the application target of data the industry field association attributes that combines specialty, makes structuring Datum target.
Data modeling, in conjunction with the unstructured data crawled and set structural data target, uses power Designer instrument creates data model.
Preliminary data formatting. As shown in Fig. 7, an ETL tool is used to extract the unstructured text data from MongoDB and, through the relevant formatting methods and timed tasks, store it into the STAGE layer of the data-center warehouse built with MySQL 5.7. At this point the data is unstructured data that has been preliminarily formatted.
Data structuring. As shown in Fig. 7, following ideas from natural language processing, the preliminarily formatted (unstructured) text is put through data cleansing and unified encoding by written stored procedures and, in cooperation with the ETL tool and crontab timed tasks, is automatically stored into the ODS layer of the data warehouse. At this point the data possesses structured characteristics. To reach this goal, this stage mainly applies the hidden Markov model and the word co-occurrence graph representation to the massive unstructured data; their roles are as follows.
Hidden Markov model. In natural language processing it is applied to Chinese word segmentation and part-of-speech tagging, mainly for unordered, unstructured text data. In part-of-speech tagging, the part-of-speech sequence is hidden at tagging time and is the target to be solved, while the given word string is the sequence of observable symbols and is known before tagging. If the part-of-speech tagging problem is modelled as an HMM, then the set of part-of-speech tags is fixed (so the number of HMM states is fixed), and the words corresponding to each part of speech are fixed: in the dictionary, each word has one or several determined part-of-speech labels. Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given the word (observation) sequence W = (w_1, w_2, w_3, w_4, ..., w_m), find the most probable part-of-speech (state) sequence T = (t_1, t_2, t_3, t_4, ..., t_m), i.e. the one that maximises the conditional probability P(T|W). P(T|W) is difficult to estimate directly, so a Bayes transformation is generally applied, namely:
P(T|W) = P(T) P(W|T) / P(W)
In part-of-speech tagging W is given and P(W) does not depend on T, so when maximising P(T|W) the term P(W) can be ignored; at the same time, applying the joint-probability formula P(A, B) = P(A) P(B|A) gives:
P(T|W) ∝ P(T) P(W|T)
Further applying the probability multiplication (chain) rule to the above formula gives:
P(T|W) ∝ P(w_{1,...,m}, t_{1,...,m}) = ∏_{i=1}^{m} P(w_i, t_i | w_{1,...,i-1}, t_{1,...,i-1}) = ∏_{i=1}^{m} P(w_i | w_{1,...,i-1}, t_{1,...,i-1}) · P(t_i | w_{1,...,i-1}, t_{1,...,i-1})
where in the above formula w_{1,...,i} = w_1, w_2, ..., w_i and t_{1,...,i} = t_1, t_2, ..., t_i, with 1 ≤ i ≤ m.
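In practice the exact chain-rule product above is simplified by the standard HMM independence assumptions, P(T, W) ≈ ∏ P(w_i | t_i) P(t_i | t_{i-1}), which the Viterbi algorithm then maximises efficiently. A minimal sketch follows; the two-tag model and all probabilities are toy numbers invented for illustration, not values from the patent.

```python
import math

# Toy illustrative model (assumed numbers): two tags, two words.
tags = ["N", "V"]
start = {"N": 0.6, "V": 0.4}                       # P(t_1)
trans = {"N": {"N": 0.3, "V": 0.7},                # P(t_i | t_{i-1})
         "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.7, "sleep": 0.3},          # P(w_i | t_i)
        "V": {"fish": 0.4, "sleep": 0.6}}

def viterbi(words):
    """Return the tag sequence T maximising prod P(w_i|t_i) * P(t_i|t_{i-1})."""
    # score[t] = best log-probability of any tag path ending in tag t
    score = {t: math.log(start[t]) + math.log(emit[t][words[0]]) for t in tags}
    back = []
    for w in words[1:]:
        prev, score, step = score, {}, {}
        for t in tags:
            b = max(prev, key=lambda p: prev[p] + math.log(trans[p][t]))
            score[t] = prev[b] + math.log(trans[b][t]) + math.log(emit[t][w])
            step[t] = b          # remember which previous tag was best
        back.append(step)
    best = max(score, key=score.get)
    path = [best]
    for step in reversed(back):  # follow the back-pointers to recover the path
        path.append(step[path[-1]])
    return list(reversed(path))
```

With a real tagger the `start`, `trans` and `emit` tables would be estimated from an annotated corpus rather than fixed by hand.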
Word co-occurrence graph representation. When two words appear adjacently, each is called a co-occurrence word of the other. Suppose that in a large corpus all the distinct word forms make up a dictionary D = [W_0, W_1, W_2, ..., W_n]. Each word W is represented as a vector F(W) = [F_0, F_1, F_2, ..., F_n], where F_i indicates whether the word W_i is a co-occurrence word of W. If any two words appear in similar contexts, they are considered semantically related, and the degree of association of any two words W_i, W_j is the number of their common co-occurrence words, i.e. the inner product F(W_i) · F(W_j). For convenience, the graph constructed by this method is called the One-Hot co-occurrence graph. Considering only the number of common co-occurrence words cannot effectively compare semantic relatedness, and this method does not take into account the degree of co-occurrence between individual words. The more times two words co-occur, the more likely they form a common collocation, and the semantics of common collocations are extremely important for judging whether two words are related. For example, many verbs collocate before "time", such as "establish", "exit", "enter" and "invest", but "invest" has an extremely rich context and should not carry the same weight as "establish". In Chinese, word usage is very flexible and the contexts of most words are widely distributed, so a method based only on counting produces a large amount of noise. Therefore, treating words by analogy with documents, a method similar to Tf-Idf is used to compute the weights of the edges, and the graph constructed in this way is called the Tf-Idf co-occurrence graph. The formulas are as follows:
Let the dictionary be D = [W_0, W_1, W_2, ..., W_n], and let FC = [FC_0, FC_1, FC_2, ..., FC_n] denote the number of co-occurrence words of each word.
Let TF(W) = [TF_0, TF_1, TF_2, ..., TF_n], where TF_i denotes the number of co-occurrences of each word with W.
Let TfIdf(W) = [TfIdf_0, TfIdf_1, TfIdf_2, ..., TfIdf_n], where TfIdf_i denotes the degree of closeness between each word and W. According to the Tf-Idf computing formula, TfIdf_i = TF_i · log((N + 1) / FC_i).
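A sketch of how the edge weights of the Tf-Idf co-occurrence graph could be computed from adjacency counts, following the formula TfIdf_i = TF_i · log((N+1)/FC_i) above. The tokenised corpus and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict

def cooccurrence_stats(sentences):
    """Count, for every word, its set of adjacent neighbours (giving FC)
    and, per ordered pair, how often the two words appear adjacently (TF)."""
    neighbours = defaultdict(set)
    pair_counts = defaultdict(int)
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            neighbours[a].add(b)
            neighbours[b].add(a)
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return neighbours, pair_counts

def tfidf_edge(w, v, neighbours, pair_counts, n_words):
    """Weight of the edge (w, v): TF * log((N + 1) / FC), where TF is the
    co-occurrence count of w with v, FC the number of co-occurrence words
    of v, and N the dictionary size."""
    tf = pair_counts[(w, v)]
    fc = len(neighbours[v])
    return tf * math.log((n_words + 1) / fc)
```

On a real corpus the sentences would come from the segmented text produced by the HMM stage, and the resulting weighted graph is the Tf-Idf co-occurrence graph described above.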
Deep data parsing. As shown in Fig. 7, the ID3 algorithm is introduced to further refine valuable data and store it into the DW layer of the data warehouse. In order to refine concise data of the greatest value, the present invention here selects the ID3 algorithm to perform deep analysis and extraction on this data that now possesses structured characteristics. The method is as follows.
ID3 algorithm. From information theory, the less information a user requires to obtain a result, the larger the information gain and thus the higher the purity. The core idea of the ID3 algorithm is therefore to measure attribute selection by information gain: the attribute with the largest information gain after splitting is chosen for the split. Several concepts to be used are first defined below.
Let D be the partition of the training tuples by class, where p_i denotes the probability that the i-th class appears in the whole training set, estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information required for the class label of a tuple in D:

info(D) = -∑_i p_i log2 p_i

Suppose now that the training tuples D are partitioned by an attribute A into subsets D_1, ..., D_v; then the expected information required after partitioning D by A is:

info_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) · info(D_j)

and the information gain is the difference of the two:

Gain(A) = info(D) - info_A(D).
The ID3 algorithm, each time a split is needed, computes the information gain of each attribute and then selects the attribute with the largest gain for the split.
Below, it is illustrated how to construct a decision tree with the ID3 algorithm, using the example of industrial and commercial information detection in an investment institution. For simplicity, assume that the industrial and commercial information contains 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is authentic; then the information gain of each attribute is computed as:
info(D) = -0.7 log2 0.7 - 0.3 log2 0.3 ≈ 0.7 × 0.51 + 0.3 × 1.74 = 0.879
Gain (L)=0.879-0.603=0.276.
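The entropy and gain computations above can be sketched as follows. The 7/3 class split mirrors the example's info(D) ≈ 0.879 (0.881 before rounding), while the attribute layout of the rows is a hypothetical stand-in for the 10-element table, which is not reproduced in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """info(D) = -sum p_i log2 p_i over the class distribution of D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = info(D) - sum_j (|D_j|/|D|) * info(D_j), splitting on
    the attribute at attr_index."""
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    total = len(labels)
    expected = sum(len(part) / total * entropy(part)
                   for part in partitions.values())
    return entropy(labels) - expected
```

ID3 then calls `info_gain` for every remaining attribute at each node and splits on the attribute with the largest value.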
Data valorisation. As shown in Fig. 7, in combination with the value demands of the professional field, the relatively fine-grained structured data is subjected to multi-dimensional aggregation processing and stored into the DM/EDW layer of the data warehouse.
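The multi-dimensional aggregation into the DM/EDW layer can be sketched as a group-by over chosen dimensions. The field names ("region", "industry", "amount") are assumptions for illustration only, since the patent does not fix a schema.

```python
from collections import defaultdict

def aggregate(rows, dims, measure):
    """Roll fine-grained structured rows up along the given dimensions,
    summing the chosen measure per dimension combination."""
    out = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        out[key] += row[measure]
    return dict(out)
```

In a warehouse this step would typically be an SQL GROUP BY feeding the DM/EDW tables; the sketch shows the same operation in memory.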
A customised API interface is developed with Java tools on the basis of the data-warehouse DM/EDW layer shown in Fig. 2.
The Bootstrap front-end development framework, the ECharts chart library and the jQuery JavaScript library are used to develop the reporting system, which performs multi-dimensional encapsulation and presentation of the valuable data extracted and gathered above, finally realising the purpose of providing important evidence for decision-making.
In an optional embodiment, the big-data processing procedure of this terminal may include:
Step 1, data acquisition.
The network information is crawled and the unstructured information base is built. A crawler is written with Python tools to crawl named entities and the related explanatory text from the data sources and persist them into a MongoDB unstructured database. The crawler of the system is implemented with the Python Scrapy toolkit. It first needs to simulate login in a non-specific mode, automatically switch the User-Agent at suitable times, and automatically switch the IP proxy when necessary. To guarantee the availability and reliability of the system, for some serious abnormal conditions the system of the present invention has the function of automatically sending a mail notification, and the running process of every program is preserved in the form of logs. Scrapy (a fast, high-level screen-scraping and web-crawling framework developed in Python, used to crawl web sites and extract structured data from pages) uses the Twisted asynchronous networking library to handle network communication; the overall architecture is shown in Fig. 1.
Scrapy mainly includes the following components:
(1) Engine (Scrapy Engine): handles the data-stream processing of the whole system and triggers transactions.
(2) Scheduler: accepts the requests sent by the engine, pushes them into a queue, and returns them when the engine requests them again.
(3) Downloader: downloads web-page content and returns it to the spider framework.
(4) Spider: defines the parsing rules for a specific domain name or web page.
(5) Item Pipeline (project pipeline): responsible for processing the items that the spiders extract from web pages; its main tasks are cleaning, validating and storing the data. After a page has been parsed by a spider, it is sent to the item pipeline, where its data is processed through several specific steps in order.
(6) Downloader Middlewares: a hook framework between the Scrapy engine and the downloader that mainly processes the requests and responses passing between them.
(7) Spider Middlewares: a hook framework between the Scrapy engine and the spiders whose main work is processing the spiders' response input and request output.
(8) Scheduler Middlewares: middleware between the Scrapy engine and the scheduler, handling the requests and responses sent from the Scrapy engine to the scheduler.
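The cleaning/validating/storing role of the Item Pipeline described in (5) can be sketched in plain Python (outside Scrapy, so the sketch stays self-contained). The field names and the list-backed store are illustrative assumptions, not the patent's actual schema.

```python
# Hypothetical pipeline step in the style of Scrapy's Item Pipeline:
# clean, then validate, then store each extracted item in order.
class CleaningPipeline:
    def __init__(self, store):
        # store could be a MongoDB collection; here any append-able object
        self.store = store

    def process_item(self, item):
        # cleaning: strip surrounding whitespace from all string fields
        cleaned = {k: v.strip() if isinstance(v, str) else v
                   for k, v in item.items()}
        # validating: a named entity must at least carry a non-empty name
        if not cleaned.get("name"):
            raise ValueError("dropped item without a name")
        # storing: persist the validated item
        self.store.append(cleaned)
        return cleaned
```

In Scrapy itself this logic would live in a class listed under `ITEM_PIPELINES`, with `process_item(self, item, spider)` as the entry point.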
Step 2, data model and algorithm.
(1) The overall hierarchical structure of the system is shown in Fig. 7.
(2) Data structure: the present invention selects the PowerDesigner tool to carry out the data-structure design.
(3) The hidden Markov model is selected, with stored procedures as the implementation method, for unordered, unstructured text data; by using this model and implementation method, named entities are parsed in the two aspects of Chinese word segmentation and part-of-speech tagging.
Under the hidden Markov model, the part-of-speech tagging problem can be expressed as: given the word (observation) sequence W = (w_1, w_2, w_3, w_4, ..., w_m), find the most probable part-of-speech (state) sequence T = (t_1, t_2, t_3, t_4, ..., t_m), i.e. the one that maximises the conditional probability P(T|W). P(T|W) is difficult to estimate directly, so a Bayes transformation is generally applied, namely:
P(T|W) = P(T) P(W|T) / P(W)
In part-of-speech tagging W is given and P(W) does not depend on T, so when maximising P(T|W) the term P(W) can be ignored; at the same time, applying the joint-probability formula P(A, B) = P(A) P(B|A) gives:
P(T|W) ∝ P(T) P(W|T)
Further applying the probability multiplication (chain) rule to the above formula gives:
P(T|W) ∝ P(w_{1,...,m}, t_{1,...,m}) = ∏_{i=1}^{m} P(w_i, t_i | w_{1,...,i-1}, t_{1,...,i-1}) = ∏_{i=1}^{m} P(w_i | w_{1,...,i-1}, t_{1,...,i-1}) · P(t_i | w_{1,...,i-1}, t_{1,...,i-1})
where in the formula w_{1,...,i} = w_1, w_2, ..., w_i and t_{1,...,i} = t_1, t_2, ..., t_i, with 1 ≤ i ≤ m.
(4) The word co-occurrence graph representation is selected, and the text data in the named entities is parsed through the setting of weights; the implementation method is as follows.
Word co-occurrence graph representation. When two words appear adjacently, each is called a co-occurrence word of the other. Suppose that in a large corpus all the distinct word forms make up a dictionary D = [W_0, W_1, W_2, ..., W_n]. Each word W is represented as a vector F(W) = [F_0, F_1, F_2, ..., F_n], where F_i indicates whether the word W_i is a co-occurrence word of W. If any two words appear in similar contexts, they are considered semantically related, and the degree of association of any two words W_i, W_j is the number of their common co-occurrence words, i.e. the inner product F(W_i) · F(W_j). For convenience, the graph constructed by this method is called the One-Hot co-occurrence graph. Considering only the number of common co-occurrence words cannot effectively compare semantic relatedness, and this method does not take into account the degree of co-occurrence between individual words. The more times two words co-occur, the more likely they form a common collocation, and the semantics of common collocations are extremely important for judging whether two words are related. For example, many verbs collocate before "time", such as "establish", "exit", "enter" and "invest", but "invest" has an extremely rich context and should not carry the same weight as "establish". In Chinese, word usage is very flexible and the contexts of most words are widely distributed, so a method based only on counting produces a large amount of noise. Therefore, treating words by analogy with documents, a method similar to Tf-Idf is used to compute the weights of the edges, and the graph constructed in this way is called the Tf-Idf co-occurrence graph. The formulas are as follows:
Let the dictionary be D = [W_0, W_1, W_2, ..., W_n], and let FC = [FC_0, FC_1, FC_2, ..., FC_n] denote the number of co-occurrence words of each word.
Let TF(W) = [TF_0, TF_1, TF_2, ..., TF_n], where TF_i denotes the number of co-occurrences of each word with W.
Let TfIdf(W) = [TfIdf_0, TfIdf_1, TfIdf_2, ..., TfIdf_n], where TfIdf_i denotes the degree of closeness between each word and W. According to the Tf-Idf computing formula, TfIdf_i = TF_i · log((N + 1) / FC_i).
(5) The ID3 algorithm is selected to realise the refinement and valorisation of the data, facilitating application. From information theory, the less information a user requires to obtain a result, the larger the information gain and thus the higher the purity. The core idea of the ID3 algorithm is therefore to measure attribute selection by information gain: the attribute with the largest information gain after splitting is chosen for the split. Several concepts to be used are first defined. Let D be the partition of the training tuples by class, where p_i denotes the probability that the i-th class appears in the whole training set, estimated as the number of elements belonging to that class divided by the total number of training tuples. The practical meaning of entropy is the average amount of information required for the class label of a tuple in D: info(D) = -∑_i p_i log2 p_i. Suppose now that the training tuples D are partitioned by an attribute A into subsets D_1, ..., D_v; then the expected information required after partitioning D by A is info_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) · info(D_j), and the information gain is the difference of the two:

Gain(A) = info(D) - info_A(D).
The ID3 algorithm, each time a split is needed, computes the information gain of each attribute and then selects the attribute with the largest gain for the split.
Below, it is illustrated how to construct a decision tree with the ID3 algorithm, using the example of industrial and commercial information detection in an investment institution. For simplicity, it can be assumed that the industrial and commercial information contains 10 elements:
where s, m and l denote small, medium and large respectively. Let L, F, H and R denote log density, friend density, whether a real avatar is used, and whether the account is authentic; then the information gain of each attribute is computed as:
info(D) = -0.7 log2 0.7 - 0.3 log2 0.3 ≈ 0.7 × 0.51 + 0.3 × 1.74 = 0.879
Gain (L)=0.879-0.603=0.276.
Since the process and functions implemented by the terminal of this embodiment essentially correspond to the embodiments, principles and examples of the device shown in the foregoing Fig. 5, for parts not detailed in the description of this embodiment, reference may be made to the related description in the previous embodiments, which is not repeated here.
It has been verified by a large number of experiments that, by adopting the technical scheme of the present invention, namely the named-entity extraction and modeling methods based on natural language processing (for example, the hidden Markov model, the word co-occurrence graph representation and the ID3 algorithm) that are applicable to unstructured Internet text data, together with the application system, and in combination with the distributed-computing technology of Hadoop (a distributed system infrastructure developed by the Apache Software Foundation), the huge and discrete data characteristics of the Internet can be better accommodated while flexible extensibility is maintained.
In summary, those skilled in the art will readily understand that, provided there is no conflict, the above advantageous manners can be freely combined and superimposed.
The foregoing is only embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope claimed by the present invention.

Claims (10)

1. A processing method for big data, characterised by comprising:
collecting big data from the Internet;
performing structuring processing on the big data;
using the structured data obtained by the structuring processing.
2. The method according to claim 1, characterised in that collecting big data from the Internet comprises:
obtaining, by web-crawler technology, data based on unstructured Internet text from the Internet;
building an unstructured information base based on the obtained data.
3. The method according to claim 1 or 2, characterised in that performing structuring processing on the big data comprises:
creating a data model according to the collected data and a preset structured-data target;
extracting, through the data model, the unstructured data in the data, and performing preliminary formatting processing on the unstructured data;
performing data cleansing and unified encoding processing on the preliminarily formatted unstructured data to obtain the required structured data.
4. The method according to any one of claims 1-3, characterised in that using the structured data obtained by the structuring processing comprises:
performing information-gain attribute selection on the structured data according to a predetermined gain ratio, and dividing by the attributes whose information-gain measure exceeds the gain ratio;
performing multi-dimensional aggregation processing on the structured data whose granularity after division is finer than a preset fineness, and extracting the structured data that satisfies preset dimensions.
5. The method according to claim 4, characterised in that using the structured data obtained by the structuring processing further comprises:
performing at least one of multi-dimensional encapsulation and presentation operations on the structured data obtained by the extraction.
6. A processing device for big data, characterised by comprising:
a collecting unit, configured to collect big data from the Internet;
a structuring unit, configured to perform structuring processing on the big data;
a using unit, configured to use the structured data obtained by the structuring processing.
7. The device according to claim 6, characterised in that the collecting unit comprises:
an acquisition module, configured to obtain, by web-crawler technology, data based on unstructured Internet text from the Internet;
a storage module, configured to build an unstructured information base based on the obtained data.
8. The device according to claim 6 or 7, characterised in that the structuring unit comprises:
a creation module, configured to create a data model according to the collected data and a preset structured-data target;
a formatting module, configured to extract, through the data model, the unstructured data in the data, and perform preliminary formatting processing on the unstructured data;
a cleansing and encoding module, configured to perform data cleansing and unified encoding processing on the preliminarily formatted unstructured data to obtain the required structured data.
9. The device according to any one of claims 6-8, characterised in that the using unit comprises at least one of a division module, an extraction module and an application module; wherein,
the division module is configured to perform information-gain attribute selection on the structured data according to a predetermined gain ratio, and divide by the attributes whose information-gain measure exceeds the gain ratio;
the extraction module is configured to perform multi-dimensional aggregation processing on the structured data whose granularity after division is finer than a preset fineness, and extract the structured data that satisfies preset dimensions;
the application module is configured to perform at least one of multi-dimensional encapsulation and presentation operations on the structured data obtained by the extraction.
10. A terminal, characterised by comprising: the processing device for big data according to any one of claims 6-9.
CN201610643603.2A 2016-08-08 2016-08-08 Processing method, device and the terminal of a kind of big data Pending CN106227885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610643603.2A CN106227885A (en) 2016-08-08 2016-08-08 Processing method, device and the terminal of a kind of big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610643603.2A CN106227885A (en) 2016-08-08 2016-08-08 Processing method, device and the terminal of a kind of big data

Publications (1)

Publication Number Publication Date
CN106227885A true CN106227885A (en) 2016-12-14

Family

ID=57548672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610643603.2A Pending CN106227885A (en) 2016-08-08 2016-08-08 Processing method, device and the terminal of a kind of big data

Country Status (1)

Country Link
CN (1) CN106227885A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226618A (en) * 2013-05-21 2013-07-31 焦点科技股份有限公司 Related word extracting method and system based on data market mining
CN104573002A (en) * 2015-01-08 2015-04-29 浪潮通信信息系统有限公司 Data organization model for filing based on human, event and object
US9069853B2 (en) * 2007-03-30 2015-06-30 Innography, Inc. System and method of goal-oriented searching
CN104820670A (en) * 2015-03-13 2015-08-05 国家电网公司 Method for acquiring and storing big data of power information
CN104881424A (en) * 2015-03-13 2015-09-02 国家电网公司 Regular expression-based acquisition, storage and analysis method of power big data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU LAN: "Application of Data Warehouse and Data Mining in Decision-Making of Adult Colleges", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107707419A (en) * 2017-03-21 2018-02-16 贵州白山云科技有限公司 A kind of method and apparatus for the internet development index for obtaining objective area
CN107707419B (en) * 2017-03-21 2018-06-08 贵州白山云科技有限公司 A kind of method and apparatus for the internet development index for obtaining objective area
CN107273461A (en) * 2017-06-02 2017-10-20 广州诚予国际市场信息研究有限公司 A kind of natural language information processing method and system
CN107908794A (en) * 2017-12-15 2018-04-13 广东工业大学 A kind of method of data mining, system, equipment and computer-readable recording medium
CN110019225A (en) * 2017-12-21 2019-07-16 中国移动通信集团重庆有限公司 Method, apparatus, equipment and the medium of data processing
CN112765442A (en) * 2018-06-25 2021-05-07 中译语通科技股份有限公司 Network emotion fluctuation index monitoring and analyzing method and system based on news big data
CN109241295A (en) * 2018-08-31 2019-01-18 北京天广汇通科技有限公司 A kind of extracting method of special entity relationship in unstructured data
CN109241295B (en) * 2018-08-31 2021-12-24 北京天广汇通科技有限公司 Method for extracting specific entity relation in unstructured data
CN109981632B (en) * 2018-12-20 2021-04-02 上海分布信息科技有限公司 Data valuization transmission method and data valuization transmission system
CN109981632A (en) * 2018-12-20 2019-07-05 上海分布信息科技有限公司 Data value transmission method and data value Transmission system
CN109947751A (en) * 2018-12-29 2019-06-28 医渡云(北京)技术有限公司 A kind of medical data processing method, device, readable medium and electronic equipment
CN109918428A (en) * 2019-01-17 2019-06-21 重庆金融资产交易所有限责任公司 Web data analytic method, device and computer readable storage medium
CN110134403A (en) * 2019-06-04 2019-08-16 厦门大学嘉庚学院 Configurable domain name mapping crawler frame and method based on asynchronous HTTP request
CN110134403B (en) * 2019-06-04 2022-08-12 厦门大学嘉庚学院 Configurable domain name resolution crawler frame and method based on asynchronous HTTP request
CN112364035A (en) * 2021-01-14 2021-02-12 零犀(北京)科技有限公司 Processing method and device for call record big data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106227885A (en) Processing method, device and the terminal of a kind of big data
CN109189942B (en) Construction method and device of patent data knowledge graph
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN108415953A (en) A kind of non-performing asset based on natural language processing technique manages knowledge management method
CN110347719A (en) A kind of enterprise's foreign trade method for prewarning risk and system based on big data
Babur et al. Hierarchical clustering of metamodels for comparative analysis and visualization
CN109344298A (en) A kind of method and device converting unstructured data to structural data
CN109840322A (en) It is a kind of based on intensified learning cloze test type reading understand analysis model and method
CN110019616A (en) A kind of POI trend of the times state acquiring method and its equipment, storage medium, server
CN110377751A (en) Courseware intelligent generation method, device, computer equipment and storage medium
Yahia et al. A new approach for evaluation of data mining techniques
CN108664512A (en) Text object sorting technique and device
CN114565053A (en) Deep heterogeneous map embedding model based on feature fusion
Zhang Application of data mining technology in digital library.
CN113946686A (en) Electric power marketing knowledge map construction method and system
Omri et al. Towards an efficient big data indexing approach under an uncertain environment
CN107944723A (en) A kind of " Tujia " picture weaving in silk cultural resource classification annotation method and system based on body
Zamil et al. The application of semantic-based classification on big data
CN109002561A (en) Automatic document classification method, system and medium based on sample keyword learning
CN113505207A (en) Machine reading understanding method and system for financial public opinion research and report
CN111242519B (en) User characteristic data generation method and device and electronic equipment
US20140067874A1 (en) Performing predictive analysis
CN113111136A (en) Entity disambiguation method and device based on UCL knowledge space
KR20230059364A (en) Public opinion poll system using language model and method thereof
CN107562909A (en) A kind of big data analysis system and its analysis method for merging search and calculating

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161214