CN103605724A - Webpage-text semantic feature based on-line retail sales computation method - Google Patents

Webpage-text semantic feature based on-line retail sales computation method

Publication number
CN103605724A
Authority
CN
China
Prior art keywords
web page
field
feature
order
numerical value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310575302.7A
Other languages
Chinese (zh)
Inventor
柴跃廷
孙骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201310575302.7A
Publication of CN103605724A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a method for computing online retail sales based on semantic features of web page text. The method includes the following steps: performing stratified sampling on the netizen population to obtain a sample; monitoring the Internet behavior of sample members in real time, discovering the online shopping orders they place on the basis of web page semantic features, and capturing the order amounts from those orders on the basis of web page semantic features; and aggregating and analyzing the sample's online shopping information in real time to obtain the online retail sales, where the sample's online shopping information includes the orders and the order amounts. The method offers strong real-time performance, high accuracy, and high authenticity.

Description

Method for computing online retail sales based on web page text semantic features
Technical field
The present invention relates to the field of Internet big data technology, and in particular to a method for computing online retail sales based on semantic features of web page text.
Background technology
The Internet is accessible around the clock, virtual, and open, and e-commerce has developed rapidly as an emerging business model. To manage it scientifically and effectively, the total online retail sales over a given period usually need to be measured. Existing methods for computing online retail sales fall broadly into three categories, outlined below.
1. Settlement center method
This method requires the target market to set up one or more settlement centers that record every transaction. Because the settlement center records each transaction, the turnover obtained this way is the most accurate and is also available in real time. The stock market works this way. In the consumer-facing online retail market, every enterprise has its own settlement center, namely its order processing system, but for various reasons enterprises do not disclose their turnover faithfully, and the published figures are often exaggerated.
2. Indirect statistics method
The idea of this method is to use auxiliary information to infer transactions indirectly. The auxiliary information generally comes from steps that play an important role in the transaction flow. For example, fulfilling most orders depends on logistics, so to learn an enterprise's turnover over a period, one can obtain the number of parcels the enterprise shipped in that period and multiply it by the average order value to get a rough estimate. Similarly, one can total the funds that flowed into the enterprise during that period through financial channels such as the major banks, third-party payment platforms, and postal remittance. The drawback is that the information source is inaccurate or even unobtainable: neither logistics data nor cash-flow data is easy to acquire. This method can therefore provide only a reference value for turnover.
3. Sampling statistics method
This method is based on survey theory. The target market is first divided into groups, a number of samples are drawn within each group, each sample is surveyed, and the results are finally aggregated to extrapolate the overall indicator. The method rests on a solid theoretical foundation and is currently the most widely used way of measuring turnover. The annual survey of e-commerce transaction volume led by the U.S. Census Bureau has repeatedly used this method: all enterprises are stratified into groups such as manufacturing, wholesale, retail, and services, and each group is subdivided further; wholesale, for example, is divided into electrical goods, pharmaceuticals, industrial parts, and so on. Mathematical statistics guarantees that, if the sampling process meets certain conditions, the result is credible. However, the sample data come from self-reports by the enterprises that receive the questionnaire, so their objectivity is hard to guarantee. The method also requires the implementer to know the structure of the target market and to have strong enforcement power and substantial manpower and resources. It is therefore suitable only for annual government-led market surveys and is hard for a single organization to carry out. In addition, it provides only market-level data, lacks fine-grained investigation at the enterprise level, and its statistics lag to some degree.
In essence, all three methods carry over the thinking of traditional markets and do not make full use of the characteristics of e-commerce as an online mode of transaction.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the prior art.
To this end, the object of the invention is to propose a method for computing online retail sales based on web page text semantic features.
To achieve this object, the method for computing online retail sales based on web page text semantic features according to an embodiment of the invention comprises: performing stratified sampling on the netizen population to obtain a sample; monitoring the Internet behavior of sample members in real time, discovering the orders the sample members place when shopping online on the basis of web page semantic features, and capturing the order amount from each order on the basis of web page semantic features; and aggregating and statistically analyzing the sample's online shopping information to obtain the online retail sales, where the sample's online shopping information includes the orders and the order amounts.
Compared with the prior art, the method according to the embodiment of the invention has the following advantages. Order information is captured and analyzed in real time, so the statistics are real-time. Once a sample member has installed the client software on his or her regular computer, the collection and aggregation of information are carried out entirely automatically by computer and Internet technology, so the statistical process is convenient. Monitoring of sample members' order amounts is performed by the client's internal algorithm, which has been derived theoretically and shown by practical testing to be effective and accurate, and the statistical workflow removes human interference, so the data source is objective and the data are accurate.
In addition, the method for computing online retail sales based on web page text semantic features according to the embodiment of the invention may have the following additional technical features:
In one embodiment of the invention, discovering the orders the sample members place when shopping online on the basis of web page semantic features specifically comprises the following steps: obtaining the source code of the current web page; filtering out the Chinese text in the page source; detecting whether the Chinese text of the page contains the web page text features, to obtain a web page feature vector; computing a web page feature value from the web page feature vector; and, if the web page feature value is greater than a web page feature threshold, classifying the page as an order page, and otherwise as a non-order page.
In one embodiment of the invention, the web page feature value is computed from the web page feature vector by the formula:
$$ p_1 = \frac{e^{\sum_{i=0}^{n} \theta_i x_i}}{1 + e^{\sum_{i=0}^{n} \theta_i x_i}} $$
where $n$ is the number of extracted web page text features, $x_i$ is the $i$-th component of the web page feature vector $X$, $\theta_i$ is the $i$-th component of the first parameter vector $\theta$, and $p_1$ is the web page feature value; the first parameter vector $\theta$ is known.
In one embodiment of the invention, the first parameter vector θ is obtained by logistic regression.
In one embodiment of the invention, capturing the order amount from the order specifically comprises the following steps: obtaining the source code of the order page; filtering out all fields that match a predetermined structure; checking each field in turn for the field text features, to obtain a field feature vector; computing a field feature value for each field from its field feature vector, selecting the field with the largest field feature value, and, if that value is greater than a field feature threshold, identifying the field as the one containing the order amount; and extracting the number from that field as the final order amount.
In one embodiment of the invention, the field feature value is computed from the field feature vector by the formula:
$$ p_2 = \frac{e^{\sum_{i=0}^{m} \alpha_i y_i}}{1 + e^{\sum_{i=0}^{m} \alpha_i y_i}} $$
where $m$ is the number of extracted field text features, $y_i$ is the $i$-th component of the field feature vector $Y$, $\alpha_i$ is the $i$-th component of the second parameter vector $\alpha$, and $p_2$ is the field feature value; the second parameter vector $\alpha$ is known.
In one embodiment of the invention, the second parameter vector α is obtained by logistic regression.
In one embodiment of the invention, the predetermined structure is one that contains Chinese characters, contains digits, and contains at least one of "¥" and "元" (yuan).
Additional aspects and advantages of the invention are given in part in the description below; in part they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments taken together with the drawings, in which:
Fig. 1 is a flowchart of the method for computing online retail sales based on web page text semantic features according to an embodiment of the invention.
Fig. 2 is a flowchart of the process of discovering the orders placed by sample members when shopping online, according to an embodiment of the invention.
Fig. 3 is a flowchart of the process of capturing the order amount from an order, according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and must not be construed as limiting it.
E-commerce is a mode of transaction that spans all locations, runs around the clock, and is efficient and convenient. Because it relies on the Internet, its transaction information is open and obtainable, and technical means of acquisition can remove subjective factors to the greatest possible extent. A method for computing online retail sales should therefore also be real-time, convenient, objective, and accurate.
As shown in Fig. 1, the method of the invention for computing online retail sales based on web page semantic features comprises the following steps:
A. Perform stratified sampling on the netizen population to obtain a sample. For example, the sampling may be stratified by regional distribution.
B. Monitor the Internet behavior of sample members in real time, discover the orders the sample members place when shopping online on the basis of web page semantic features, and capture the order amount from each order on the basis of web page semantic features.
C. Aggregate and statistically analyze the sample's online shopping information to obtain the online retail sales, where the sample's online shopping information includes the orders and the order amounts.
Compared with the prior art, the invention has the following beneficial effects. Order information is captured and analyzed in real time, so the statistics are real-time. Once a sample member has installed the client software on his or her regular computer, the collection and aggregation of information are carried out entirely automatically by computer and Internet technology, so the statistical process is convenient. Monitoring of sample members' order amounts is performed by the client's internal algorithm, which has been derived theoretically and shown by practical testing to be effective and accurate, and the statistical workflow removes human interference, so the data source is objective and the data are accurate.
In one embodiment of the invention, as shown in Fig. 2, discovering the orders the sample members place when shopping online on the basis of web page semantic features specifically comprises the following steps: S11. obtain the source code of the current web page; S12. filter out the Chinese text in the page source; S13. detect whether the Chinese text of the page contains the web page text features, to obtain a web page feature vector; S14. compute a web page feature value from the web page feature vector; S15. if the web page feature value is greater than the web page feature threshold, the page is an order page; otherwise it is a non-order page.
Optionally, the web page feature value is computed from the web page feature vector by the formula:
$$ p_1 = \frac{e^{\sum_{i=0}^{n} \theta_i x_i}}{1 + e^{\sum_{i=0}^{n} \theta_i x_i}} $$
where $n$ is the number of extracted web page text features, $x_i$ is the $i$-th component of the web page feature vector $X$, $\theta_i$ is the $i$-th component of the first parameter vector $\theta$, and $p_1$ is the web page feature value; the first parameter vector $\theta$ is known. Preferably, the first parameter vector $\theta$ is obtained by logistic regression.
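The formula above is the standard logistic function applied to a weighted sum of binary features. A minimal Python sketch follows; the function names `feature_score` and `is_order_page` are illustrative and do not come from the patent, and the default threshold of 0.5 is the value the detailed embodiment uses later for order page detection.

```python
import math

def feature_score(params, features):
    """Return p = e^z / (1 + e^z) with z = params[0] + sum(params[i] * features[i-1]).

    params holds the intercept (the coefficient of x_0 = 1) followed by one
    coefficient per feature; features holds the binary components x_1..x_n.
    """
    if len(params) != len(features) + 1:
        raise ValueError("params needs one more entry than features (the intercept)")
    z = params[0] + sum(t * x for t, x in zip(params[1:], features))
    # Evaluate the logistic function without overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def is_order_page(params, features, threshold=0.5):
    """Classify a page as an order page when its score exceeds the threshold."""
    return feature_score(params, features) > threshold
```

The same function can serve for the field feature value, since the second formula has the same form with α and Y in place of θ and X.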
In one embodiment of the invention, as shown in Fig. 3, capturing the order amount from the order specifically comprises the following steps: S21. obtain the source code of the order page; S22. filter out all fields that match the predetermined structure; S23. check each field in turn for the field text features, to obtain a field feature vector; S24. compute a field feature value for each field from its field feature vector, select the field with the largest field feature value, and, if that value is greater than the field feature threshold, identify the field as the one containing the order amount; S25. extract the number from that field as the final order amount. Note that the predetermined structure is one that contains Chinese characters, contains digits, and contains at least one of "¥" and "元" (yuan).
Optionally, the field feature value is computed from the field feature vector by the formula:
$$ p_2 = \frac{e^{\sum_{i=0}^{m} \alpha_i y_i}}{1 + e^{\sum_{i=0}^{m} \alpha_i y_i}} $$
where $m$ is the number of extracted field text features, $y_i$ is the $i$-th component of the field feature vector $Y$, $\alpha_i$ is the $i$-th component of the second parameter vector $\alpha$, and $p_2$ is the field feature value; the second parameter vector $\alpha$ is known. Preferably, the second parameter vector $\alpha$ is obtained by logistic regression.
To help those skilled in the art better understand the invention, a specific embodiment is set out in detail below. The embodiment serves only to describe and explain the invention and does not limit the method or technical solution of the invention.
First, the netizen population is stratified by regional distribution and sampled to obtain a sample.
The 31st Statistical Report on Internet Development in China, issued by CNNIC, lists the survey results for netizen numbers and Internet penetration in the 31 provinces (municipalities, autonomous regions) of mainland China for 2011-2012. After processing, the share of the national netizen total accounted for by each province (municipality, autonomous region) is shown in Table 1. Suppose the required total sample size is N. The subsample size required in each province (municipality, autonomous region) is then αN, where α is the ratio of that province's number of netizens to the national total, and each subsample is drawn by simple random sampling from the telephone directory.
Province          Ratio (%)    Province          Ratio (%)    Province    Ratio (%)
Beijing           2.59         Qinghai           0.42         Hunan       3.9
Shanghai          2.85         Hebei             5.33         Tibet       0.18
Guangdong         11.75        Shaanxi           2.75         Sichuan     4.54
Fujian            4.04         Chongqing         2.12         Anhui       3.31
Zhejiang          5.71         Ningxia           0.46         Gansu       1.41
Tianjin           1.40         Shandong          6.85         Henan       5.06
Liaoning          3.9          Hubei             4.09         Guizhou     1.76
Jiangsu           7.01         Inner Mongolia    1.71         Yunnan      2.34
Shanxi            2.81         Jilin             1.88         Jiangxi     2.24
Hainan            0.68         Heilongjiang      2.35
Xinjiang          1.71         Guangxi           2.81
Table 1
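The proportional allocation αN described above can be sketched in a few lines of Python. The function name `allocate_subsamples` and the total sample size of 10,000 are assumptions made for illustration; the three ratios are taken from Table 1, and rounding to whole people is a choice the text does not specify.

```python
def allocate_subsamples(total_n, ratios_percent):
    """Return the subsample size alpha * N for each province.

    ratios_percent maps a province name to its share of national netizens in
    percent, so alpha = ratio / 100. Sizes are rounded to whole people.
    """
    return {province: round(total_n * ratio / 100.0)
            for province, ratio in ratios_percent.items()}

# Three rows of Table 1; the total sample size N = 10000 is hypothetical.
sizes = allocate_subsamples(10000, {"Beijing": 2.59, "Guangdong": 11.75, "Tibet": 0.18})
```

Because the published ratios are rounded to two decimals, the subsample sizes over all 31 provinces may not add up to exactly N; a real deployment would have to decide how to distribute the remainder.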
Next, the sample members download and install client software that monitors their Internet behavior. The client software runs on each sample member's regular computer and monitors the orders placed, and the amounts paid, when the sample member shops online. Because it is the source of the data, the algorithm running inside the client software is the foundation and core of the invention. It must be stressed that the client software should be limited to capturing the sample's online shopping information and must not touch any other private information of the sample members.
Then, the orders the sample members place when shopping online are discovered and the order amounts are extracted.
Specifically, the client software starts automatically at boot and runs in the background, monitoring the web pages the sample member browses. Its internal algorithm divides all web pages into two classes, order pages and non-order pages, and for an order page it goes on to extract the order amount. The client software is distributed across, and runs on, each sample member's regular networked computer. When it obtains valid data, it sends the data over the Internet to a central server, which files the data under the corresponding sample ID.
Suppose that within time t, in a sample of total size N, n sample members placed orders with a total amount of m, and that the total number of netizens is S. The online retail sales within time t are then Sm/N.
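The extrapolation Sm/N scales the sample's total order amount by the ratio of the population to the sample size. A minimal Python sketch, with an illustrative function name and hypothetical input figures:

```python
def estimate_online_retail_sales(sample_total_amount, sample_size, netizen_population):
    """Extrapolate the sample's total order amount m to the population: S * m / N."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return netizen_population * sample_total_amount / sample_size

# Hypothetical figures: a sample of N = 1000 members spent m = 50000 in total,
# and the netizen population is S = 2,000,000.
estimate = estimate_online_retail_sales(50000.0, 1000, 2_000_000)
```

Note that n, the number of members who placed orders, does not enter the formula; only the total amount m and the sample size N do.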
The algorithm the client software uses internally to judge whether a web page is an order page proceeds as follows:
(1) Obtain the page source.
(2) Filter out the Chinese text in the page source.
(3) Detect whether the Chinese text contains the specified features, to obtain a vector that characterizes the page.
The filtered Chinese text is checked against the feature dimensions shown in Table 2, 30 features in total, each scored 1 if present and 0 if absent, giving the feature vector X. Feature 1 concerns whether certain words appear in the URL, features 2 to 26 concern whether a single Chinese word appears in the text, and features 27 to 30 concern whether several Chinese words appear in the text together.
[Table 2 lists the 30 web page feature dimensions; it is provided as an image in the original publication.]
Table 2
(4) Substitute the feature vector into the formula to obtain the result.
The feature vector X obtained above is substituted into the following formula to give the result:
$$ p_1 = \frac{e^{\sum_{i=0}^{n} \theta_i x_i}}{1 + e^{\sum_{i=0}^{n} \theta_i x_i}} $$
where $n = 30$, $\theta_i$ are the parameters, $x_i$ ($1 \le i \le 30$) are the binary components of the feature vector $X$, and $x_0 = 1$. The parameter vector is θ = [122.2753, 48.556, -12.9599, 14.4019, 17.2201, 5.1251, 10.3428, -12.9599, 14.4019, 17.2201, -23.4849, 16.4600, 10.4093, 1.6334, 3.4574, -15.2296, 10.1750, 5.1112, 13.0875, 10.3107, 13.3298, 21.1179, 8.8934, -6.1415, -0.4499, 13.5299, 3.1086, 12.5171, 4.1025, 7.4798, 19.8542].
(5) If the result is greater than the threshold, the target page is an order page; otherwise it is a non-order page.
If the p_1 computed in step (4) is greater than 0.5, the current page is judged to be an order page; otherwise it is not an order page.
For example, for the order page of a certain e-commerce company's shopping site, the URL is "http://shopping.**.com/order/orderpayment", and the Chinese text obtained by filtering the page source is as follows:
" Sign My Cart modify consignee consignee consignee information information information modification payment method and shipping method to modify payment modified cash payment delivery delivery cash machine card to enjoy free single PayPal payment opportunity to view online banking list View online payment orders submitted cashback tips payment by credit card payment will be refunded the amount actually paid to your personal account cashback days after the effective date of the time of receipt of the order , please select the credit card spending money order workdays and holidays write the following name and contact information for the selection of the name of the payee submit orders after the order timely processing , please be sure to fill the order number and contact telephone distribution methods courier delivery product Information Product information product information return modifications to cart modify the product product name Brand Name color Size Quantity Price select a gift card or promotional code promotional code coupon code discount merchandise total amount of discount promotions Shipment account to pay the amount due to participate in the actual amount of reusable shopping activities using electronic certificates parcel delivery shopping list . "
According to the feature dimensions of Table 2, the feature vector of this page is X = [0,1,1,0,1,0,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1]. Substituting it into the formula gives p_1 = 0.9999 > 0.5, so the page is judged to be an order page.
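The worked example can be checked numerically with a short Python sketch. The parameter vector and the feature vector are copied as printed in the text; the function name `page_score` is illustrative.

```python
import math

# Parameter vector theta_0 .. theta_30, copied as printed in the text.
THETA = [122.2753, 48.556, -12.9599, 14.4019, 17.2201, 5.1251, 10.3428,
         -12.9599, 14.4019, 17.2201, -23.4849, 16.4600, 10.4093, 1.6334,
         3.4574, -15.2296, 10.1750, 5.1112, 13.0875, 10.3107, 13.3298,
         21.1179, 8.8934, -6.1415, -0.4499, 13.5299, 3.1086, 12.5171,
         4.1025, 7.4798, 19.8542]

# Feature vector x_1 .. x_30 of the example page, copied from the text.
X = [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
     1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

def page_score(theta, x):
    """Logistic score with x_0 = 1, so theta[0] acts as the intercept."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

p1 = page_score(THETA, X)
is_order_page = p1 > 0.5  # the decision rule of step (5)
```

With the values as printed, the score exceeds 0.5 and the page is classified as an order page, in agreement with the text. As an observation only, and not a correction confirmed by the source: the printed intercept of 122.2753 makes the score saturate at essentially 1, whereas an intercept of -122.2753 would reproduce the quoted 0.9999 to four decimals, so the sign of the first parameter may have been lost in extraction.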
The algorithm the client software uses internally to obtain the order amount from an order page proceeds as follows:
(1) Obtain the page source.
(2) Filter out the Chinese characters, digits, decimal points, and the RMB symbol ¥ in the page source. Specifically, filter out fields of the form "2 to 10 Chinese characters + 0 or 1 RMB symbol + number + 0 or 1 Chinese character '元'".
(3) Check each field in turn for the specified features, to obtain a vector that characterizes the field.
Each filtered field is checked against the feature dimensions shown in Table 3, 10 features in total, each scored 1 if present and 0 if absent, giving the feature vector Y. Feature 1 concerns the positions immediately before and after the number, features 2 to 7 concern whether a single Chinese word appears in the field, and features 8 to 10 concern whether several Chinese words appear in the field together.
[Table 3 lists the 10 field feature dimensions; it is provided as an image in the original publication.]
Table 3
(4) Substitute the feature vector Y of each field into the formula in turn; the field with the largest result is taken as the field containing the order amount. The formula is:
$$ p_2 = \frac{e^{\sum_{i=0}^{m} \alpha_i y_i}}{1 + e^{\sum_{i=0}^{m} \alpha_i y_i}} $$
where $m = 10$, $y_i$ ($1 \le i \le 10$) are the binary components of the feature vector $Y$, $y_0 = 1$, and the parameter vector is α = [85.984, 37.549, 64.451, 46.218, 64.996, 7.438, 47.192, 16.027, 5.099, 48.150, 54.351].
(5) Extract the number from the field containing the order amount as the final order amount.
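The five steps above can be sketched end to end in Python. The regular expression encodes the field structure stated in step (2); the keyword list, the weights in `ALPHA`, and the 0.5 field threshold are hypothetical, because Table 3 is not reproduced in the text and no numeric field threshold is given. The names `field_score` and `extract_order_amount` are illustrative.

```python
import math
import re

# Step (2): 2 to 10 Chinese characters, an optional RMB symbol, a number,
# and an optional trailing "元".
FIELD_PATTERN = re.compile(r"[\u4e00-\u9fff]{2,10}[¥￥]?\d+(?:\.\d+)?元?")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

# Hypothetical features and weights standing in for Table 3 and the vector alpha.
FIELD_KEYWORDS = ("总金额", "应付", "合计")
ALPHA = [-2.0, 3.0, 3.0, 2.0]  # intercept followed by one weight per keyword

def field_score(field):
    """Steps (3) and (4): binary feature vector, then the logistic score p_2."""
    y = [int(word in field) for word in FIELD_KEYWORDS]
    z = ALPHA[0] + sum(a * v for a, v in zip(ALPHA[1:], y))
    return 1.0 / (1.0 + math.exp(-z))

def extract_order_amount(text, threshold=0.5):
    """Return the amount from the best-scoring field, or None if none qualifies."""
    fields = FIELD_PATTERN.findall(text)
    if not fields:
        return None
    best = max(fields, key=field_score)
    if field_score(best) <= threshold:
        return None
    # Step (5): pull the number out of the winning field.
    return float(NUMBER_PATTERN.search(best).group())

amount = extract_order_amount("商品数量2 运费10.00元 商品总金额¥29.00元")
```

In this sketch only the third field contains a keyword, so it scores highest and its number is returned; fields such as quantity or shipping fee match the structure but fall below the threshold.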
For the order page of a certain e-commerce shopping site, fields are extracted from the Chinese text filtered from the page source, and the results computed with the algorithm are shown in Table 4. The field with the largest result is "the preferential code of sales promotion commodity total charge 29.00 is excellent"; it is taken as the field containing the order amount, and the extracted order amount is 29.00 yuan.
[Table 4 lists the extracted fields and their computed results; it is provided as an image in the original publication.]
Table 4
Finally, the sample members' shopping information is aggregated and statistically analyzed.
Specifically, the client software sends each sample member's shopping information to the central server in real time, and the central server uses these data to infer the online retail sales for a given period.
As can be seen from the above, in this embodiment the statistical information comes from client monitoring software installed on the sample members' regular networked computers. When a sample member browses an e-commerce site, the client software monitors in the background the orders placed and their amounts and uploads the valid information, so the statistics are real-time. Once a sample member has installed the client software on his or her regular computer, the collection and aggregation of information are carried out entirely automatically by computer and Internet technology, so the statistical process is convenient. Monitoring of sample members' order amounts is performed by the client's internal algorithm, which has been derived theoretically and shown by practical testing to be effective and accurate, and the statistical workflow removes human interference, so the data source is objective and the data are accurate.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing a specific logical function or a step of the process. The scope of the preferred embodiments of the invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them).
For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
Those skilled in the art will understand that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In this specification, a description that refers to "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification.
Although embodiments of the invention have been shown and described above, it should be understood that they are exemplary and must not be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention.

Claims (8)

1. A method for computing online retail sales based on semantic features of web page text, characterized by comprising the following steps: performing stratified sampling of the overall population of Internet users to obtain a sample;
monitoring the sample members' Internet behavior in real time, detecting, based on web page semantic features, the orders placed by the sample members through online shopping, and capturing the order amount from each order based on web page semantic features;
aggregating and computing statistics over the sample's online shopping information to obtain the online retail sales, wherein the sample's online shopping information comprises the orders and the order amounts.
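As an illustration of the final aggregation step in claim 1, the sketch below scales each stratum's mean captured spend per sample member up to that stratum's population and sums the results. The claim does not say which estimator is used; the standard stratified expansion estimator, the `Stratum` type, and the `population` figures are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stratum:
    # Number of Internet users in this stratum (assumed to be known).
    population: int
    # One list of captured order amounts per sample member in the stratum.
    order_amounts: List[List[float]]


def estimate_online_retail_sales(strata: List[Stratum]) -> float:
    """Estimate total online retail sales from a stratified sample.

    Assumed estimator: for each stratum, population * mean spend per
    sample member, summed over all strata.
    """
    total = 0.0
    for stratum in strata:
        sample_size = len(stratum.order_amounts)
        if sample_size == 0:
            continue  # no sample members, so no estimate for this stratum
        sample_spend = sum(sum(member) for member in stratum.order_amounts)
        total += stratum.population * sample_spend / sample_size
    return total


strata = [
    Stratum(population=1000, order_amounts=[[10.0, 20.0], []]),
    Stratum(population=500, order_amounts=[[4.0]]),
]
estimate = estimate_online_retail_sales(strata)
```

Sample members with no orders still count toward the stratum's sample size, so they pull the per-member mean down as they should.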
2. The method according to claim 1, characterized in that detecting, based on web page semantic features, the orders placed by the sample members through online shopping specifically comprises the following steps:
obtaining the source code of the current web page;
extracting the Chinese text from the web page source code;
detecting whether the Chinese text of the web page contains each web page text feature, to obtain a web page feature vector;
computing a web page feature value from the web page feature vector;
if the web page feature value is greater than a web page feature value threshold, the web page is an order page; otherwise it is a non-order page.
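The steps of claim 2 can be sketched as follows, using the logistic form given in claim 3 to turn the feature vector into a feature value. The feature phrases, the parameter values in `THETA`, the threshold, and the constant component x_0 = 1 (used here as an intercept, since the sum in claim 3 starts at index 0) are hypothetical choices for illustration and are not taken from the patent.

```python
import math
import re

# Hypothetical web page text features and parameters (illustration only).
PAGE_FEATURES = ["订单", "支付", "收货地址"]
THETA = [-2.0, 1.5, 1.0, 1.2]  # THETA[0] pairs with the assumed x_0 = 1
PAGE_THRESHOLD = 0.5


def chinese_text(html: str) -> str:
    """Keep only runs of CJK ideographs; joined by spaces so that
    separate runs cannot merge into a spurious feature phrase."""
    return " ".join(re.findall(r"[\u4e00-\u9fff]+", html))


def feature_vector(text: str, features) -> list:
    """x_0 = 1 (assumed intercept), then 1/0 per feature present/absent."""
    return [1.0] + [1.0 if f in text else 0.0 for f in features]


def logistic_value(params, x) -> float:
    """e^z / (1 + e^z) with z = sum(params_i * x_i), written as 1 / (1 + e^-z)."""
    z = sum(p * xi for p, xi in zip(params, x))
    return 1.0 / (1.0 + math.exp(-z))


def is_order_page(html: str) -> bool:
    x = feature_vector(chinese_text(html), PAGE_FEATURES)
    return logistic_value(THETA, x) > PAGE_THRESHOLD


order_like = is_order_page("<div>您的订单已提交，请尽快支付</div>")
other_page = is_order_page("<p>欢迎光临</p>")
```

With these made-up parameters, a page mentioning both "订单" and "支付" scores above the threshold, while a page with none of the features does not.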
3. The method according to claim 2, characterized in that the formula for computing the web page feature value from the web page feature vector is:
$$P_1 = \frac{e^{\sum_{i=0}^{n} \theta_i x_i}}{1 + e^{\sum_{i=0}^{n} \theta_i x_i}}$$
where $n$ is the number of extracted web page text features, $x_i$ is the $i$-th component of the web page feature vector $X$, $\theta_i$ is the $i$-th component of a first parameter vector $\theta$, and $P_1$ is the web page feature value, the first parameter vector $\theta$ being known.
4. The method according to claim 3, characterized in that the first parameter vector θ is obtained by logistic regression.
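A minimal sketch of how a parameter vector such as θ could be obtained by logistic regression from labelled pages. The claims do not describe the training procedure; batch gradient ascent on the log-likelihood, the learning rate, the epoch count, and the toy training set are all assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression parameters by batch gradient ascent.

    samples: feature vectors whose first component is the constant 1
             (assumed intercept term).
    labels:  1 for an order page, 0 for a non-order page.
    """
    n = len(samples[0])
    theta = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for x, y in zip(samples, labels):
            err = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for i in range(n):
                grad[i] += err * x[i]
        # Step along the average gradient of the log-likelihood.
        theta = [t + lr * g / len(samples) for t, g in zip(theta, grad)]
    return theta


# Toy data: the second component decides the label, the third is noise.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1]]
y = [1, 1, 0, 0]
theta = fit_logistic(X, y)
p_order = sigmoid(sum(t * xi for t, xi in zip(theta, [1, 1, 0])))
p_other = sigmoid(sum(t * xi for t, xi in zip(theta, [1, 0, 0])))
```

The same routine would apply unchanged to the second parameter vector α of claim 7, with field feature vectors as samples.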
5. The method according to claim 1, characterized in that capturing the order amount from the order specifically comprises the following steps:
obtaining the web page source code of the order page;
filtering out all fields that match a predetermined structure;
detecting, for each field in turn, whether it contains each field text feature, to obtain a field feature vector;
computing, for each field, a field feature value from its field feature vector, and selecting the field with the largest field feature value among all fields; if the field feature value of that field is greater than a field feature threshold, that field is identified as the field containing the order amount;
extracting the number from the field containing the order amount as the final order amount.
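The selection and extraction steps of claim 5 can be sketched as below, taking as input a list of candidate fields that already match the predetermined structure. The field text features, the values in `ALPHA`, the threshold, the constant component y_0 = 1, and the rule of taking the first number in the winning field are hypothetical; the field feature value uses the logistic form of claim 6.

```python
import math
import re

# Hypothetical field text features and parameters (illustration only).
FIELD_FEATURES = ["合计", "应付", "总额"]
ALPHA = [-1.0, 1.5, 2.0, 1.5]  # ALPHA[0] pairs with the assumed y_0 = 1
FIELD_THRESHOLD = 0.5


def field_value(field: str) -> float:
    """Logistic field feature value computed from a 1/0 field feature vector."""
    y = [1.0] + [1.0 if f in field else 0.0 for f in FIELD_FEATURES]
    z = sum(a * yi for a, yi in zip(ALPHA, y))
    return 1.0 / (1.0 + math.exp(-z))


def extract_order_amount(fields):
    """Return the amount from the highest-scoring field, or None if no
    field scores above the threshold or the field holds no number."""
    if not fields:
        return None
    best = max(fields, key=field_value)
    if field_value(best) <= FIELD_THRESHOLD:
        return None
    # Assumption: the first number in the field is the amount;
    # thousands separators are stripped before matching.
    match = re.search(r"\d+(?:\.\d+)?", best.replace(",", ""))
    return float(match.group()) if match else None


fields = ["运费：10元", "应付总额：128.50元", "商品单价：59元"]
amount = extract_order_amount(fields)
```

Only the single best field is compared against the threshold, which mirrors the claim: a page whose best field still scores low yields no order amount at all.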
6. The method according to claim 5, characterized in that the formula for computing the field feature value from the field feature vector is:
$$P_2 = \frac{e^{\sum_{i=0}^{m} \alpha_i y_i}}{1 + e^{\sum_{i=0}^{m} \alpha_i y_i}}$$
where $m$ is the number of extracted field text features, $y_i$ is the $i$-th component of the field feature vector $Y$, $\alpha_i$ is the $i$-th component of a second parameter vector $\alpha$, and $P_2$ is the field feature value, the second parameter vector $\alpha$ being known.
7. The method according to claim 6, characterized in that the second parameter vector α is obtained by logistic regression.
8. The method according to claims 5-8, characterized in that the predetermined structure is a structure that contains Chinese characters, contains digits, and contains at least one of "$" or "元" (yuan).
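One possible reading of the predetermined structure in claim 8 is a conjunction of three checks, sketched below. Reading the machine-translated "unit" as the character "元", splitting the page into fields at HTML tags, and the function names are assumptions for illustration.

```python
import re

HAS_CHINESE = re.compile(r"[\u4e00-\u9fff]")
HAS_DIGIT = re.compile(r"\d")
HAS_CURRENCY = re.compile(r"[$元]")  # "$" or "元", per the assumed reading


def matches_predetermined_structure(field: str) -> bool:
    """True if the field contains a Chinese character, a digit,
    and at least one of the two currency marks."""
    return bool(
        HAS_CHINESE.search(field)
        and HAS_DIGIT.search(field)
        and HAS_CURRENCY.search(field)
    )


def candidate_fields(html: str):
    """Split the source at tags (an assumed notion of 'field') and keep
    the text pieces that match the predetermined structure."""
    pieces = re.split(r"<[^>]+>", html)
    return [p.strip() for p in pieces if matches_predetermined_structure(p)]


html = "<td>应付总额：128.50元</td><td>数量</td><td>$20</td>"
candidates = candidate_fields(html)
```

Because "元" is itself a Chinese character, a bare field such as "128元" passes all three checks under this reading, whereas "$20" does not.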

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310575302.7A CN103605724A (en) 2013-11-15 2013-11-15 Webpage-text semantic feature based on-line retail sales computation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310575302.7A CN103605724A (en) 2013-11-15 2013-11-15 Webpage-text semantic feature based on-line retail sales computation method

Publications (1)

Publication Number Publication Date
CN103605724A true CN103605724A (en) 2014-02-26

Family

ID=50123946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310575302.7A Pending CN103605724A (en) 2013-11-15 2013-11-15 Webpage-text semantic feature based on-line retail sales computation method

Country Status (1)

Country Link
CN (1) CN103605724A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101295303A (en) * 2007-04-28 2008-10-29 李树德 Knowledge search engine based on intelligent noumenon and implementing method thereof
WO2013029217A1 (en) * 2011-08-26 2013-03-07 Nokia Corporation Method and apparatus for generating customizable and consolidated viewable web content collected from one or more sources
CN103034640A (en) * 2011-09-30 2013-04-10 腾讯科技(深圳)有限公司 Analysis method and system of page information
CN103218363A (en) * 2012-01-19 2013-07-24 腾讯科技(深圳)有限公司 Information processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IMEN AKERMI ET AL.: ""Hybrid Method for Computing Word-Pair"", 《PROCEEDING OF THE 2ND INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE,MINING AND SEMANTICS》 *
陈浩: ""自定义主题信息抽取的研究与应用"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140226