CN113239274B - Behavior big data automatic acquisition system - Google Patents

Publication number
CN113239274B
Authority
CN
China
Prior art keywords
webpage
character
module
user
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110548065.XA
Other languages
Chinese (zh)
Other versions
CN113239274A (en)
Inventor
贾博文
尹立航
陈月阳
付宁娴
段韶鹏
杨贝贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Vocational University of Information and Technology
Original Assignee
Zhengzhou Vocational University of Information and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Vocational University of Information and Technology
Priority to CN202110548065.XA
Publication of CN113239274A
Application granted
Publication of CN113239274B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3438 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment; monitoring of user actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a behavior big data automatic acquisition system. The system comprises: a webpage acquisition module, which records the webpages browsed by a user; a webpage caching module, which records the webpages browsed by the user within a certain period in order of browsing time, each webpage corresponding to one browsing time; a webpage counting module, which converts the webpages in the webpage caching module into webpage data; a digital arrangement module, which arranges all the webpage data in the webpage caching module into a webpage array; a character output module, which inputs the webpage array into a support vector machine whose output is the user character (that is, the user's personality); and a webpage recommending module, which pushes webpages to the user according to the user character obtained by the character output module. The invention records the user's access sequence as an array and processes the access sequence and access times with a support vector machine to output the user character, thereby completing acquisition of the user character.

Description

Behavior big data automatic acquisition system
Technical Field
The invention relates to the field of data acquisition, in particular to an automatic behavior big data acquisition system.
Background
Big data is a product of the current high-tech era: massive amounts of data are processed to obtain a desired result. One application is recommending content that interests a user while the user accesses the internet. Before recommending, a system collects the user's internet access records, derives the user's preferences from those records, and then recommends suitable content according to the preferences, which increases the user's effective browsing volume and improves the experience of accessing the internet. At present, such collection stores the types of webpages the user has accessed and judges the user's preferences from the count of each stored type. This approach does not reflect the order in which webpages were browsed, so it yields only the user's preferences and not the user's personality. Because the label of each webpage comprises two attributes, user personality and user preference, a recommendation based on a single attribute is inaccurate and, in serious cases, degrades the user experience. If the webpage types were instead stored in access order, a large amount of cache space would be needed, which would make internet access very slow and would still degrade the user experience.
Disclosure of Invention
The invention aims to overcome the problems in the prior art by providing a behavior big data automatic acquisition system that records the user's access sequence as an array and processes the access sequence and access times with a support vector machine to output the user character, thereby completing acquisition of the user character.
Therefore, the invention provides a behavior big data automatic acquisition system, which comprises:
the webpage acquisition module is used for monitoring the access of a user to a website and recording the webpages browsed by the user;
the webpage caching module is used for sequentially recording webpages browsed by a user within a certain time according to the browsing time sequence, and each webpage corresponds to one browsing time;
the webpage counting module is used for searching the webpage number of each webpage in the webpage caching module in a database, digitizing the browsing time corresponding to the webpage to obtain digital time, and combining the number and the digital time to obtain webpage data;
the database is used for receiving the web pages and feeding back the corresponding web page numbers;
the digital arrangement module is used for arranging all the webpage data in the webpage cache module to obtain a webpage array;
the character output module is used for inputting the webpage array into a support vector machine, the output of which is the user character;
and the webpage recommending module is used for pushing to the user, according to the user character obtained by the character output module, the webpages whose labels match that user character.
Further, the database receives a webpage and feeds back the corresponding webpage number through the following steps:
receiving a webpage whose webpage number is to be acquired;
extracting the link of the webpage from the webpage;
disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
digitizing each character string in sequence so that each character string corresponds to a number;
and outputting the sequentially obtained numbers as the webpage number.
Further, the method for digitizing the character string includes the following steps:
disassembling the character strings to obtain sequentially arranged characters;
acquiring a number corresponding to each character;
weighting the number corresponding to each character by using a set weight according to the length of the character string;
and calculating to obtain the number corresponding to the character string.
Further, to obtain the weight, the length of the character string is obtained first; the weight proportion corresponding to that length is then looked up in a weight library; and finally the weight proportion is assigned to each character of the string in sequence. The weight library stores lengths and the weight proportion corresponding to each length.
Further, the webpage data consist of a plurality of digits and comprise a front digit region and a rear digit region. The front digit region contains a first set number of digit slots and the rear digit region contains a second set number of digit slots, each slot storing one digit. The front digit region stores the webpage number and the rear digit region stores the digital time; a slot in which nothing is stored outputs 0.
Further, the support vector machine is trained prior to use.
Further, the character output module normalizes the web page array before inputting the web page array into a support vector machine.
The behavior big data automatic acquisition system provided by the invention has the following beneficial effects:
1. the invention records the user's access sequence as an array and processes the access sequence and access times with a support vector machine to output the user character, thereby completing acquisition of the user character;
2. the invention normalizes the array before processing it, which keeps the data uniform and consistent when the array is processed.
Drawings
FIG. 1 is a schematic block diagram of the overall connection of a behavior big data automatic acquisition system provided by the present invention;
FIG. 2 is a schematic block diagram of a process of receiving a web page and feeding back a corresponding web page number by the database of the present invention;
FIG. 3 is a schematic block diagram of a process for digitizing a string according to the present invention.
Detailed Description
One embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the invention is not limited to the embodiment.
Specifically, as shown in FIGS. 1 to 3, an embodiment of the present invention provides a behavior big data automatic acquisition system comprising a webpage acquisition module, a webpage caching module, a webpage counting module, a database, a digital arrangement module, a character output module, and a webpage recommending module. The function and operation of each module are described in detail below.
The webpage acquisition module is used for monitoring the user's access to websites and recording the webpages browsed by the user. The module is a silent background recording program integrated in the browser or in the networked computer; when recording, it stores the webpages browsed by the user in a storage space dedicated to that user.
The webpage caching module is used for sequentially recording the webpages browsed by the user within a certain period, in order of browsing time, with each webpage corresponding to one browsing time. The module defines how the storage space is organized: entries are generally stored as a list, each webpage has a corresponding browsing time, and the browsing time is the time at which the user browsed the webpage.
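As a minimal sketch of such a list-based cache, the following Python keeps (link, browsing time) pairs in browsing order and discards entries older than a time window. The class name `WebpageCache`, the `window_seconds` parameter, and the use of seconds as the time unit are illustrative assumptions; the patent does not specify them.

```python
from collections import deque
from typing import Deque, List, Tuple


class WebpageCache:
    """Ordered (link, browsing_time) pairs within a time window (hypothetical design)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._entries: Deque[Tuple[str, float]] = deque()

    def record(self, link: str, browsing_time: float) -> None:
        # Visits are assumed to arrive in browsing order, so appending keeps the list ordered.
        self._entries.append((link, browsing_time))
        # Drop entries that fall outside the window ending at the newest visit.
        cutoff = browsing_time - self.window_seconds
        while self._entries and self._entries[0][1] < cutoff:
            self._entries.popleft()

    def entries(self) -> List[Tuple[str, float]]:
        return list(self._entries)


cache = WebpageCache(window_seconds=3600)
cache.record("https://example.com/news/sports", 1000.0)
cache.record("https://example.com/shop/shoes", 2500.0)
cache.record("https://example.com/news/tech", 5000.0)
```

After the third call, the first visit is more than 3600 seconds older than the newest one and is dropped, so only the last two entries remain.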
The webpage counting module is used for looking up, in the database, the webpage number of each webpage in the webpage caching module, digitizing the browsing time corresponding to the webpage to obtain a digital time, and combining the webpage number and the digital time to obtain webpage data. The module expresses every webpage in the list and its browsing time in numeric form: the webpage is represented by its webpage number and the browsing time by its digital time, and together they make up the webpage data.
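The patent does not say how a browsing time is digitized. One possible reading, shown here purely as an assumption, is to write the time of day as the integer HHMMSS; the function name `digitize_time` is hypothetical.

```python
from datetime import datetime


def digitize_time(browsing_time: datetime) -> int:
    """Encode the time of day as the integer HHMMSS (an assumed encoding)."""
    return browsing_time.hour * 10000 + browsing_time.minute * 100 + browsing_time.second


digital_time = digitize_time(datetime(2021, 5, 19, 9, 30, 15))
```

Under this encoding, 09:30:15 becomes 93015. Any other fixed, monotonic encoding would serve the same purpose in the sketch.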
The database is used for receiving webpages and feeding back the corresponding webpage numbers; the database stores its data as a list.
The digital arrangement module is used for arranging all the webpage data in the webpage caching module to obtain a webpage array. Each item of webpage data is a number, so arranging all of them yields an array, the webpage array, from which a vector can be constructed.
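The arrangement step can be sketched as follows, assuming each item of webpage data is paired with its browsing time, the array is ordered by that time, and a fixed length is reached by keeping the most recent items and padding with zeros. The function name `arrange`, the pairing, and the padding rule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def arrange(webpage_data: Sequence[Tuple[float, int]], length: int) -> List[int]:
    """Return a fixed-length array of webpage data ordered by browsing time.

    webpage_data holds (browsing_time, number) pairs; both the pairing and the
    zero padding are assumptions, not details from the patent.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    ordered = [number for _, number in sorted(webpage_data, key=lambda item: item[0])]
    recent = ordered[-length:]                      # keep the most recent items
    return recent + [0] * (length - len(recent))    # pad short histories with zeros


webpage_array = arrange([(3.0, 30), (1.0, 10), (2.0, 20)], length=5)
```

A fixed length matters because a support vector machine expects every input vector to have the same dimension.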
The character output module inputs the webpage array into a support vector machine, and the output of the support vector machine is the user character. This module uses a machine learning model: a support vector machine serves as the trained model and maps the webpage array to a user character based on its training data, so that the user's character is inferred from past experience and the result is more accurate.
The webpage recommending module is used for pushing to the user, according to the user character obtained by the character output module, the webpages whose labels match that user character. The module recommends webpages suited to the user, which improves the user's browsing efficiency and internet experience.
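A minimal sketch of this matching step is given below. It assumes each webpage label is a small record with a `character` attribute and a `preference` attribute, following the two label attributes mentioned in the Background; the field names, links, and character values are hypothetical.

```python
from typing import Dict, List


def recommend(pages: Dict[str, Dict[str, str]], user_character: str) -> List[str]:
    """Return the links of pages whose label carries the given user character."""
    return sorted(link for link, label in pages.items()
                  if label.get("character") == user_character)


pages = {
    "https://example.com/a": {"character": "outgoing", "preference": "sports"},
    "https://example.com/b": {"character": "reserved", "preference": "reading"},
    "https://example.com/c": {"character": "outgoing", "preference": "travel"},
}
recommended = recommend(pages, "outgoing")
```

Here only pages `a` and `c` are returned, since their labels carry the character the classifier is assumed to have output.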
In this technical scheme, the webpage acquisition module obtains the websites the user has visited, the visited websites are cached, and the cached websites are digitized and arranged into an array. The array has a fixed number of elements and is generally ordered by time, which makes it an accurate record. The support vector machine processes the array to output the user character, and webpages suited to that character are then obtained, which improves the user's browsing efficiency and internet experience.
In this embodiment, the database receives a webpage and feeds back the corresponding webpage number through the following steps:
(I) receiving a webpage whose webpage number is to be acquired;
(II) extracting the link of the webpage from the webpage;
(III) disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
(IV) digitizing each character string in turn, so that each character string corresponds to a number;
(V) outputting the sequentially obtained numbers as the webpage number.
Steps (I) to (V) convert a webpage into a webpage number, which in this technical scheme is derived from the link of the webpage. Step (I) acquires the webpage; step (II) obtains the link from the webpage; step (III) processes the link into an ordered sequence of character strings; step (IV) digitizes the character strings so that each one corresponds to a number; and step (V) outputs the numbers obtained, in sequence, as a single number, namely the webpage number.
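The "set rule" for disassembling a link is not defined in the patent. The sketch below assumes one plausible rule: split the host name on dots, then split the path on slashes, keeping the original order. The function name `disassemble_link` is hypothetical.

```python
from typing import List
from urllib.parse import urlsplit


def disassemble_link(link: str) -> List[str]:
    """Split a link into ordered strings: host labels first, then path segments.

    This splitting rule is an assumption; the patent only says "a set rule".
    """
    parts = urlsplit(link)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    path_segments = [segment for segment in parts.path.split("/") if segment]
    return host_labels + path_segments


strings = disassemble_link("https://news.example.com/sports/2021/article.html")
```

Each resulting string would then be digitized in turn, and the numbers output in the same order to form the webpage number.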
Meanwhile, in this embodiment, when digitizing the character string, the method includes the following steps:
(1) disassembling the character strings to obtain sequentially arranged characters;
(2) acquiring a number corresponding to each character;
(3) weighting the number corresponding to each character by using a set weight according to the length of the character string;
(4) and calculating to obtain the number corresponding to the character string.
Steps (1) to (4) describe how each character string is converted into a number. In step (1), the character string is disassembled into the characters that form it, in order. In step (2), each character is mapped in turn to a number. In step (3), the number corresponding to each character is weighted, with the weight set according to the length of the character string, so that the resulting number stays within a set range no matter how long the character string is. In step (4), the weighted values are combined by calculation to obtain the number corresponding to the character string.
Meanwhile, in this embodiment, to obtain the weight, the length of the character string is obtained first; the weight proportion corresponding to that length is then looked up in a weight library; and finally the weight proportion is assigned to each character of the string in sequence. The weight library stores lengths and the weight proportion corresponding to each length. In this way every character string has a corresponding weight, and once the weight is applied to each character the number of the character string can be obtained.
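Steps (1) to (4) and the weight lookup can be sketched as below. Several details are assumptions rather than the patent's method: each character is mapped to its Unicode code point, the weight library maps a length to a single proportion applied to every character, missing lengths fall back to 1/length, and the final calculation is a weighted sum.

```python
from typing import Dict

# Hypothetical weight library: string length -> weight proportion per character.
WEIGHT_LIBRARY: Dict[int, float] = {1: 1.0, 2: 0.5, 4: 0.25}


def digitize_string(text: str, weight_library: Dict[int, float] = WEIGHT_LIBRARY) -> float:
    chars = list(text)                                  # step (1): split into characters
    numbers = [ord(char) for char in chars]             # step (2): assumed mapping (code point)
    # Step (3): look up the weight by length; fall back to 1/length (assumption).
    weight = weight_library.get(len(chars), 1.0 / len(chars)) if chars else 0.0
    # Step (4): weighted sum of the character numbers.
    return sum((weight * number for number in numbers), 0.0)


value = digitize_string("ab")
```

With a weight of 1/length the result is the mean of the character numbers, so it stays between the smallest and largest character number however long the string is, which is one way to keep the output within a set range.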
In this embodiment, the webpage data consist of a plurality of digits and comprise a front digit region and a rear digit region. The front digit region contains a first set number of digit slots and the rear digit region contains a second set number of digit slots, each slot storing one digit. The front digit region stores the webpage number and the rear digit region stores the digital time; a slot in which nothing is stored outputs 0.
In this technical scheme, every item of webpage data is represented with a set number of digits, so that values of different lengths occupy the same number of digits. All entries in the webpage array formed later therefore have the same number of digits and the same dimensions, which makes the input data more consistent during training and the training result more accurate.
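The two-region layout can be illustrated with the sketch below. The slot counts (8 and 6), the right-aligned placement of digits, and the function names are assumptions; the patent only states that each region has a set number of slots and that an unused slot outputs 0.

```python
from typing import List


def to_slots(value: int, slot_count: int) -> List[int]:
    """Place the decimal digits of value right-aligned in slot_count slots; unused slots are 0."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = [int(digit) for digit in str(value)]
    if len(digits) > slot_count:
        raise ValueError("value does not fit in the region")
    return [0] * (slot_count - len(digits)) + digits


def make_webpage_data(page_number: int, digital_time: int,
                      front_slots: int = 8, rear_slots: int = 6) -> List[int]:
    # Front region holds the webpage number, rear region holds the digital time.
    return to_slots(page_number, front_slots) + to_slots(digital_time, rear_slots)


data = make_webpage_data(page_number=97598, digital_time=93015)
```

Every item produced this way has the same length (14 slots with the assumed sizes), which is what keeps the dimensions of the later webpage array consistent.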
In this embodiment, the support vector machine is trained prior to use. This yields a trained learning model, so that in subsequent use the user character it outputs comes closer to, and ideally matches, the user's actual character.
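The patent does not specify the kernel, the training data, or the character classes. As a rough illustration of "training before use", the sketch below trains a simplified linear support vector machine by sub-gradient descent on the regularized hinge loss. The toy two-element inputs scaled to [0, 1], the labels +1 and -1 (standing for two hypothetical character classes), and all hyperparameters are assumptions.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(samples: Sequence[Sequence[float]], labels: Sequence[int],
                     lr: float = 0.1, lam: float = 0.01,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Sub-gradient descent on the regularized hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # The sample violates the margin: move the boundary to classify it better.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularization term contributes to the update.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy training set: two hypothetical character classes, linearly separable.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]]
train_y = [-1, -1, 1, 1]
weights, bias = train_linear_svm(train_x, train_y)
```

Once trained, `predict(weights, bias, array)` plays the role of the character output step for a new array of the same length. A production system would more likely use an established SVM library and more than two classes.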
In this embodiment, the character output module normalizes the webpage array before inputting it into the support vector machine. Normalization reduces the complexity of the array and makes subsequent processing more convenient.
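The normalization method is not named in the patent. Min-max scaling to the range [0, 1] is one common choice and is used here only as an assumed example; the function name `min_max_normalize` is hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are equal; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


normalized = min_max_normalize([10, 20, 40])
```

Scaling every element to a common range keeps large webpage numbers from dominating the smaller time values when the array is passed to the support vector machine.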
The above discloses only a few specific embodiments of the present invention. The present invention is not limited to these embodiments, and any variations that can be conceived by those skilled in the art are intended to fall within its scope.

Claims (7)

1. An automated behavioral big data acquisition system, comprising:
the webpage acquisition module is used for monitoring the access of a user to a website and recording the webpages browsed by the user;
the webpage caching module is used for sequentially recording webpages browsed by a user within a certain time according to the browsing time sequence, and each webpage corresponds to one browsing time;
the webpage counting module is used for searching the webpage number of each webpage in the webpage caching module in a database, digitizing the browsing time corresponding to the webpage to obtain digital time, and combining the number and the digital time to obtain webpage data;
the database is used for: receiving a webpage whose webpage number is to be acquired; extracting the link of the webpage from the webpage; disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings; acquiring a number corresponding to each character; weighting the number corresponding to each character by a set weight according to the length of the character string; calculating the number corresponding to the character string; digitizing each character string in sequence so that each character string corresponds to a number; and outputting the sequentially obtained numbers as the webpage number;
the digital arrangement module is used for arranging all the webpage data in the webpage cache module to obtain a webpage array;
the character output module is used for inputting the webpage array into a support vector machine, the output of which is the user character;
and the webpage recommending module is used for pushing to the user, according to the user character obtained by the character output module, the webpages whose labels match that user character.
2. The automated behavioral big data acquisition system according to claim 1, wherein the database, when receiving a webpage and feeding back the corresponding webpage number, performs the following steps:
receiving a webpage whose webpage number is to be acquired;
extracting the link of the webpage from the webpage;
disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
digitizing each character string in sequence so that each character string corresponds to a number;
and outputting the sequentially obtained numbers as the webpage number.
3. The automated behavioral big data acquisition system according to claim 2, wherein the step of digitizing the character string comprises the steps of:
disassembling the character strings to obtain sequentially arranged characters;
acquiring a number corresponding to each character;
weighting the number corresponding to each character by using a set weight according to the length of the character string;
and calculating to obtain the number corresponding to the character string.
4. The automated behavioral big data acquisition system according to claim 3, wherein, to acquire the weight, the length of the character string is first acquired, the weight proportion corresponding to that length is then looked up in a weight library, and finally the weight proportion is assigned to each character of the string in sequence; the weight library stores lengths and the weight proportion corresponding to each length.
5. The automated behavioral big data acquisition system according to claim 1, wherein the webpage data consist of a plurality of digits and comprise a front digit region and a rear digit region, the front digit region contains a first set number of digit slots, the rear digit region contains a second set number of digit slots, and each digit slot stores one digit; the front digit region stores the webpage number, the rear digit region stores the digital time, and a digit slot in which nothing is stored outputs 0.
6. The automated behavioral big data acquisition system according to claim 1, wherein the support vector machine is trained prior to use.
7. The automated behavioral big data acquisition system according to claim 1, wherein the character output module normalizes the webpage array before inputting the webpage array into the support vector machine.
CN202110548065.XA 2021-05-19 2021-05-19 Behavior big data automatic acquisition system Active CN113239274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110548065.XA CN113239274B (en) 2021-05-19 2021-05-19 Behavior big data automatic acquisition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110548065.XA CN113239274B (en) 2021-05-19 2021-05-19 Behavior big data automatic acquisition system

Publications (2)

Publication Number Publication Date
CN113239274A CN113239274A (en) 2021-08-10
CN113239274B true CN113239274B (en) 2022-05-17

Family

ID=77137763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110548065.XA Active CN113239274B (en) 2021-05-19 2021-05-19 Behavior big data automatic acquisition system

Country Status (1)

Country Link
CN (1) CN113239274B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063053A (en) * 2018-07-20 2018-12-21 北京开普云信息科技有限公司 A kind of method and system that web-site map reconstructs automatically
CN110674404A (en) * 2019-09-27 2020-01-10 北京京东振世信息技术有限公司 Link information generation method, device, system, storage medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10043197B1 (en) * 2012-06-14 2018-08-07 Rocket Fuel Inc. Abusive user metrics
US20180052939A1 (en) * 2016-08-22 2018-02-22 Qualcomm Incorporated Systems and methods for categorizing webpage bookmarks
CN108132950A (en) * 2016-12-01 2018-06-08 阿里巴巴集团控股有限公司 Information displaying method, information providing method, apparatus and system
CN107247789A (en) * 2017-06-16 2017-10-13 成都布林特信息技术有限公司 user interest acquisition method based on internet
CN110007842A (en) * 2019-04-18 2019-07-12 北京冠群信息技术股份有限公司 Web page contents choosing method and device
CN110196954A (en) * 2019-06-14 2019-09-03 深圳市珍爱捷云信息技术有限公司 Webpage backspacing processing method of extensive makeup, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113239274A (en) 2021-08-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant