CN113239274B - Behavior big data automatic acquisition system - Google Patents
- Publication number: CN113239274B (application CN202110548065.XA)
- Authority: CN (China)
- Prior art keywords: webpage, character, module, user, character string
- Prior art date
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3438—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment monitoring of user actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an automatic behavior big data acquisition system comprising: a webpage acquisition module, which records the webpages browsed by a user; a webpage caching module, which records the webpages browsed by the user within a certain time in browsing-time order, each webpage corresponding to one browsing time; a webpage counting module, which converts the webpages in the webpage caching module into webpage data; a digital arrangement module, which arranges all the webpage data in the webpage caching module to obtain a webpage array; a character output module, which inputs the webpage array into a support vector machine and obtains the user character (the user's personality) as its output; and a webpage recommending module, which pushes webpages to the user according to the user character obtained by the character output module. The invention records the user's access sequence as an array and applies a support vector machine to the access sequence and access times to output the user character, thereby completing the acquisition of the user character.
Description
Technical Field
The invention relates to the field of data acquisition, in particular to an automatic behavior big data acquisition system.
Background
Big data is a product of the current high-tech era: massive amounts of data are processed to obtain a desired result. One application is recommending content that interests a user while the user is browsing the internet. Before recommending, the system collects the user's internet access records, derives the user's preferences from them, and then recommends suitable content according to those preferences, which increases the user's effective browsing volume and improves the browsing experience. At present, collection stores only the types of webpages the user has visited, and the user's preference is judged from the count of each stored type. This approach does not capture the order in which webpages were browsed, so it yields only the user's preferences and not the user's personality. Because the label of each webpage carries two attributes, user personality and user preference, a recommendation based on a single attribute is inaccurate and, in serious cases, degrades the user experience. Storing the webpage types in access order would instead require a large amount of cache space, which slows internet access considerably and likewise degrades the user experience.
Disclosure of Invention
The invention aims to overcome the problems in the prior art by providing an automatic behavior big data acquisition system that records the user's visit sequence as an array and applies a support vector machine to the visit sequence and visit times to output the user character, thereby completing the acquisition of the user character.
Therefore, the invention provides a behavior big data automatic acquisition system, which comprises:
the webpage acquisition module is used for monitoring the access of a user to a website and recording the webpages browsed by the user;
the webpage caching module is used for sequentially recording webpages browsed by a user within a certain time according to the browsing time sequence, and each webpage corresponds to one browsing time;
the webpage counting module is used for searching the webpage number of each webpage in the webpage caching module in a database, digitizing the browsing time corresponding to the webpage to obtain digital time, and combining the number and the digital time to obtain webpage data;
the database is used for receiving the web pages and feeding back the corresponding web page numbers;
the digital arrangement module is used for arranging all the webpage data in the webpage cache module to obtain a webpage array;
the character output module is used for inputting the webpage array into a support vector machine, whose output is the user character;
and the webpage recommending module is used for pushing the webpage consistent with the user character in the label of the webpage to the user according to the user character obtained by the character outputting module.
Further, when the database receives the web page and feeds back the corresponding web page number, the method comprises the following steps:
receiving a webpage with a webpage number to be acquired;
extracting a link of a webpage from the webpage;
disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
digitizing each character string in sequence so that each character string corresponds to a digit;
and outputting the sequentially obtained numbers as the webpage numbers.
Further, the method for digitizing the character string includes the following steps:
disassembling the character strings to obtain sequentially arranged characters;
acquiring a number corresponding to each character;
weighting the number corresponding to each character by using a set weight according to the length of the character string;
and calculating to obtain the number corresponding to the character string.
Further, when the weight is obtained, the length of the character string is obtained first; the corresponding weight proportion is then looked up in a weight library according to that length; finally, the weight proportion is assigned in sequence to each corresponding character. The weight library stores lengths and the weight proportion corresponding to each length.
Further, the webpage data consists of a plurality of numbers arranged in a front digital area and a rear digital area. The front digital area contains a first set number of digital vacancies and the rear digital area contains a second set number of digital vacancies, each vacancy being able to store one number. The front digital area stores the webpage number and the rear digital area stores the digital time; a vacancy in which nothing is stored outputs 0.
Further, the support vector machine is trained prior to use.
Further, the character output module normalizes the web page array before inputting the web page array into a support vector machine.
The behavior big data automatic acquisition system provided by the invention has the following beneficial effects:
1. the invention records the user's access sequence as an array and applies a support vector machine to the access sequence and access times to output the user character, thereby completing the collection of the user character;
2. the invention normalizes the array before processing it, which keeps the data uniform in scale and format during processing.
Drawings
FIG. 1 is a schematic block diagram of the overall connection of a behavior big data automatic acquisition system provided by the present invention;
FIG. 2 is a schematic block diagram of a process of receiving a web page and feeding back a corresponding web page number by the database of the present invention;
FIG. 3 is a schematic block diagram of a process for digitizing a string according to the present invention.
Detailed Description
One embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the invention is not limited to the embodiment.
Specifically, as shown in fig. 1 to 3, an embodiment of the present invention provides an automatic behavior big data acquisition system, including: the system comprises a webpage acquisition module, a webpage cache module, a webpage counting module, a database, a number arrangement module, a character output module and a webpage recommendation module. The function and operation of each module will be described in detail below.
The webpage acquisition module is used for monitoring the access of a user to a website and recording the webpages browsed by the user. The module is a silent background recording program integrated in the browser or in the internet-connected computer; when recording, it stores the webpages browsed by the user in a storage space dedicated to that user.
The webpage caching module is used for sequentially recording webpages browsed by a user within a certain time according to the browsing time sequence, and each webpage corresponds to one browsing time. The module defines how the storage space is organized: entries are generally stored as a list, each webpage corresponds to one browsing time, and the browsing time is the time at which the user browsed the webpage.
The webpage counting module is used for searching the webpage number of each webpage in the webpage caching module in a database, digitizing the browsing time corresponding to the webpage to obtain digital time, and combining the number and the digital time to obtain webpage data. The module expresses every webpage in the list held in the storage space, together with its browsing time, in digital form; this digital form is the webpage data. The webpage is represented by the webpage number and the browsing time by the digital time, both of which are numeric.
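The description does not say how the browsing time is digitized. The following is a minimal sketch under one assumed convention: the time of day is written as four digits, hour then minute. The function name `digital_time` and the HHMM layout are illustrative assumptions, not part of the patent.

```python
from datetime import datetime
from typing import List


def digital_time(browse_time: datetime) -> List[int]:
    """Turn a browsing time into digits (assumed layout: H, H, M, M)."""
    # strftime("%H%M") always yields four characters, e.g. "0930".
    return [int(ch) for ch in browse_time.strftime("%H%M")]


print(digital_time(datetime(2021, 5, 19, 9, 30)))
```

Any other fixed-width encoding (seconds since midnight, a full timestamp) would serve equally well, provided every webpage uses the same one.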
The database is used for receiving the web pages and feeding back the corresponding web page numbers; the database stores the data in a list manner.
The digital arrangement module is used for arranging all the webpage data in the webpage cache module to obtain a webpage array. The module arranges all the cached webpage data; each item of webpage data is a number, so arranging the items yields an array, the webpage array, from which a vector can be constructed.
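The arrangement step can be sketched as follows, assuming each cached record is a pair of browsing time and webpage data digits, that records are ordered by browsing time, and that the array keeps a fixed number of the most recent pages, padded with zeros. The names `webpage_array`, `max_pages` and `width` are hypothetical.

```python
from datetime import datetime
from typing import List, Sequence, Tuple

Record = Tuple[datetime, Sequence[int]]


def webpage_array(records: Sequence[Record], max_pages: int, width: int) -> List[int]:
    """Flatten cached webpage data into one fixed-length array.

    Assumes every record carries exactly `width` digits.
    """
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    # Order by browsing time, then keep the most recent max_pages records.
    recent = sorted(records, key=lambda r: r[0])[-max_pages:]
    flat = [d for _, digits in recent for d in digits]
    # Pad with zeros so the array length does not depend on how many pages were cached.
    return flat + [0] * (max_pages * width - len(flat))


cache = [
    (datetime(2021, 5, 19, 10, 0), [2, 2]),
    (datetime(2021, 5, 19, 9, 0), [1, 1]),
]
print(webpage_array(cache, max_pages=3, width=2))
```

A fixed length matters because a support vector machine expects every input vector to have the same dimension.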
The character output module inputs the webpage array into a support vector machine, whose output is the user character. This module uses a machine learning model: a support vector machine serves as the trained model and converts the webpage array into a user character based on its prior training, so the user's character is inferred from past experience and the result is more accurate.
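The patent does not state which kernel is used or how more than two character classes are handled. For reference only, the standard two-class support vector machine decision function has the form below, where x would be the webpage array, the x_i are training arrays with labels y_i in {-1, +1}, the alpha_i and b are learned in training, and K is the chosen kernel. Extending this to several character classes, for example with one classifier per class, is an assumption here.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad \alpha_i \ge 0
```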
The webpage recommending module is used for pushing the webpage consistent with the user character in the label of the webpage to the user according to the user character obtained by the character outputting module. The module recommends webpages suited to the user, which improves the user's browsing efficiency and browsing experience.
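Matching webpage labels against the user character can be sketched as a simple filter. The page records, the `label` dictionary with a `character` key, and the example character names are all hypothetical; the patent only states that the label of a webpage includes the user character.

```python
from typing import Dict, List


def recommend(pages: List[Dict], user_character: str) -> List[str]:
    """Return the URLs of pages whose label carries the given user character."""
    return [
        page["url"]
        for page in pages
        if page["label"].get("character") == user_character
    ]


catalog = [
    {"url": "https://example.com/a", "label": {"character": "cautious", "preference": "finance"}},
    {"url": "https://example.com/b", "label": {"character": "impulsive", "preference": "games"}},
    {"url": "https://example.com/c", "label": {"character": "cautious", "preference": "news"}},
]
print(recommend(catalog, "cautious"))
```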
In this technical scheme, the webpage acquisition module obtains the websites the user has visited, the visited websites are cached, and the cached websites are digitized and arranged into an array. The number of entries in the array is fixed, and the entries are generally ordered by browsing time, which keeps the array an accurate record of the browsing sequence. The support vector machine processes the array to obtain the user's character, and webpages suited to that character are then selected, improving the user's browsing efficiency and browsing experience.
In this embodiment, when the database receives a web page and feeds back a corresponding web page number, the method includes the following steps:
(I) receiving a webpage with a webpage number to be acquired;
(II) extracting a link of the webpage from the webpage;
(III) disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
(IV) digitizing each character string in turn, so that each character string corresponds to a number;
(V) outputting the sequentially obtained numbers as the webpage number.
Steps (I) to (V) above convert a webpage into a webpage number; in this technical scheme the webpage number is derived from the webpage's link. Step (I) acquires the webpage. Step (II) obtains the webpage's link from the webpage. Step (III) processes the link to obtain sequentially arranged character strings. Step (IV) digitizes the character strings so that each character string corresponds to one number. Step (V) outputs the numbers obtained, in order, as a single number, namely the webpage number.
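The link-to-number steps above can be sketched as follows. The "set rule" for disassembling the link is not specified in the patent, so this sketch assumes the host and path are cut at `.`, `/`, `-` and `_`. The per-string digitizer `placeholder_digit` is a deliberately simple stand-in (sum of code points modulo 10) and is not the weighted scheme of the embodiment.

```python
import re
from typing import List
from urllib.parse import urlsplit


def split_link(link: str) -> List[str]:
    """Disassemble a link into ordered strings (assumed rule: cut at . / - _)."""
    parts = urlsplit(link)
    raw = parts.netloc + parts.path
    return [s for s in re.split(r"[./\-_]+", raw) if s]


def placeholder_digit(text: str) -> int:
    """Stand-in digitizer: map a string to one digit in the range 0-9."""
    return sum(ord(ch) for ch in text) % 10


def webpage_number(link: str) -> List[int]:
    """Digitize each string in turn and output the digits in link order."""
    return [placeholder_digit(s) for s in split_link(link)]


link = "https://news.example.com/sports/2021"
print(split_link(link))
print(webpage_number(link))
```

Because the number is computed from the link alone, the same webpage always receives the same number without storing the link itself.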
Meanwhile, in this embodiment, when digitizing the character string, the method includes the following steps:
(1) disassembling the character strings to obtain sequentially arranged characters;
(2) acquiring a number corresponding to each character;
(3) weighting the number corresponding to each character by using a set weight according to the length of the character string;
(4) calculating to obtain the number corresponding to the character string.
Steps (1) to (4) describe how each character string is converted into a number. Step (1) disassembles the character string into the characters that form it, giving sequentially arranged characters. Step (2) maps each character, in order, to a number. Step (3) weights the number corresponding to each character, with the weight set according to the length of the character string, so that however long the character string is, the resulting number stays within a set range and is therefore bounded. Step (4) performs the calculation that yields the number corresponding to the character string.
Meanwhile, in this embodiment, when the weight is obtained, the length of the character string is obtained first; the corresponding weight proportion is then looked up in a weight library according to that length; finally, the weight proportion is assigned in sequence to each corresponding character. The weight library stores lengths and the weight proportion corresponding to each length. With this weighting method, each character string has a corresponding weight, so that once the weight has been applied to each character, the number of the character string can be obtained.
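A minimal sketch of the length-based weighting follows. The contents of `WEIGHT_LIBRARY`, the uniform fallback for lengths it does not list, and the mapping of a character to `ord(c) % 10` are all assumptions made for illustration; the patent states only that weights are looked up by string length and applied to each character in sequence.

```python
from typing import Dict, Tuple

# Hypothetical weight library: string length -> per-position weight proportions.
# Each tuple sums to 1 so the weighted result stays within 0-9.
WEIGHT_LIBRARY: Dict[int, Tuple[float, ...]] = {
    1: (1.0,),
    2: (0.6, 0.4),
    3: (0.5, 0.3, 0.2),
}


def weights_for(length: int) -> Tuple[float, ...]:
    """Look up the weights for a length; fall back to uniform weights (assumption)."""
    if length in WEIGHT_LIBRARY:
        return WEIGHT_LIBRARY[length]
    return tuple(1.0 / length for _ in range(length))


def char_number(ch: str) -> int:
    """Assumed character-to-number mapping: last decimal digit of the code point."""
    return ord(ch) % 10


def digitize_string(text: str) -> int:
    """Steps (1)-(4): split into characters, number them, weight, and combine."""
    if not text:
        raise ValueError("cannot digitize an empty string")
    weights = weights_for(len(text))
    total = sum(w * char_number(ch) for w, ch in zip(weights, text))
    return int(total)  # truncate to a single digit in 0-9


print(digitize_string("ab"))  # 0.6 * 7 + 0.4 * 8 = 7.4, truncated to 7
```

Since the weights for every length sum to 1, the result is a weighted average of digits and cannot leave the range 0 to 9, which is the bounded behavior the embodiment describes.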
In this embodiment, the webpage data consists of a plurality of numbers arranged in a front digital region and a rear digital region. The front digital region contains a first set number of digital slots and the rear digital region contains a second set number of digital slots, each slot being able to store one number. The front digital region stores the webpage number and the rear digital region stores the digital time; a slot in which nothing is stored outputs 0.
In this technical scheme, every item of webpage data is represented with a set number of digits, so items that would otherwise have different digit counts have the same length. All entries in the subsequently formed webpage array then have the same number of digits and the same dimension, which makes the input data more consistent during training and the training result more accurate.
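The fixed-width layout can be sketched as below. The slot counts (`FRONT_SLOTS = 6`, `REAR_SLOTS = 4`), filling from the left, and truncating digits that do not fit are assumptions; the patent specifies only the two regions and that an unfilled slot outputs 0.

```python
from typing import Iterable, List

FRONT_SLOTS = 6  # assumed "first set number" of slots, for the webpage number
REAR_SLOTS = 4   # assumed "second set number" of slots, for the digital time


def fill_slots(digits: Iterable[int], size: int) -> List[int]:
    """Place digits into `size` slots; unfilled slots output 0, extras are dropped."""
    kept = list(digits)[:size]
    return kept + [0] * (size - len(kept))


def webpage_data(number_digits: Iterable[int], time_digits: Iterable[int]) -> List[int]:
    """Front region holds the webpage number, rear region holds the digital time."""
    return fill_slots(number_digits, FRONT_SLOTS) + fill_slots(time_digits, REAR_SLOTS)


print(webpage_data([3, 1, 4], [0, 9, 3, 0]))
```

Every item of webpage data then has exactly `FRONT_SLOTS + REAR_SLOTS` digits, regardless of how many strings the link contained.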
In this embodiment, the support vector machine is trained prior to use. This yields a trained learning model, so that in subsequent use the user character obtained is closer to, and ideally consistent with, the user's actual character.
In this embodiment, the character output module performs normalization processing on the web page array before inputting the web page array into the support vector machine. Through normalization processing, the complexity of the obtained array can be reduced, and the array is more convenient in subsequent processing.
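The patent does not name the normalization method. One common choice, shown here purely as an assumed example, is min-max scaling, which maps every entry of the webpage array into the range 0 to 1 before it is passed to the support vector machine.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All entries are equal; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([0, 5, 10]))
```

In practice the scaling bounds would be fixed from the training data and reused at prediction time, so that training and prediction inputs share one scale.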
The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.
Claims (7)
1. An automated behavioral big data acquisition system, comprising:
the webpage acquisition module is used for monitoring the access of a user to a website and recording the webpages browsed by the user;
the webpage caching module is used for sequentially recording webpages browsed by a user within a certain time according to the browsing time sequence, and each webpage corresponds to one browsing time;
the webpage counting module is used for searching the webpage number of each webpage in the webpage caching module in a database, digitizing the browsing time corresponding to the webpage to obtain digital time, and combining the number and the digital time to obtain webpage data;
the database is used for receiving the webpage with the webpage number to be acquired; extracting a link of the webpage from the webpage; disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings; acquiring a number corresponding to each character; weighting the number corresponding to each character by using a set weight according to the length of the character string; calculating to obtain the number corresponding to the character string; digitizing each character string in sequence so that each character string corresponds to a number; and outputting the sequentially obtained numbers as the webpage numbers;
the digital arrangement module is used for arranging all the webpage data in the webpage cache module to obtain a webpage array;
the character output module is used for inputting the webpage array into a support vector machine, whose output is the user character;
and the webpage recommending module is used for pushing the webpage consistent with the user character in the label of the webpage to the user according to the user character obtained by the character outputting module.
2. The automated behavioral big data acquisition system according to claim 1, wherein the database, when receiving the web pages and feeding back the corresponding web page numbers, comprises the following steps:
receiving a webpage with a webpage number to be acquired;
extracting a link of a webpage from the webpage;
disassembling the link according to a set rule to obtain a plurality of sequentially arranged character strings;
digitizing each character string in sequence so that each character string corresponds to a digit;
and outputting the sequentially obtained numbers as the webpage numbers.
3. The automated behavioral big data acquisition system according to claim 2, wherein the step of digitizing the character string comprises the steps of:
disassembling the character strings to obtain sequentially arranged characters;
acquiring a number corresponding to each character;
weighting the number corresponding to each character by using a set weight according to the length of the character string;
and calculating to obtain the number corresponding to the character string.
4. The automated behavioral big data acquisition system according to claim 3, wherein when acquiring the weight, the length of the character string is firstly acquired, a corresponding weight proportion is searched in a weight library according to the length of the character string, and finally the corresponding weight proportion is sequentially given to each corresponding character; the weight library stores lengths and the weight proportion corresponding to each length.
5. The automated behavioral big data acquisition system according to claim 1, wherein the web page data is composed of a plurality of digits, the web page data includes two front and rear digit regions, the front digit region includes a first predetermined number of digit slots, the rear digit region includes a second predetermined number of digit slots, each digit slot is capable of storing a digit; the digital area in front is used for storing the webpage number, the digital area in back is used for storing the digital time, and when the digital vacancy is not stored, the output is 0.
6. The automated behavioral big data acquisition system according to claim 1, wherein the support vector machine is trained prior to use.
7. The automated behavioral big data acquisition system according to claim 1, wherein the character output module normalizes the webpage array before inputting the webpage array into the support vector machine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110548065.XA CN113239274B (en) | 2021-05-19 | 2021-05-19 | Behavior big data automatic acquisition system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110548065.XA CN113239274B (en) | 2021-05-19 | 2021-05-19 | Behavior big data automatic acquisition system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113239274A (en) | 2021-08-10 |
CN113239274B (en) | 2022-05-17 |
Family
ID=77137763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110548065.XA Active CN113239274B (en) | 2021-05-19 | 2021-05-19 | Behavior big data automatic acquisition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113239274B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063053A (en) * | 2018-07-20 | 2018-12-21 | 北京开普云信息科技有限公司 | A kind of method and system that web-site map reconstructs automatically |
CN110674404A (en) * | 2019-09-27 | 2020-01-10 | 北京京东振世信息技术有限公司 | Link information generation method, device, system, storage medium and electronic equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10043197B1 (en) * | 2012-06-14 | 2018-08-07 | Rocket Fuel Inc. | Abusive user metrics |
US20180052939A1 (en) * | 2016-08-22 | 2018-02-22 | Qualcomm Incorporated | Systems and methods for categorizing webpage bookmarks |
CN108132950A (en) * | 2016-12-01 | 2018-06-08 | 阿里巴巴集团控股有限公司 | Information displaying method, information providing method, apparatus and system |
CN107247789A (en) * | 2017-06-16 | 2017-10-13 | 成都布林特信息技术有限公司 | user interest acquisition method based on internet |
CN110007842A (en) * | 2019-04-18 | 2019-07-12 | 北京冠群信息技术股份有限公司 | Web page contents choosing method and device |
CN110196954A (en) * | 2019-06-14 | 2019-09-03 | 深圳市珍爱捷云信息技术有限公司 | Webpage backspacing processing method of extensive makeup, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113239274A (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102831199B (en) | Method and device for establishing interest model | |
US20190362267A1 (en) | Method of and system for generating a prediction model and determining an accuracy of a prediction model | |
US8751466B1 (en) | Customizable answer engine implemented by user-defined plug-ins | |
WO2022142519A1 (en) | Information recommendation method and apparatus, and electronic device and storage medium | |
CN111708740A (en) | Mass search query log calculation analysis system based on cloud platform | |
WO2011008848A2 (en) | Activity based users' interests modeling for determining content relevance | |
CN111488137B (en) | Code searching method based on common attention characterization learning | |
CN100462969C (en) | Method for providing and inquiry information for public by interconnection network | |
US20170329845A1 (en) | Methods and apparatuses for content preparation and/or selection | |
CN108959413B (en) | Topic webpage crawling method and topic crawler system | |
US20150356202A1 (en) | Methods and apparatus for identifying concepts corresponding to input information | |
JP2008538149A (en) | Rating method, search result organizing method, rating system, and search result organizing system | |
CN106407316B (en) | Software question and answer recommendation method and device based on topic model | |
CN101211368B (en) | Method for classifying search term, device and search engine system | |
CN108959550B (en) | User focus mining method, device, equipment and computer readable medium | |
CN111353095A (en) | Intelligent information management system based on SEO optimization | |
CN112417133A (en) | Training method and device of ranking model | |
CN115130601A (en) | Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion | |
US10157222B2 (en) | Methods and apparatuses for content preparation and/or selection | |
CN113239274B (en) | Behavior big data automatic acquisition system | |
CN110851708B (en) | Negative sample extraction method, device, computer equipment and storage medium | |
CN111651675A (en) | UCL-based user interest topic mining method and device | |
US20220222430A1 (en) | Providing user-specific previews within text | |
CN113392329A (en) | Content recommendation method and device, electronic equipment and storage medium | |
CN115687810A (en) | Webpage searching method and device and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||