CN112182451A - Webpage content abstract generation method, equipment, storage medium and device - Google Patents

Webpage content abstract generation method, equipment, storage medium and device

Info

Publication number
CN112182451A
CN112182451A CN202010983986.4A CN202010983986A CN112182451A CN 112182451 A CN112182451 A CN 112182451A CN 202010983986 A CN202010983986 A CN 202010983986A CN 112182451 A CN112182451 A CN 112182451A
Authority
CN
China
Prior art keywords
webpage
web page
displayed
content
generating
Prior art date
Legal status
Pending
Application number
CN202010983986.4A
Other languages
Chinese (zh)
Inventor
程波
叶志钢
Current Assignee
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd
Priority to CN202010983986.4A
Publication of CN112182451A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a method, equipment, a storage medium and a device for generating a webpage content abstract. The method comprises the following steps: when a webpage access request is received, extracting information from the webpage access request to obtain user identification information and webpage address information; searching for a historical browsing record corresponding to the user identification information, and determining a current user portrait according to the historical browsing record; and determining a webpage to be displayed according to the webpage address information, and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait. Compared with the existing approach of directly displaying the whole webpage content, the invention generates the current user portrait from the historical browsing record and generates the content abstract of the webpage to be displayed according to that portrait. This overcomes the inability of the prior art to generate a content abstract for the webpage to be displayed, so that a webpage content abstract matching the user's browsing habits can be generated automatically and reading efficiency is improved.

Description

Webpage content abstract generation method, equipment, storage medium and device
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a storage medium, and a device for generating a webpage content abstract.
Background
With the rapid development of information technology, modern society has entered an era of information explosion. People increasingly rely on networks to find the information they need, so browsing web pages to acquire information has become an indispensable part of work and life.
At present, when a user browses a webpage, the server usually displays the content of the whole webpage directly. However, most web pages contain a large amount of content, so the user has to read the whole page to learn its main points, which is inefficient.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a method, equipment, a storage medium and a device for generating a webpage content abstract, and aims to solve the technical problem of how to automatically generate the webpage content abstract similar to the browsing habit of a user.
In order to achieve the above object, the present invention provides a method for generating a summary of web page contents, which comprises:
when a webpage access request is received, extracting information of the webpage access request to obtain user identification information and webpage address information;
searching a historical browsing record corresponding to the user identification information, and determining a current user portrait according to the historical browsing record;
and determining a webpage to be displayed according to the webpage address information, and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait.
Preferably, the searching for the historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record specifically include:
searching a historical browsing record corresponding to the user identification information in a preset mapping relation table, wherein the preset mapping relation table comprises the corresponding relation between the user identification information and the historical browsing record;
preprocessing the historical browsing records based on a preset text mining algorithm to obtain data to be analyzed;
and establishing a user tag model according to the data to be analyzed, and generating a current user portrait according to the user tag model.
Preferably, the establishing a user tag model according to the data to be analyzed, and generating a current user portrait according to the user tag model specifically include:
carrying out feature analysis on the data to be analyzed to obtain text feature data;
determining text similarity through a preset similarity calculation model according to the text feature data;
establishing a user tag model according to the text similarity through a preset recommendation algorithm and a preset machine learning algorithm;
and generating a current user portrait according to the user tag model and a preset prediction model.
Preferably, the determining a to-be-displayed web page according to the web page address information, and generating a target web page content abstract based on the to-be-displayed web page and the current user portrait specifically includes:
searching a webpage to be displayed corresponding to the webpage address information, and determining words to be processed according to the webpage to be displayed;
counting the occurrence times of the words to be processed in the webpage content to be processed;
determining a weight value of the word to be processed according to the current user portrait, and generating a total score of the word to be processed according to the weight value and the occurrence frequency;
and screening the webpage content to be processed according to the total score to obtain a target webpage content abstract.
Preferably, the searching for the to-be-displayed web page corresponding to the web page address information and determining the to-be-processed word according to the to-be-displayed web page specifically include:
searching a webpage to be displayed corresponding to the webpage address information, and extracting the content of the webpage to be displayed to obtain the content of the webpage to be displayed;
performing data cleaning on the webpage content to be displayed to obtain the webpage content to be processed;
and segmenting words of the webpage content to be processed according to a preset word segmentation model to obtain words to be processed.
Preferably, after determining the content of the webpage to be displayed according to the webpage address information and generating the target webpage content abstract based on the content of the webpage to be displayed and the current user portrait, the method for generating the webpage content abstract further includes:
determining the webpage category of the webpage to be displayed according to the webpage content to be displayed, and generating a display template of the webpage to be displayed according to the webpage category;
and displaying the webpage content to be displayed and the target webpage content abstract on the basis of the display template.
Preferably, the extracting information from the webpage access request when the webpage access request is received, to obtain user identification information and webpage address information, specifically includes:
when a webpage access request is received, extracting an identifier of the webpage access request to obtain an information identifier;
and screening the information of the webpage access request according to the information identification to obtain user identification information and webpage address information.
In addition, in order to achieve the above object, the present invention further provides a web content summary generating device, which includes a memory, a processor and a web content summary generating program stored in the memory and operable on the processor, wherein the web content summary generating program is configured to implement the steps of the web content summary generating method as described above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which a web content summary generation program is stored, and the web content summary generation program, when executed by a processor, implements the steps of the web content summary generation method as described above.
In addition, in order to achieve the above object, the present invention further provides a web content summary generating device, which includes an extraction module, a determination module and a generation module;
the extraction module is used for extracting information of the webpage access request when the webpage access request is received, and obtaining user identification information and webpage address information;
the determining module is used for searching a historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record;
and the generating module is used for determining a webpage to be displayed according to the webpage address information and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait.
In the invention, when a webpage access request is received, information extraction is carried out on the webpage access request, user identification information and webpage address information are obtained, a historical browsing record corresponding to the user identification information is searched, a current user portrait is determined according to the historical browsing record, a webpage to be displayed is determined according to the webpage address information, and a target webpage content abstract is generated based on the webpage to be displayed and the current user portrait; compared with the existing mode of directly displaying the whole webpage content, the method and the device generate the current user portrait according to the historical browsing record and generate the webpage content abstract of the webpage to be displayed according to the current user portrait, overcome the defect that the webpage content abstract of the webpage to be displayed cannot be generated in the prior art, and therefore the webpage content abstract close to the browsing habit of the user can be automatically generated, and the reading efficiency is improved.
Drawings
FIG. 1 is a schematic structural diagram of a web page content summary generation device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for generating a summary of web page contents according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for generating a summary of web page contents according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for generating a summary of web page contents according to a third embodiment of the present invention;
FIG. 5 is a block diagram illustrating a first embodiment of an apparatus for generating a summary of web page contents according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a device for generating a summary of web page content in a hardware operating environment according to an embodiment of the present invention.
As shown in FIG. 1, the web page content digest generation apparatus may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen, and optionally may further include a standard wired interface and a wireless interface; in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in FIG. 1 does not limit the web page content summary generation apparatus, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in FIG. 1, memory 1005, identified as one type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a web content summarization program.
In the apparatus for generating a summary of web page content shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the web content digest generation apparatus invokes, through the processor 1001, a web content digest generation program stored in the memory 1005, and executes the web content digest generation method provided by the embodiment of the present invention.
Based on the hardware structure, the embodiment of the webpage content abstract generation method is provided.
Referring to FIG. 2, FIG. 2 is a flowchart illustrating a method for generating a summary of web page contents according to a first embodiment of the present invention.
Step S10: when a webpage access request is received, information extraction is carried out on the webpage access request, and user identification information and webpage address information are obtained.
It should be noted that the execution subject of this embodiment is the web content summary generating device, where the web content summary generating device may be an electronic device such as a mobile phone, a computer, and a server, and may also be other devices that can achieve the same or similar functions.
It should be understood that the web page access request may be an access request sent by a user terminal device, where the user terminal device may be a device that establishes a communication connection with the web page content summary generation device in advance, and the user terminal device may receive an instruction input by a user; the user identification information may be information for identifying a user identity, for example, a user ID, and the like, which is not limited in this embodiment; the web page address information may be an address of a web site that the user wants to access.
It should be understood that extracting information from the web page access request to obtain the user identification information and the web page address information may proceed as follows: when a web page access request is received, identifiers are extracted from the request to obtain information identifiers, and the request is then screened according to those information identifiers to obtain the user identification information and the web page address information.
Step S20: searching for a historical browsing record corresponding to the user identification information, and determining the current user portrait according to the historical browsing record.
It can be understood that searching for the historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record may proceed as follows: the historical browsing record corresponding to the user identification information is looked up in a preset mapping relation table, which comprises the correspondence between user identification information and historical browsing records; the historical browsing record is preprocessed based on a preset text mining algorithm to obtain data to be analyzed; a user tag model is established according to the data to be analyzed; and the current user portrait is generated according to the user tag model.
It should be noted that the preset mapping relationship table includes a corresponding relationship between the user identification information and the historical browsing records, where the corresponding relationship may be determined according to the historical access information of the user. For example, when a user accesses a webpage, the webpage content abstract generating device correspondingly stores user identification information and access information into a preset mapping relation table; the predetermined text mining algorithm may be at least one algorithm of TF-IDF, TopicModel and LDA, which is not limited by the embodiment.
It should be understood that the preprocessing is performed on the historical browsing records based on the preset text mining algorithm, and the obtaining of the data to be analyzed may be performing data extraction on the historical browsing records based on the preset text mining algorithm to obtain extracted data, and performing data cleaning on the extracted data to obtain the data to be analyzed.
It can be understood that the step of establishing the user tag model according to the data to be analyzed and generating the current user portrait according to the user tag model may be to search a user tag model corresponding to the data to be analyzed in a preset tag library and directly generate the current user portrait based on the user tag model, where the preset tag library includes a corresponding relationship between the data to be analyzed and the user tag model, and the corresponding relationship may be set according to an actual use habit of a user.
Step S30: determining a webpage to be displayed according to the webpage address information, and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait.
It should be understood that determining the web page to be displayed according to the web page address information and generating the target web page content abstract based on the web page to be displayed and the current user portrait may proceed as follows: the web page to be displayed corresponding to the web page address information is found; the words to be processed are determined according to the web page to be displayed; the number of occurrences of each word to be processed in the web page content to be processed is counted; a weight value for each word to be processed is determined according to the current user portrait; a total score for each word is generated according to its weight value and number of occurrences; and the web page content to be processed is screened according to the total scores to obtain the target web page content abstract.
In a specific implementation, take for example the sentence "Geoffrey Hinton is the godfather of deep learning, and I love deep learning". The number of times each distinct word appears in the sentence is counted: "deep" and "learning" each appear twice, and the remaining words appear only once.
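The counting step above can be sketched in a few lines of Python. This is only an illustration: the lowercase regular-expression tokenizer and the function name `count_words` are assumptions made here, standing in for the preset word segmentation model, which this embodiment does not specify.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count how often each word occurs in the text.

    Tokenization (lowercase ASCII letters only) is a simplifying
    assumption; a real system would use a word segmentation model.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


sentence = ("Geoffrey Hinton is the godfather of deep learning, "
            "and I love deep learning")
counts = count_words(sentence)
print(counts.most_common(2))
```

Here "deep" and "learning" are each counted twice and every other word once, matching the description.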
It should be understood that words that are frequently viewed by the user may be determined based on the current user representation and may be given a higher weight value.
It can be understood that, generating the total score of the word to be processed according to the weight value and the occurrence number may be directly multiplying the weight value and the occurrence number to obtain the total score of the word to be processed.
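A minimal sketch of this scoring rule, assuming the user portrait has already been reduced to a word-to-weight dictionary (the name `portrait_weights`, its values, and the default weight of 1.0 for words absent from the portrait are all hypothetical):

```python
from collections import Counter


def total_scores(counts, weights, default_weight=1.0):
    """Total score of each word = portrait weight * occurrence count.

    Words missing from the portrait fall back to default_weight,
    which is an assumption of this sketch.
    """
    return {word: weights.get(word, default_weight) * n
            for word, n in counts.items()}


counts = Counter({"deep": 2, "learning": 2, "love": 1})
portrait_weights = {"deep": 1.5, "learning": 2.0}  # hypothetical weights
scores = total_scores(counts, portrait_weights)
print(scores)
```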
It can be understood that, the filtering of the to-be-processed web page content according to the total score to obtain the target web page content abstract may be to sort the to-be-processed web page content according to the total score to obtain a sorting result, and generate the target web page content abstract according to the sorting result, for example, to sort the to-be-processed words of the to-be-processed web page content according to the total score from large to small, to use the to-be-processed words ranked in the top five positions as abstract words, and to generate the target web page content abstract according to the abstract words.
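The ranking described above might look like the following sketch. The scores are illustrative values, the alphabetical tie-break is an assumption added so the ordering is deterministic, and how the abstract text is then composed from the selected words is left open, as in the description.

```python
def top_summary_words(scores, k=5):
    """Return the k highest-scoring words, largest score first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


scores = {"learning": 4.0, "deep": 3.0, "hinton": 1.0, "love": 1.0,
          "godfather": 1.0, "the": 1.0, "is": 1.0}
summary_words = top_summary_words(scores)
print(summary_words)
```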
In a first embodiment, when a webpage access request is received, information extraction is carried out on the webpage access request, user identification information and webpage address information are obtained, a historical browsing record corresponding to the user identification information is searched, a current user portrait is determined according to the historical browsing record, a webpage to be displayed is determined according to the webpage address information, and a target webpage content abstract is generated on the basis of the webpage to be displayed and the current user portrait; compared with the existing mode of directly displaying the whole webpage content, the method for displaying the webpage content directly generates the current user portrait according to the historical browsing record and generates the webpage content abstract of the webpage to be displayed according to the current user portrait, overcomes the defect that the webpage content abstract of the webpage to be displayed cannot be generated in the prior art, can automatically generate the webpage content abstract similar to the browsing habit of the user, and improves the reading efficiency.
Referring to FIG. 3, FIG. 3 is a flowchart illustrating a second embodiment of the method for generating a summary of web page content according to the present invention, and the second embodiment of the method for generating a summary of web page content according to the present invention is provided based on the first embodiment illustrated in FIG. 2.
In the second embodiment, the step S10 includes:
Step S101: when a webpage access request is received, extracting the identification of the webpage access request to obtain an information identification.
It should be noted that the web page access request may be an access request sent by a user terminal device, where the user terminal device may be a device that establishes a communication connection with the web page content summary generation device in advance, and the user terminal device may receive an instruction input by a user; the information identifier may be an identifier for identifying a type of information, such as a user information identifier, and the like, which is not limited in this embodiment.
It should be understood that extracting identifiers from the web page access request to obtain the information identifier may proceed as follows: the web page access request is traversed; for each piece of current web page access request information, it is determined whether that information is of an identifier type; if so, it is taken as information to be extracted; and after the traversal of the web page access request is completed, the information identifier is generated according to the information to be extracted.
Step S102: screening the information of the webpage access request according to the information identification to obtain user identification information and webpage address information.
It should be understood that the user identification information may be information for identifying the user identity, such as a user ID, and the like, which is not limited in this embodiment; the web page address information may be an address of a web site that the user wants to access.
It can be understood that, the information screening is performed on the webpage access request according to the information identifier, and the obtaining of the user identifier information and the webpage address information may be to judge whether the information identifier is a user identifier, and when the information identifier is the user identifier, the webpage access request information corresponding to the user identifier is used as the user identifier information; and judging whether the information identifier is a website address identifier, and taking the webpage access request information corresponding to the website address identifier as webpage address information when the information identifier is the website address identifier.
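The traversal and screening of steps S101 and S102 can be illustrated with a short sketch. It assumes the request has already been parsed into a dictionary of identifier/value pairs; the identifier names (`uid`, `user_id`, `url`, `target_url`) and the function name `screen_request` are hypothetical and not taken from the embodiment.

```python
USER_ID_KEYS = {"user_id", "uid"}      # hypothetical user identifiers
ADDRESS_KEYS = {"url", "target_url"}   # hypothetical address identifiers


def screen_request(request: dict):
    """Traverse the request fields and pick out the user identification
    information and the web page address information by identifier."""
    user_id = None
    address = None
    for identifier, value in request.items():
        if identifier in USER_ID_KEYS:
            user_id = value
        elif identifier in ADDRESS_KEYS:
            address = value
    return user_id, address


request = {"uid": "u42", "url": "https://example.com/news", "lang": "zh"}
user_id, address = screen_request(request)
print(user_id, address)
```

Fields whose identifier matches neither set (such as `lang` here) are simply ignored, and a missing field yields `None`.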
In the second embodiment, when a webpage access request is received, the webpage access request is subjected to identification extraction to obtain an information identification, the webpage access request is subjected to information screening according to the information identification to obtain user identification information and webpage address information, so that the user identification information and the webpage address information can be screened out quickly and accurately, and the processing efficiency is improved.
In the second embodiment, the step S20 includes:
Step S201: searching a historical browsing record corresponding to the user identification information in a preset mapping relation table, wherein the preset mapping relation table comprises the corresponding relation between the user identification information and the historical browsing record.
It should be noted that the preset mapping relationship table includes a corresponding relationship between the user identification information and the historical browsing records, where the corresponding relationship may be determined according to the historical access information of the user. For example, when a user accesses a web page, the web page content summary generation device correspondingly stores user identification information and access information into a preset mapping relationship table.
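One simple way to picture the preset mapping relation table is an in-memory dictionary keyed by user identification information. The class and method names below are hypothetical, and a production system would presumably use a persistent store rather than process memory.

```python
from collections import defaultdict


class BrowsingHistoryTable:
    """Sketch of a mapping table: user identification -> browsing records."""

    def __init__(self):
        self._records = defaultdict(list)

    def record_access(self, user_id, access_info):
        """Store access information under the user's identification."""
        self._records[user_id].append(access_info)

    def lookup(self, user_id):
        """Return a copy of the user's history (empty if unknown)."""
        return list(self._records.get(user_id, []))


table = BrowsingHistoryTable()
table.record_access("u42", "https://example.com/deep-learning")
table.record_access("u42", "https://example.com/football")
history = table.lookup("u42")
print(history)
```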
Step S202: preprocessing the historical browsing records based on a preset text mining algorithm to obtain data to be analyzed.
It should be noted that the predetermined text mining algorithm may be at least one algorithm of TF-IDF, TopicModel and LDA, which is not limited by the embodiment.
It should be understood that the preprocessing is performed on the historical browsing records based on the preset text mining algorithm, and the obtaining of the data to be analyzed may be performing data extraction on the historical browsing records based on the preset text mining algorithm to obtain extracted data, and performing data cleaning on the extracted data to obtain the data to be analyzed.
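As a rough illustration of this preprocessing, the sketch below cleans each historical record and then weights its words with TF-IDF, one of the algorithms named above. The tag-stripping regular expression, the tokenizer, the unsmoothed `log(N / df)` form of IDF, and the order of cleaning before weighting are all assumptions of this sketch rather than details given in the embodiment.

```python
import math
import re
from collections import Counter


def clean(text):
    """Strip HTML-like tags and split into lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: tf-idf weight} dictionary per document."""
    tokenized = [clean(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        results.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return results


history = ["<p>Deep learning news</p>", "<p>Football news</p>"]
weights = tf_idf(history)
print(weights[0])
```

A word that appears in every record (here "news") receives weight zero, so the data to be analyzed emphasizes words that distinguish one record from another.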
Step S203: establishing a user tag model according to the data to be analyzed, and generating a current user portrait according to the user tag model.
It can be understood that the step of establishing the user tag model according to the data to be analyzed and generating the current user portrait according to the user tag model may be to search a user tag model corresponding to the data to be analyzed in a preset tag library and directly generate the current user portrait based on the user tag model, where the preset tag library includes a corresponding relationship between the data to be analyzed and the user tag model, and the corresponding relationship may be set according to an actual use habit of a user.
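The direct lookup in a preset tag library could be sketched as follows. The library contents, the keyword-to-tag representation, and the idea of expressing the portrait as the share of matched keywords per tag are all hypothetical choices made for illustration; the embodiment does not fix the form of the tag library or of the portrait.

```python
from collections import Counter

# Hypothetical preset tag library: keyword -> interest tag.
TAG_LIBRARY = {
    "football": "sports",
    "basketball": "sports",
    "learning": "technology",
    "neural": "technology",
}


def build_portrait(keywords):
    """Return {tag: share of matched keywords} as a simple user portrait."""
    tags = Counter(TAG_LIBRARY[k] for k in keywords if k in TAG_LIBRARY)
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()} if total else {}


portrait = build_portrait(["learning", "neural", "football", "weather"])
print(portrait)
```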
Further, in practical applications, if the user tag model corresponding to the data to be analyzed is looked up directly in a preset tag library and the current user portrait is generated from it, too few factors are taken into account in generating the user portrait, and accuracy is inevitably low. To overcome this drawback, step S203 includes:
carrying out feature analysis on the data to be analyzed to obtain text feature data;
determining text similarity through a preset similarity calculation model according to the text feature data;
establishing a user tag model according to the text similarity through a preset recommendation algorithm and a preset machine learning algorithm;
and generating a current user portrait according to the user tag model and a preset prediction model.
It can be understood that carrying out feature analysis on the data to be analyzed to obtain the text feature data may be performing classification analysis on the data based on at least one of a classification algorithm model and a clustering algorithm model. The classification algorithm model can be used to predict information for new users and for users with incomplete information, and to classify users predictively; the classification algorithm may be at least one of KNN, a neural network, a Bayesian network, and an SVM, which is not limited in this embodiment. The clustering algorithm model can be used to analyze and mine group information sharing the same characteristics, for audience segmentation and market segmentation.
It should be noted that the preset similarity calculation model may be at least one of a Euclidean distance model, a Pearson similarity model, and a cosine similarity model, which is not limited in this embodiment.
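As an illustration only, the three similarity measures named above can be sketched as follows. The function names and the representation of text feature data as equal-length lists of numbers are assumptions made for this sketch, not details given in the embodiment.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson correlation, computed as the cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])
```

Euclidean distance is a distance rather than a similarity, so a caller that mixes these measures would need to convert or rank accordingly.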
It should be noted that the preset recommendation algorithm may be at least one of the Apriori algorithm, NBI bipartite graph, the FP-Tree algorithm, and SVD matrix decomposition; the preset machine learning algorithm may be at least one of feature extraction modeling, feature selection modeling, and a prediction optimization model, which is not limited in this embodiment.
It should be appreciated that generating the current user portrait according to the user tag model and the preset prediction model may be inputting the user tag model into a prediction layer of a portrait generation script, where the prediction layer generates the current user portrait based on the preset prediction model, for example, supervised learning in machine learning, regression prediction in econometrics, or linear programming in mathematics.
In the second embodiment, step S30 includes:
Step S301: searching a webpage to be displayed corresponding to the webpage address information, and determining words to be processed according to the webpage to be displayed.
It should be understood that searching for the web page to be displayed corresponding to the web page address information may be directly resolving the web page address information to obtain the web page to be displayed.
It can be understood that determining the words to be processed according to the web page to be displayed may be directly segmenting the web page to be displayed with the preset word segmentation model to obtain the words to be processed.
Further, in order to improve the accuracy of generating the word to be processed, the step S301 includes:
searching a webpage to be displayed corresponding to the webpage address information, and extracting the content of the webpage to be displayed to obtain the content of the webpage to be displayed;
performing data cleaning on the webpage content to be displayed to obtain the webpage content to be processed;
and segmenting words of the webpage content to be processed according to a preset word segmentation model to obtain words to be processed.
It should be understood that the content extraction of the web page to be displayed may be performed by a content extraction script to obtain the web page content to be displayed, where the content extraction script may be based on Beautiful Soup or the like, which is not limited in this embodiment.
It is understood that the data cleaning of the web page content to be displayed to obtain the web page content to be processed may be identifying the incomplete, incorrect, inaccurate, and irrelevant parts of the web page content to be displayed, and then replacing, modifying, or deleting those parts.
It should be noted that the preset word segmentation model may be at least one of a mechanical word segmentation algorithm, a word segmentation algorithm based on n-grams, a word segmentation algorithm based on a hidden Markov model, and a word segmentation algorithm based on a conditional random field, which is not limited in this embodiment.
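The three sub-steps above (content extraction, data cleaning, word segmentation) might be sketched as follows. This is a minimal sketch under stated assumptions: it uses Python's standard `html.parser` in place of Beautiful Soup so that it has no third-party dependency, it treats cleaning as whitespace normalisation only, and it uses a simple regular expression as a stand-in for a mechanical word segmentation algorithm on English text. The names `TextExtractor`, `extract_content`, `clean_content`, and `segment_words` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_content(html):
    """Content extraction: return the text of the page without markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def clean_content(text):
    """Data cleaning (assumed here to be whitespace normalisation only)."""
    return re.sub(r"\s+", " ", text).strip()


def segment_words(text):
    """Word segmentation stand-in: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

Chinese text has no spaces between words, so a real deployment would replace `segment_words` with one of the segmentation models listed above.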
Step S302: counting the number of occurrences of the words to be processed in the webpage content to be processed.
In a specific implementation, taking the sentence "Geoffrey Hinton is the godfather of deep learning, and I love deep learning" as an example, it is necessary to count the number of times each distinct word appears in the sentence; here "deep" and "learning" each appear twice, and the remaining words appear only once.
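A minimal sketch of this counting step, using the standard library's `Counter` on a sentence like the one in the example above. The function name `count_occurrences` and the lowercase, letters-only tokenisation are assumptions made for illustration.

```python
import re
from collections import Counter


def count_occurrences(text):
    """Count how many times each distinct word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


counts = count_occurrences(
    "Geoffrey Hinton is the godfather of deep learning, and I love deep learning"
)
# counts["deep"] and counts["learning"] are both 2; every other word has count 1.
```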
Step S303: determining the weight value of the word to be processed according to the current user portrait, and generating the total score of the word to be processed according to the weight value and the number of occurrences.
It should be understood that words the user views frequently may be determined from the current user portrait and given a higher weight value.
It can be understood that, generating the total score of the word to be processed according to the weight value and the occurrence number may be directly multiplying the weight value and the occurrence number to obtain the total score of the word to be processed.
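The multiplication described above can be sketched as follows. The dictionary of per-word weights standing in for the current user portrait, and the default weight of 1.0 for words the portrait does not mention, are assumptions of this sketch.

```python
def total_scores(counts, weights, default_weight=1.0):
    """Total score of each word = weight value (from the portrait) x occurrence count."""
    return {
        word: weights.get(word, default_weight) * n
        for word, n in counts.items()
    }
```

For example, a word that appears twice and has a weight of 3.0 receives a total score of 6.0, while an unweighted word that appears once receives 1.0.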
Step S304: screening the webpage content to be processed according to the total score to obtain a target webpage content abstract.
It can be understood that, the filtering of the to-be-processed web page content according to the total score to obtain the target web page content abstract may be to sort the to-be-processed web page content according to the total score to obtain a sorting result, and generate the target web page content abstract according to the sorting result, for example, to sort the to-be-processed words of the to-be-processed web page content according to the total score from large to small, to use the to-be-processed words ranked in the top five positions as abstract words, and to generate the target web page content abstract according to the abstract words.
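The sorting and top-five selection described above might look like the sketch below. The embodiment does not specify how the abstract is built from the abstract words, so `pick_sentences`, which keeps the sentences containing the most abstract words, is only one assumed possibility; the alphabetical tie-break in `top_words` is likewise an assumption added to make the ordering deterministic.

```python
def top_words(scores, k=5):
    """Sort words by total score from large to small and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def pick_sentences(sentences, abstract_words, limit=1):
    """Assumed summary step: keep the sentences that contain the most abstract words."""
    keywords = set(abstract_words)

    def hits(sentence):
        return sum(
            1 for w in sentence.lower().split() if w.strip(".,!?") in keywords
        )

    return sorted(sentences, key=hits, reverse=True)[:limit]
```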
In a second embodiment, a webpage to be displayed corresponding to the webpage address information is searched, a word to be processed is determined according to the webpage to be displayed, the occurrence frequency of the word to be processed in the webpage content to be processed is counted, a weighted value of the word to be processed is determined according to the current user portrait, a total score of the word to be processed is generated according to the weighted value and the occurrence frequency, the webpage content to be processed is screened according to the total score, a target webpage content abstract is obtained, and therefore the webpage content abstract related to the user portrait can be accurately generated.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for generating a summary of web page contents according to a third embodiment of the present invention, which is provided based on the first embodiment shown in fig. 2.
In the third embodiment, after the step S30, the method further includes:
Step S40: determining the webpage category of the webpage to be displayed according to the webpage content to be displayed, and generating a display template of the webpage to be displayed according to the webpage category.
It should be noted that the web page category may be information identifying the category of the web page to be displayed, for example, identification information for news web pages, identification information for entertainment web pages, and the like, which is not limited in this embodiment.
It should be understood that web pages of different categories display their content in different ways, and therefore, when the target web page content abstract is displayed, it needs to be adjusted to the corresponding display manner.
It can be understood that, the determining the webpage category of the webpage to be displayed according to the content of the webpage to be displayed, and generating the display template of the webpage to be displayed according to the webpage category may be to perform content identification on the content of the webpage to be displayed, determine the webpage category of the webpage to be displayed according to an identification result, and search for the display template corresponding to the webpage category in a preset display template table, where the preset display template table includes a corresponding relationship between the webpage category and the display template, and the corresponding relationship is set according to an actual requirement of a user, which is not limited in this embodiment.
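As a rough sketch of the table lookup described above: the category names, keyword sets, and template names below are invented for illustration, and the keyword-counting classifier is only an assumed stand-in for the unspecified content identification step.

```python
# Hypothetical keyword sets used to identify the web page category.
CATEGORY_KEYWORDS = {
    "news": {"report", "government", "election"},
    "entertainment": {"movie", "music", "celebrity"},
}

# Hypothetical preset display template table: web page category -> display template.
DISPLAY_TEMPLATES = {
    "news": "summary_above_article",
    "entertainment": "summary_beside_media",
}
DEFAULT_TEMPLATE = "summary_above_article"


def classify_page(words):
    """Return the category whose keywords occur most often, or None if none occur."""
    best, best_hits = None, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for w in words if w in keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def select_template(category):
    """Look up the display template for a category, falling back to a default."""
    return DISPLAY_TEMPLATES.get(category, DEFAULT_TEMPLATE)
```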
Step S50: displaying the webpage content to be displayed and the target webpage content abstract on the basis of the display template.
It can be understood that, the displaying the to-be-displayed web page content and the target web page content abstract based on the display template may be aggregating the to-be-displayed web page content and the target web page content abstract onto the display template and displaying the aggregated display template.
In a third embodiment, the webpage category of the webpage to be displayed is determined according to the webpage content to be displayed, the display template of the webpage to be displayed is generated according to the webpage category, and the webpage content to be displayed and the target webpage content abstract are displayed based on the display template, so that the method can adapt to different webpage categories and improve user experience.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a webpage content summary generating program, and the webpage content summary generating program, when executed by a processor, implements the steps of the webpage content summary generating method described above.
In combination with the embodiment of the present invention, there is also a preferred implementation. When the web page content summary is generated and displayed on a user's intelligent terminal device (e.g., a smartphone), the browsing focus of each user is collected from the frequency with which that user has clicked each position of the interface during historical browsing. Experimental tests show that even when different users use the same intelligent device, their preferred screen attention areas and browsing comfort areas differ, and this difference means that, when a user drags a web page in order to click and browse, the user tends to drag the content to be clicked into his or her own browsing focus or browsing comfort area before operating. Therefore, the intelligent terminal device collects, in cooperation with web page click operations, the browsing focus positions corresponding to each user's historical click operations, so that when web page content summaries are presented, the summary with the highest score or the best match with the user's current browsing is generated at the corresponding browsing focus position or browsing comfort area. The user can then react immediately and obtain the most desired content with the simplest click operation.
In combination with the above preferred scheme, in a specific implementation, the placement of the web page content summary at the browsing focus can be fine-tuned by having the intelligent terminal identify whether it is being operated with the left hand alone or the right hand alone. For example, if the user's browsing focus is close to the left side of the screen and the intelligent terminal detects that the user is operating with the right hand, the frame width of the summary content at the browsing focus can be adaptively adjusted according to the screen size of the intelligent terminal, so that the user can still complete the click operation smoothly even when right-hand single-handed operation matches the browsing focus poorly.
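One way the browsing focus could be estimated from historical click positions is sketched below. The text does not say how click frequency is aggregated, so the grid-based histogram, the 3 x 3 grid size, and the function name `browsing_focus` are all assumptions made for illustration.

```python
from collections import Counter


def browsing_focus(clicks, screen_w, screen_h, cols=3, rows=3):
    """Return the centre of the grid cell the user has clicked most often.

    clicks is a list of (x, y) pixel positions from historical click operations.
    Returns None when there is no click history.
    """
    cells = Counter()
    for x, y in clicks:
        col = min(int(x * cols / screen_w), cols - 1)
        row = min(int(y * rows / screen_h), rows - 1)
        cells[(col, row)] += 1
    if not cells:
        return None
    (col, row), _ = cells.most_common(1)[0]
    return ((col + 0.5) * screen_w / cols, (row + 0.5) * screen_h / rows)
```

The returned point could then serve as the position at which the highest-scoring summary is placed; the width adjustment for one-handed operation is not covered by this sketch.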
In addition, referring to fig. 5, an embodiment of the present invention further provides a device for generating a summary of web page content, where the device for generating a summary of web page content includes: an extraction module 10, a determination module 20 and a generation module 30;
the extracting module 10 is configured to, when receiving a web page access request, extract information of the web page access request to obtain user identification information and web page address information.
It should be understood that the web page access request may be an access request sent by a user terminal device, where the user terminal device may be a device that establishes a communication connection with the web page content summary generation device in advance, and the user terminal device may receive an instruction input by a user; the user identification information may be information for identifying a user identity, for example, a user ID, and the like, which is not limited in this embodiment; the web page address information may be an address of a web site that the user wants to access.
It should be understood that, when a web page access request is received, extracting information from the web page access request to obtain user identification information and web page address information may be, when a web page access request is received, extracting an identification of the web page access request to obtain an information identification, and performing information screening on the web page access request according to the information identification to obtain user identification information and web page address information.
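A minimal sketch of this extraction, assuming the access request arrives as a URL whose query string carries the two fields. The field names `user_id` and `url`, which play the role of the information identifiers here, and the example host are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def extract_request_info(request_url):
    """Screen the request by field name and return (user identification, web page address)."""
    query = parse_qs(urlsplit(request_url).query)
    user_id = query.get("user_id", [None])[0]
    page_url = query.get("url", [None])[0]
    return user_id, page_url
```

Fields other than the two identifiers are ignored, and a missing field yields `None` rather than an error.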
The determining module 20 is configured to search a historical browsing record corresponding to the user identification information, and determine a current user portrait according to the historical browsing record.
It can be understood that searching the historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record may be: searching the historical browsing record corresponding to the user identification information in a preset mapping relation table, where the preset mapping relation table includes the corresponding relation between the user identification information and the historical browsing record; preprocessing the historical browsing record based on a preset text mining algorithm to obtain data to be analyzed; establishing a user tag model according to the data to be analyzed; and generating the current user portrait according to the user tag model.
It should be noted that the preset mapping relation table includes a corresponding relation between the user identification information and the historical browsing records, where the corresponding relation may be determined according to the historical access information of the user. For example, when a user accesses a webpage, the webpage content abstract generating device stores the user identification information and the access information together in the preset mapping relation table. The preset text mining algorithm may be at least one of TF-IDF, a topic model, and LDA, which is not limited in this embodiment.
It should be understood that the preprocessing is performed on the historical browsing records based on the preset text mining algorithm, and the obtaining of the data to be analyzed may be performing data extraction on the historical browsing records based on the preset text mining algorithm to obtain extracted data, and performing data cleaning on the extracted data to obtain the data to be analyzed.
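As an illustration of the lookup and preprocessing described above, the sketch below uses TF-IDF, one of the text mining algorithms named earlier. The in-memory dictionary standing in for the preset mapping relation table, the user ID, and the already-segmented page words are assumptions of this sketch.

```python
import math
from collections import Counter

# Hypothetical preset mapping relation table: user ID -> historical browsing records,
# each record being the list of words of one previously visited page.
HISTORY_TABLE = {
    "u42": [["deep", "learning", "deep"], ["learning", "music"]],
}


def tf_idf(documents):
    """Return one {word: TF-IDF score} dict per document.

    TF is the word's share of the document; IDF is log(N / number of documents
    containing the word), so words present in every document score 0.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))
    result = []
    for words in documents:
        counts = Counter(words)
        total = len(words)
        result.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return result


history = HISTORY_TABLE.get("u42", [])
data_to_analyze = tf_idf(history)
```

Words that score highly across a user's history are natural candidates for the tags from which the user tag model is built.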
It can be understood that the step of establishing the user tag model according to the data to be analyzed and generating the current user portrait according to the user tag model may be to search a user tag model corresponding to the data to be analyzed in a preset tag library and directly generate the current user portrait based on the user tag model, where the preset tag library includes a corresponding relationship between the data to be analyzed and the user tag model, and the corresponding relationship may be set according to an actual use habit of a user.
The generating module 30 is configured to determine a webpage to be displayed according to the webpage address information, and generate a target webpage content abstract based on the webpage to be displayed and the current user portrait.
It should be understood that the determining of the web page to be displayed according to the web page address information, and the generating of the target web page content abstract based on the web page to be displayed and the current user portrait may be searching for the web page to be displayed corresponding to the web page address information, determining the word to be processed according to the web page to be displayed, counting the number of occurrences of the word to be processed in the web page content to be processed, determining the weight value of the word to be processed according to the current user portrait, generating the total score of the word to be processed according to the weight value and the number of occurrences, and screening the web page content to be processed according to the total score to obtain the target web page content abstract.
In a specific implementation, taking the sentence "Geoffrey Hinton is the godfather of deep learning, and I love deep learning" as an example, it is necessary to count the number of times each distinct word appears in the sentence; here "deep" and "learning" each appear twice, and the remaining words appear only once.
It should be understood that words the user views frequently may be determined from the current user portrait and given a higher weight value.
It can be understood that, generating the total score of the word to be processed according to the weight value and the occurrence number may be directly multiplying the weight value and the occurrence number to obtain the total score of the word to be processed.
It can be understood that, the filtering of the to-be-processed web page content according to the total score to obtain the target web page content abstract may be to sort the to-be-processed web page content according to the total score to obtain a sorting result, and generate the target web page content abstract according to the sorting result, for example, to sort the to-be-processed words of the to-be-processed web page content according to the total score from large to small, to use the to-be-processed words ranked in the top five positions as abstract words, and to generate the target web page content abstract according to the abstract words.
In this embodiment, when a webpage access request is received, information is extracted from the webpage access request to obtain user identification information and webpage address information, a historical browsing record corresponding to the user identification information is searched, a current user portrait is determined according to the historical browsing record, a webpage to be displayed is determined according to the webpage address information, and a target webpage content abstract is generated based on the webpage to be displayed and the current user portrait. Compared with the existing mode of directly displaying the whole webpage content, this embodiment generates the current user portrait from the historical browsing record and generates the webpage content abstract of the webpage to be displayed according to that portrait, which overcomes the inability of the prior art to generate a webpage content abstract of the webpage to be displayed, so that a webpage content abstract matching the browsing habits of the user can be generated automatically and reading efficiency is improved.
In an embodiment, the determining module 20 is further configured to search a historical browsing record corresponding to the user identification information in a preset mapping relation table, where the preset mapping relation table includes a corresponding relation between the user identification information and the historical browsing record, pre-process the historical browsing record based on a preset text mining algorithm to obtain data to be analyzed, establish a user tag model according to the data to be analyzed, and generate a current user portrait according to the user tag model;
in an embodiment, the determining module 20 is further configured to perform feature analysis on the data to be classified to obtain text feature data, determine text similarity through a preset similarity calculation model according to the text feature data, establish a user tag model through a preset recommendation algorithm and a preset machine learning algorithm according to the text similarity, and generate a current user portrait according to the user tag model and a preset prediction model;
in an embodiment, the generating module 30 is further configured to perform feature analysis on the data to be classified to obtain text feature data, determine text similarity through a preset similarity calculation model according to the text feature data, establish a user tag model through a preset recommendation algorithm and a preset machine learning algorithm according to the text similarity, and generate a current user portrait according to the user tag model and a preset prediction model;
in an embodiment, the generating module 30 is further configured to search for a to-be-displayed web page corresponding to the web page address information, determine a word to be processed according to the to-be-displayed web page, count the occurrence frequency of the word to be processed in the to-be-processed web page content, determine a weight value of the word to be processed according to the current user portrait, generate a total score of the word to be processed according to the weight value and the occurrence frequency, and filter the to-be-processed web page content according to the total score to obtain a target web page content abstract;
in one embodiment, the apparatus for generating a summary of web page content further includes: a display module;
the display module is used for determining the webpage category of the webpage to be displayed according to the webpage content to be displayed, generating a display template of the webpage to be displayed according to the webpage category, and displaying the webpage content to be displayed and the target webpage content abstract on the basis of the display template;
in an embodiment, the extracting module 10 is configured to, when a web page access request is received, perform identifier extraction on the web page access request to obtain an information identifier, perform information screening on the web page access request according to the information identifier, and obtain user identifier information and web page address information.
Other embodiments or specific implementation manners of the web content summary generation apparatus according to the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order, but rather the words first, second, third, etc. are to be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., a Read Only Memory (ROM)/Random Access Memory (RAM), a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for generating a summary of web page contents is characterized in that the method for generating the summary of web page contents comprises the following steps:
when a webpage access request is received, extracting information of the webpage access request to obtain user identification information and webpage address information;
searching a historical browsing record corresponding to the user identification information, and determining a current user portrait according to the historical browsing record;
and determining a webpage to be displayed according to the webpage address information, and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait.
2. The method for generating a summary of web page contents according to claim 1, wherein the searching for the historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record specifically includes:
searching a historical browsing record corresponding to the user identification information in a preset mapping relation table, wherein the preset mapping relation table comprises the corresponding relation between the user identification information and the historical browsing record;
preprocessing the historical browsing records based on a preset text mining algorithm to obtain data to be analyzed;
and establishing a user label model according to the data to be analyzed, and generating a current user portrait according to the user label model.
3. The method for generating a summary of web page contents according to claim 2, wherein the establishing a behavior analysis model according to the data to be analyzed and generating a current user portrait according to the behavior analysis model specifically includes:
carrying out feature analysis on the data to be classified to obtain text feature data;
determining text similarity through a preset similarity calculation model according to the text characteristic data;
establishing a user tag model according to the text similarity through a preset recommendation algorithm and a preset machine learning algorithm;
and generating a current user portrait according to the user label model and a preset prediction model.
4. The method for generating a summary of web page contents according to claim 1, wherein the determining a web page to be displayed according to the web page address information, and generating a summary of target web page contents based on the web page to be displayed and the current user portrait comprises:
searching a webpage to be displayed corresponding to the webpage address information, and determining words to be processed according to the webpage to be displayed;
counting the occurrence times of the words to be processed in the webpage content to be processed;
determining a weight value of the word to be processed according to the current user portrait, and generating a total score of the word to be processed according to the weight value and the occurrence frequency;
and screening the webpage content to be processed according to the total score to obtain a target webpage content abstract.
5. The method for generating the summary of the web page content according to claim 4, wherein the searching for the web page to be displayed corresponding to the web page address information and determining the word to be processed according to the web page to be displayed specifically comprises:
searching a webpage to be displayed corresponding to the webpage address information, and extracting the content of the webpage to be displayed to obtain the content of the webpage to be displayed;
performing data cleaning on the webpage content to be displayed to obtain the webpage content to be processed;
and segmenting words of the webpage content to be processed according to a preset word segmentation model to obtain words to be processed.
6. The method for generating a summary of web page contents according to claim 5, wherein after determining the web page contents to be displayed according to the web page address information and generating a summary of target web page contents based on the web page contents to be displayed and the current user representation, the method for generating a summary of web page contents further comprises:
determining the webpage category of the webpage to be displayed according to the webpage content to be displayed, and generating a display template of the webpage to be displayed according to the webpage category;
and displaying the webpage content to be displayed and the target webpage content abstract on the basis of the display template.
7. The method for generating a summary of web page contents according to any one of claims 1 to 6, wherein when receiving a web page access request, extracting information of the web page access request to obtain user identification information and web page address information specifically includes:
when a webpage access request is received, extracting an identifier of the webpage access request to obtain an information identifier;
and screening the information of the webpage access request according to the information identification to obtain user identification information and webpage address information.
8. A web content digest generation apparatus, characterized by comprising: a memory, a processor and a web content summary generation program stored on the memory and executable on the processor, the web content summary generation program implementing the web content summary generation method according to any one of claims 1 to 7 when executed by the processor.
9. A storage medium having stored thereon a web content digest generation program which, when executed by a processor, implements the web content digest generation method according to any one of claims 1 to 7.
10. A web content digest generation apparatus, comprising: the device comprises an extraction module, a determination module and a generation module;
the extraction module is used for extracting information of the webpage access request when the webpage access request is received, and obtaining user identification information and webpage address information;
the determining module is used for searching a historical browsing record corresponding to the user identification information and determining the current user portrait according to the historical browsing record;
and the generating module is used for determining a webpage to be displayed according to the webpage address information and generating a target webpage content abstract based on the webpage to be displayed and the current user portrait.
CN202010983986.4A 2020-09-18 2020-09-18 Webpage content abstract generation method, equipment, storage medium and device Pending CN112182451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010983986.4A CN112182451A (en) 2020-09-18 2020-09-18 Webpage content abstract generation method, equipment, storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010983986.4A CN112182451A (en) 2020-09-18 2020-09-18 Webpage content abstract generation method, equipment, storage medium and device

Publications (1)

Publication Number Publication Date
CN112182451A true CN112182451A (en) 2021-01-05

Family

ID=73920217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010983986.4A Pending CN112182451A (en) 2020-09-18 2020-09-18 Webpage content abstract generation method, equipment, storage medium and device

Country Status (1)

Country Link
CN (1) CN112182451A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458718A (en) * 2009-01-05 2009-06-17 北京大学 Search engine dynamic summarization extracting method
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same
KR101541170B1 (en) * 2014-10-21 2015-08-03 (주)센솔로지 Apparatus and method for summarizing text
CN106776860A (en) * 2016-11-28 2017-05-31 北京三快在线科技有限公司 One kind search abstraction generating method and device
CN108121802A (en) * 2017-12-22 2018-06-05 东软集团股份有限公司 The thermodynamic analysis method, apparatus and its equipment of web page access
CN108288229A (en) * 2018-03-02 2018-07-17 北京邮电大学 A kind of user's portrait construction method
CN108776676A (en) * 2018-02-02 2018-11-09 腾讯科技(深圳)有限公司 Information recommendation method, device, computer-readable medium and electronic equipment
CN110837556A (en) * 2019-10-30 2020-02-25 深圳价值在线信息科技股份有限公司 Abstract generation method and device, terminal equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969558A (en) * 2022-08-03 2022-08-30 安徽商信政通信息技术股份有限公司 User portrait generation method and system based on user behavior habit analysis
CN116578793A (en) * 2023-07-03 2023-08-11 广州趣米网络科技有限公司 Front-end page design method and system
CN116578793B (en) * 2023-07-03 2024-01-26 广州趣米网络科技有限公司 Front-end page design method and system

Similar Documents

Publication Publication Date Title
CN101542486B (en) Rank graph
CN111444428A (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
US7917514B2 (en) Visual and multi-dimensional search
US8355997B2 (en) Method and system for developing a classification tool
CN112395506A (en) Information recommendation method and device, electronic equipment and storage medium
US8484194B1 (en) Training set construction for taxonomic classification
CN110825957A (en) Deep learning-based information recommendation method, device, equipment and storage medium
CN103136228A (en) Image search method and image search device
CN110888990A (en) Text recommendation method, device, equipment and medium
JP4538760B2 (en) Information processing apparatus and method, program, and recording medium
CN109634436B (en) Method, device, equipment and readable storage medium for associating input method
WO2018171295A1 (en) Method and apparatus for tagging article, terminal, and computer readable storage medium
TWI457775B (en) Method for sorting and managing websites and electronic device of executing the same
CN112182451A (en) Webpage content abstract generation method, equipment, storage medium and device
CN112579893A (en) Information pushing method, information display method, information pushing device, information display device and information display equipment
Aung et al. Random forest classifier for multi-category classification of web pages
JPWO2020095357A1 (en) Search needs evaluation device, search needs evaluation system, and search needs evaluation method
CN113450147A (en) Product matching method, device and equipment based on decision tree and storage medium
CN111950265A (en) Domain lexicon construction method and device
CN111274483A (en) Associated recommendation method and associated recommendation interaction method
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
KR102322212B1 (en) Apparatus and method for recommending learning contents
CN113821596A (en) Information recommendation method and device, computer equipment and storage medium
JP6924450B2 (en) Search needs evaluation device, search needs evaluation system, and search needs evaluation method
CN110147488B (en) Page content processing method, processing device, computing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105