CN116263990A - Landing page feature extraction method, device, equipment and storage medium

Landing page feature extraction method, device, equipment and storage medium

Info

Publication number
CN116263990A
CN116263990A CN202111519476.2A CN202111519476A CN116263990A CN 116263990 A CN116263990 A CN 116263990A CN 202111519476 A CN202111519476 A CN 202111519476A CN 116263990 A CN116263990 A CN 116263990A
Authority
CN
China
Prior art keywords
landing page
multimedia information
information
picture
advertisement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111519476.2A
Other languages
Chinese (zh)
Inventor
於光中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN202111519476.2A
Publication of CN116263990A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a landing page feature extraction method, device, equipment and storage medium, and belongs to the technical field of the Internet. The landing page feature extraction method comprises the following steps: acquiring a system log corresponding to multimedia information; obtaining a corresponding landing page picture according to the system log; performing text recognition on the landing page picture to obtain text information; determining landing page description information corresponding to the landing page picture according to the text information; and generating landing page features according to the landing page description information. By combining these techniques, the landing page features of the multimedia information can be extracted and described more accurately, which improves the conversion rate of the multimedia information.

Description

Landing page feature extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a landing page feature.
Background
With the growth of the internet and the continuous development of electronic commerce, commercial traffic has become an important business for many internet companies. The probability that a user clicks on recommended multimedia information can be predicted through a click-through rate (Click-Through Rate, CTR) model. However, existing model feature engineering extracts only the basic attributes of the user and of the multimedia information and lacks features that specifically describe the landing page of the multimedia information, so the conversion rate of the multimedia information is not high.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the invention is to provide a landing page feature extraction method, device, equipment and storage medium, aiming to solve the technical problem of how to accurately extract descriptive features from the landing page of multimedia information.
In order to achieve the above object, the present invention provides a landing page feature extraction method, including:
acquiring a system log corresponding to multimedia information;
obtaining a corresponding landing page picture according to the system log;
performing text recognition on the landing page picture to obtain text information;
determining landing page description information corresponding to the landing page picture according to the text information;
and generating landing page features according to the landing page description information.
Optionally, the obtaining a corresponding landing page picture according to the system log includes:
determining a plurality of multimedia information identifiers, and the click links corresponding to the multimedia information identifiers, according to the system log;
and obtaining a corresponding landing page picture according to the click link.
Optionally, the obtaining a corresponding landing page picture according to the click link includes:
accessing a corresponding multimedia information page according to the click link;
and performing screen capture based on the multimedia information page to obtain a corresponding landing page picture.
Optionally, the accessing a corresponding multimedia information page according to the click link includes:
calling a preset browser to access the multimedia information page corresponding to the click link;
correspondingly, the performing screen capture based on the multimedia information page to obtain a corresponding landing page picture includes:
calling a preset interface of the preset browser to perform screen capture on the multimedia information page, and obtaining screen capture information corresponding to the multimedia information page;
and obtaining a corresponding landing page picture according to the screen capture information.
Optionally, the determining a plurality of multimedia information identifiers, and the click links corresponding to the multimedia information identifiers, according to the system log includes:
determining a plurality of multimedia information identifiers according to the system log;
detecting whether duplicate identifiers exist among the multimedia information identifiers;
and if no duplicate identifier exists among the multimedia information identifiers, determining the click links corresponding to the multimedia information identifiers according to the system log.
Optionally, after the detecting whether duplicate identifiers exist among the multimedia information identifiers, the method further includes:
if duplicate identifiers exist among the multimedia information identifiers, performing de-duplication on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers;
and determining the click links corresponding to the de-duplicated multimedia information identifiers according to the system log.
Optionally, the performing text recognition on the landing page picture to obtain text information includes:
converting the landing page picture into a to-be-processed landing page picture in a preset format;
and performing text recognition on the to-be-processed landing page picture through a preset text recognition tool to obtain the text information contained in the to-be-processed landing page picture.
Optionally, the determining, according to the text information, landing page description information corresponding to the landing page picture includes:
extracting candidate subject words from the text information through a preset natural language processing tool;
determining landing page subject words according to the candidate subject words;
and taking the landing page subject words as the landing page description information for describing the landing page picture.
Optionally, the generating landing page features according to the landing page description information includes:
traversing the multimedia information identifiers, and taking the traversed multimedia information identifier as a current identifier;
and splicing the landing page description information corresponding to the current identifier to obtain the landing page feature.
Optionally, after the generating landing page features according to the landing page description information, the method further includes:
generating training samples according to the landing page features;
training a preset click-through rate model according to the training samples to obtain a target click-through rate model;
and predicting the probability that a user clicks the multimedia information through the target click-through rate model.
Optionally, the training a preset click-through rate model according to the training samples to obtain a target click-through rate model includes:
aggregating the training samples in the embedding dimension to obtain aggregated training samples;
and training the preset click-through rate model according to the aggregated training samples to obtain the target click-through rate model.
Optionally, the generating training samples according to the landing page features includes:
extracting multimedia information side features and user side features from the system log;
generating an initial sample according to the multimedia information side features and the user side features;
and generating a training sample according to the landing page features and the initial sample.
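As an illustration of the sample generation steps above, the following is a minimal sketch, assuming that features are held in plain dictionaries. The function name, the field names and the click label are hypothetical and are not taken from the original text.

```python
def build_training_sample(ad_features, user_features, landing_page_feature, clicked):
    """Combine an initial sample with a landing page feature (illustrative only)."""
    # Initial sample: multimedia-information-side features plus user-side features.
    initial_sample = {**ad_features, **user_features}
    # Training sample: the initial sample extended with the landing page feature.
    sample = dict(initial_sample)
    sample["landing_page_feature"] = landing_page_feature
    # Assumption: the click record in the system log is used as the label.
    sample["label"] = 1 if clicked else 0
    return sample


sample = build_training_sample(
    {"ad_id": "ad_1"}, {"user_region": "north"}, "loan apply", clicked=True
)
```

How the resulting samples are encoded and aggregated in the embedding dimension depends on the chosen click-through rate model and is not shown here.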
In addition, in order to achieve the above object, the present invention also proposes a landing page feature extraction device, including:
the system log module is used for acquiring a system log corresponding to the multimedia information;
the landing page picture module is used for obtaining a corresponding landing page picture according to the system log;
the text recognition module is used for performing text recognition on the landing page picture to obtain text information;
the description information module is used for determining landing page description information corresponding to the landing page picture according to the text information;
and the feature construction module is used for generating landing page features according to the landing page description information.
Optionally, the landing page picture module is further configured to determine a plurality of multimedia information identifiers, and the click links corresponding to the multimedia information identifiers, according to the system log; and to obtain a corresponding landing page picture according to the click link.
Optionally, the landing page picture module is further configured to access a corresponding multimedia information page according to the click link; and to perform screen capture based on the multimedia information page to obtain a corresponding landing page picture.
Optionally, the landing page picture module is further configured to call a preset browser to access the multimedia information page corresponding to the click link; to call a preset interface of the preset browser to perform screen capture on the multimedia information page, obtaining screen capture information corresponding to the multimedia information page; and to obtain a corresponding landing page picture according to the screen capture information.
Optionally, the landing page picture module is further configured to determine a plurality of multimedia information identifiers according to the system log; to detect whether duplicate identifiers exist among the multimedia information identifiers; and, if no duplicate identifier exists, to determine the click links corresponding to the multimedia information identifiers according to the system log.
Optionally, the landing page picture module is further configured to, if duplicate identifiers exist among the multimedia information identifiers, perform de-duplication on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers; and to determine the click links corresponding to the de-duplicated multimedia information identifiers according to the system log.
In addition, in order to achieve the above object, the present invention also proposes landing page feature extraction equipment, which includes: a memory, a processor, and a landing page feature extraction program that is stored in the memory and can run on the processor, wherein the landing page feature extraction program, when executed by the processor, implements the landing page feature extraction method described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a landing page feature extraction program which, when executed by a processor, implements the landing page feature extraction method as described above.
In the landing page feature extraction method provided by the invention, a system log corresponding to multimedia information is acquired; a corresponding landing page picture is obtained according to the system log; text recognition is performed on the landing page picture to obtain text information; landing page description information corresponding to the landing page picture is determined according to the text information; and landing page features are generated according to the landing page description information. By combining these techniques, the landing page features of the multimedia information can be extracted and described more accurately, which improves the conversion rate of the multimedia information.
Drawings
FIG. 1 is a schematic structural diagram of a landing page feature extraction device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a method for extracting features of a landing page according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a landing page feature extraction method according to the present invention;
FIG. 4 is a flowchart illustrating a third embodiment of a landing page feature extraction method according to the present invention;
FIG. 5 is a schematic functional block diagram of a first embodiment of the landing page feature extraction device of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a landing page feature extraction device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the landing page feature extraction apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display and an input unit such as keys; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the apparatus structure shown in fig. 1 does not constitute a limitation of the landing page feature extraction apparatus, and may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a landing page feature extraction program may be included in the memory 1005 as one type of storage medium.
In the landing page feature extraction apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to an external network and performing data communication with other network apparatuses; the user interface 1003 is mainly used for connecting user equipment and communicating data with the user equipment; the apparatus of the present invention calls the landing page feature extraction program stored in the memory 1005 through the processor 1001, and executes the landing page feature extraction method provided by the embodiment of the present invention.
Based on the above hardware structure, an embodiment of the invention provides a landing page feature extraction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a landing page feature extraction method according to the present invention.
In a first embodiment, the landing page feature extraction method includes:
Step S10, acquiring a system log corresponding to the multimedia information.
It should be noted that the execution body of this embodiment may be a landing page feature extraction device. The device may be a computer device with a data processing function, or another device that can implement the same or similar functions; this embodiment does not limit it, and a computer device is used as the example.
It should be noted that, the multimedia information in this embodiment may include, but is not limited to, advertisement, video, picture, audio, and other multimedia information, and this embodiment is not limited thereto, and in this embodiment, the multimedia information is taken as an advertisement for illustration. Accordingly, the system log corresponding to the multimedia information in the embodiment may be an advertisement system log corresponding to the advertisement. The advertisement may include, but is not limited to, various types of advertisements such as a picture advertisement, a text advertisement, a keyword advertisement, a ranking advertisement, and a video advertisement.
It should be appreciated that, with the growth of the internet and the continued development of electronic commerce, commercial traffic has become an important business for many internet companies. Computational advertising is receiving widespread attention as an emerging interdisciplinary field. It draws on a wide range of theories and techniques, including advertising, information retrieval, text analysis, statistical modeling, machine learning and microeconomics. Computational advertising refers to using big data technology to deliver specific advertisements to specific internet audiences, and it has long been an important means of monetization for internet companies. CTR is a main research topic in the field of computational advertising: the click-through rate of online advertisements can be estimated through a CTR model, so the CTR model plays a vital role in internet advertising services.
It should be appreciated that click-through rate estimation is typically used to determine the probability that a user clicks on a recommended advertisement. The advertising system typically recommends advertisements to the user based on the predictions of the CTR model, together with factors such as advertiser bids. A good CTR model can maximize the revenue of the advertising system and also give users a good experience. Modern advertising systems are broadly divided into search advertising and display advertising, of which search advertising is the largest and fastest-growing form; in both forms, however, the CTR model is one of the most critical technologies.
It should be appreciated that CTR models have developed from linear models to deep models, and at both stages model features must be extracted in advance. Deep models have better feature-crossing capability than linear models. However, existing model feature engineering extracts only the basic attributes of users and advertisements and lacks features that specifically describe the advertisement landing page. The present scheme therefore mainly addresses how to extract features of the advertisement landing page, obtaining more accurate information to describe the landing page and thereby improving the conversion rate of the advertisement.
It should be noted that the scheme can be divided into five main parts: the advertisement system log, advertisement landing page screenshot capture, text recognition, subject word extraction, and feature construction with sample generation. The advertisement system log records the advertisement data recommended by the system each day and contains information in multiple dimensions, such as impressions, clicks and spend. The initial sample of the CTR model is obtained by extracting the corresponding advertisement-side features and user-side features from the system log, and is then used for model training.
Therefore, the multimedia information recommended by the system each day can be recorded in advance to obtain the corresponding system log, which is stored in a database; when landing page feature extraction is performed for the multimedia information, the corresponding system log is read from the database.
Step S20, obtaining a corresponding landing page picture according to the system log.
It should be noted that a landing page is the first page a user reaches after clicking an advertisement creative or link. The vast majority of landing pages are used for marketing or advertising campaigns; in search advertising, the landing page is the page displayed to a potential user after clicking an advertisement in a search engine.
It should be understood that the advertisement landing page picture corresponding to each advertisement contained in the system log can be obtained according to the system log. To make the advertisements easy to distinguish, a corresponding identifier, i.e. an advertisement identifier, can be set for each advertisement. The identifier in this embodiment may be an ID, or another identifier that achieves the same or similar function; this embodiment does not limit it, and the advertisement ID is used as the example.
It can be understood that the advertisements recommended by the system can be determined according to the system log, the advertisement IDs corresponding to the advertisements can be obtained, meanwhile, the clicking links corresponding to the advertisement IDs can be determined according to the system log, the advertisement landing page corresponding to the advertisements can be accessed by accessing the clicking links, and then the landing page pictures corresponding to the advertisement landing page can be intercepted.
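The mapping from advertisement IDs to click links, including removal of duplicate IDs, can be sketched as follows. This is only an illustration under stated assumptions: the log is assumed to be JSON lines with hypothetical fields `ad_id` and `click_url`, and the screenshot step is assumed to use the headless mode of a Chromium-based browser, whose binary name (`chromium` here) varies by installation. The original text does not name a log format or a specific browser.

```python
import json


def click_links_by_ad_id(log_lines):
    """Map each advertisement ID to its click link, dropping duplicate IDs."""
    links = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep the first link seen for an ID; repeated IDs are ignored.
        links.setdefault(record["ad_id"], record["click_url"])
    return links


def screenshot_command(click_url, output_path, browser="chromium"):
    """Build (but do not run) a headless-browser screenshot command."""
    return [
        browser,
        "--headless",
        f"--screenshot={output_path}",
        "--window-size=1280,2000",
        click_url,
    ]


log = [
    '{"ad_id": "ad_1", "click_url": "https://example.com/a"}',
    '{"ad_id": "ad_2", "click_url": "https://example.com/b"}',
    '{"ad_id": "ad_1", "click_url": "https://example.com/a"}',
]
links = click_links_by_ad_id(log)
commands = [screenshot_command(url, f"{ad_id}.png") for ad_id, url in links.items()]
```

Each command list could then be passed to `subprocess.run` to produce one landing page picture per advertisement ID.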
Step S30, performing text recognition on the landing page picture to obtain text information.
It should be appreciated that after the landing page picture corresponding to the advertisement landing page of each advertisement is obtained, text recognition may be performed on the landing page picture, so that text information contained in the landing page picture is recognized.
It can be understood that text recognition can be performed on the landing page picture through a preset text recognition tool, so as to obtain the text information contained in the landing page picture. The preset text recognition tool may include, but is not limited to, optical character recognition (Optical Character Recognition, OCR) software, which is not limited in this embodiment.
It will be appreciated that after the landing page picture is obtained, the landing page picture may be passed into an OCR service, through which text information contained in the landing page picture is identified.
It should be understood that although the OCR service may accept various formats of pictures, because of different efficiency in performing text recognition on pictures of different formats by the OCR service, in order to achieve better text recognition efficiency, the landing page picture may be converted into a to-be-processed landing page picture of a preset format, and then text recognition is performed on the to-be-processed landing page picture by a preset text recognition tool, so as to obtain text information included in the to-be-processed landing page picture.
It should be noted that the preset format in this embodiment may include, but is not limited to, the Base64 encoding format. The landing page picture can be converted into a to-be-processed landing page picture in Base64 encoding and then passed to the OCR service; the text content recognized in the picture is then stored on a local disk, which optimizes the efficiency of the whole flow.
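The Base64 conversion can be illustrated with the short sketch below. The request layout (a JSON body with `ad_id` and `image` fields) is a hypothetical example; the original text does not specify the interface of the OCR service.

```python
import base64
import json


def build_ocr_request(image_bytes, ad_id):
    """Encode a landing page picture as Base64 and wrap it in a JSON request body."""
    # b64encode returns bytes; decode to ASCII so the value can be placed in JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Hypothetical payload layout, for illustration only.
    return json.dumps({"ad_id": ad_id, "image": encoded})


# Stand-in bytes; a real call would read the screenshot file in binary mode.
payload = build_ocr_request(b"\x89PNG\r\n\x1a\n", "ad_1")
```

Encoding the picture as text lets it travel inside an ordinary JSON request, at the cost of roughly one third more bytes than the raw image.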
Step S40, determining landing page description information corresponding to the landing page picture according to the text information.
It should be appreciated that after determining the text information contained in the landing page picture, landing page description information corresponding to the landing page picture may be determined according to the text information. The landing page description information may include, but is not limited to, a landing page subject word, which is described as an example in the present embodiment.
Further, in order to obtain the landing page description information corresponding to the landing page picture more accurately and make it more targeted, the determining landing page description information corresponding to the landing page picture according to the text information includes:
extracting candidate subject words from the text information through a preset natural language processing tool; determining a subject word of a landing page according to the candidate subject word; and taking the landing page subject words as landing page description information for describing the landing page pictures.
It should be noted that the preset natural language processing tool in this embodiment may be a natural language processing (Natural Language Processing, NLP) model. The NLP model may include, but is not limited to, the TF-IDF algorithm, the word2vec algorithm, the LDA topic model and the like, all of which can implement subject word extraction; other models that implement the same or similar functions may also be used, which is not limited in this embodiment. The TF-IDF algorithm is preferred for extracting subject words in this embodiment because it is simpler than the other models and algorithms.
In a specific implementation, after the text information of the landing page picture corresponding to all the advertisement IDs is extracted, candidate subject terms may be extracted from the text information by TF-IDF algorithm, and then the landing page subject terms corresponding to the landing page picture may be determined according to the corresponding relationship between the text information and the landing page picture and the candidate subject terms, where the number of subject terms in this embodiment may be one or more, which is not limited in this embodiment.
It is understood that after determining the landing page subject word corresponding to the landing page picture, the landing page subject word may be used as landing page description information for describing the landing page picture, so that a final landing page subject word may be generated for each advertisement ID to describe the content of the landing page thereof.
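A minimal sketch of TF-IDF subject word extraction is given below, using only the standard library. It assumes that the recognized text has already been split into tokens (Chinese text would first need a word segmenter, which is not shown), and it uses one common TF-IDF variant: term frequency divided by document length, multiplied by log(N / document frequency). The function name and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def top_subject_words(tokens_by_ad, k=3):
    """Return the k highest-scoring TF-IDF words for each advertisement ID."""
    n_docs = len(tokens_by_ad)
    # Document frequency: in how many landing pages each word appears.
    doc_freq = Counter()
    for tokens in tokens_by_ad.values():
        doc_freq.update(set(tokens))

    result = {}
    for ad_id, tokens in tokens_by_ad.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[ad_id] = ranked[:k]
    return result


subject_words = top_subject_words(
    {"ad_1": ["loan", "loan", "apply", "now"], "ad_2": ["game", "download", "now"]},
    k=2,
)
```

Words that appear on every landing page, such as "now" in this toy input, receive a score of zero and therefore rank last.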
Step S50, generating landing page features according to the landing page description information.
It should be understood that after determining the landing page description information of the landing page corresponding to each advertisement ID, the advertisement ID may be traversed, and the traversed advertisement ID is used as the current identifier, then all the landing page description information corresponding to the current identifier is acquired, and then the landing page description information corresponding to the current identifier is spliced to obtain the landing page feature corresponding to the current identifier. And then taking the traversed next advertisement ID as the current identifier, and repeating the steps to obtain the landing page characteristics corresponding to the current identifier until all the advertisement IDs are traversed, so that each advertisement ID has the corresponding landing page characteristics.
It is understood that, since the landing page description information in the present embodiment may be a landing page subject word, the stitching of the landing page description information may be substantially stitching the landing page subject word.
In a specific implementation, assume there are three advertisement IDs: advertisement 1, advertisement 2 and advertisement 3. During traversal, advertisement 1 is first taken as the current identifier, and all landing page description information corresponding to advertisement 1 is spliced to obtain the landing page feature corresponding to advertisement 1. The traversal then continues with advertisement 2 as the current identifier, and all landing page description information corresponding to advertisement 2 is spliced to obtain the landing page feature corresponding to advertisement 2. Finally, advertisement 3 is taken as the current identifier, and all landing page description information corresponding to advertisement 3 is spliced to obtain the landing page feature corresponding to advertisement 3.
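The traversal and splicing can be sketched as follows, assuming that the description information for each advertisement ID is a list of subject words and that splicing means joining them with a separator. The separator and the function name are hypothetical choices.

```python
def build_landing_page_features(subject_words_by_ad, separator=" "):
    """Splice the subject words of each advertisement ID into one feature string."""
    features = {}
    # Traverse the advertisement IDs; each one in turn is the current identifier.
    for current_id in subject_words_by_ad:
        features[current_id] = separator.join(subject_words_by_ad[current_id])
    return features


features = build_landing_page_features(
    {
        "advertisement 1": ["loan", "apply"],
        "advertisement 2": ["game", "download"],
        "advertisement 3": ["course"],
    }
)
```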
It should be noted that the present solution is a combinational innovation: its beneficial effects are achieved by combining multiple technologies. Click log processing, automatic web page screenshots, real-time OCR recognition, and text subject word extraction are all core parts of the solution. In its details, the solution optimizes the efficiency and accuracy of automatic web page screenshots and OCR recognition, and combines these technologies to accurately extract and describe the features of the advertisement landing page. The solution can describe the landing page features of an advertisement more accurately, and these features characterize the probability that an effective conversion occurs after a user clicks a given advertisement. After the features are added to the model, the prediction metrics of the model improve to a certain extent, which improves the recommendation accuracy and revenue of the whole advertisement system.
In this embodiment, a system log corresponding to the multimedia information is obtained; a corresponding landing page picture is obtained according to the system log; character recognition is performed on the landing page picture to obtain text information; landing page description information corresponding to the landing page picture is determined according to the text information; and landing page features are generated according to the landing page description information. By combining these technologies, the landing page features of the multimedia information can be extracted and described more accurately, which helps improve the conversion rate of the multimedia information.
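The five steps summarized above can be sketched as a single pipeline. This is only an outline under stated assumptions: the record fields `ad_id` and `click_url`, the function name, and the three injected callables are hypothetical placeholders for the screenshot, OCR, and subject word stages, which this description implements with a browser, an OCR service, and an NLP tool respectively.

```python
def extract_landing_page_features(log_records, take_screenshot,
                                  recognize_text, extract_subject_words):
    """Run log -> picture -> text -> subject words -> feature for each ad."""
    # Step 1: one click link per ad ID, taken from the system log.
    links = {}
    for record in log_records:
        links.setdefault(record["ad_id"], record["click_url"])

    features = {}
    for ad_id, url in links.items():
        picture = take_screenshot(url)        # Step 2: landing page picture
        text = recognize_text(picture)        # Step 3: character recognition
        words = extract_subject_words(text)   # Step 4: description information
        features[ad_id] = " ".join(words)     # Step 5: landing page feature
    return features


# Stand-in stages so the flow can be exercised without a browser or OCR service.
demo_features = extract_landing_page_features(
    [{"ad_id": "ad_1", "click_url": "https://example.com/a"}],
    take_screenshot=lambda url: b"fake-picture-bytes",
    recognize_text=lambda picture: "low rate loan",
    extract_subject_words=lambda text: text.split()[-1:],
)
print(demo_features)  # {'ad_1': 'loan'}
```

Passing the stages in as callables keeps the outline independent of any particular browser, OCR service, or subject word algorithm.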
In an embodiment, as shown in fig. 3, a second embodiment of the landing page feature extraction method according to the present invention is proposed based on the first embodiment, and the step S20 includes:
Step S201, determining, according to the system log, a plurality of multimedia information identifiers and the click link corresponding to each multimedia information identifier.
It should be understood that the click link corresponding to each advertisement ID may be extracted from the daily advertisement system log. However, since one advertisement may be recommended and displayed multiple times, the advertisement IDs may be de-duplicated to achieve a better processing effect and to ensure that each advertisement ID corresponds to one specific click link. Accordingly, determining, according to the system log, a plurality of multimedia information identifiers and the click link corresponding to each multimedia information identifier includes:
Determining a plurality of multimedia information identifiers according to the system log; detecting whether duplicate identifiers exist among the multimedia information identifiers; if no duplicate identifier exists, determining the click link corresponding to each multimedia information identifier according to the system log; if a duplicate identifier exists, performing de-duplication processing on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers, and determining the click link corresponding to each de-duplicated multimedia information identifier according to the system log.
It can be understood that, firstly, advertisement IDs corresponding to recommended advertisements can be determined according to the system log, then whether repeated IDs exist in the advertisement IDs is detected, a detection result is obtained, and different control strategies are adopted according to different detection results.
It can be understood that if the detection result is that no duplicate ID exists among the advertisement IDs, the click links corresponding to the advertisement IDs may be extracted directly from the system log. If the detection result is that duplicate IDs exist among the advertisement IDs, the duplicate IDs are removed to obtain the de-duplicated advertisement IDs, and the click links corresponding to the de-duplicated advertisement IDs are then extracted from the system log.
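A minimal sketch of this de-duplication, assuming each log record is a dictionary with hypothetical `ad_id` and `click_url` fields (the actual log schema is not given in this description). Keeping the first link seen for each ID is also an assumption; it handles the duplicate and no-duplicate cases in one pass.

```python
def click_links_by_ad_id(log_records):
    """Return one click link per ad ID, keeping the first link seen."""
    links = {}
    for record in log_records:
        # An ad shown many times appears many times in the log;
        # setdefault keeps only the first click link for each ID.
        links.setdefault(record["ad_id"], record["click_url"])
    return links


log_records = [
    {"ad_id": "ad_1", "click_url": "https://example.com/a"},
    {"ad_id": "ad_2", "click_url": "https://example.com/b"},
    {"ad_id": "ad_1", "click_url": "https://example.com/a"},  # repeated display
]
links = click_links_by_ad_id(log_records)
print(links)
# {'ad_1': 'https://example.com/a', 'ad_2': 'https://example.com/b'}
```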
Step S202, obtaining a corresponding landing page picture according to the click link.
It can be understood that the advertisement landing page corresponding to an advertisement ID can be accessed through the click link, and the landing page picture corresponding to that advertisement landing page can then be obtained by screen capture. However, the number of click links obtained from the advertisement system log is large, and the click links corresponding to newly appearing advertisement IDs must be obtained incrementally every day. With this amount of data, obtaining landing page pictures manually would consume considerable labor cost. Therefore, to save labor cost and improve the efficiency of obtaining landing page pictures, the landing page picture corresponding to each advertisement landing page can be obtained through programmed automatic processing.
Further, in order to automatically acquire the landing page picture, a preset browser can be called to access a multimedia information page corresponding to the click link, a preset interface of the preset browser is called to perform screen capturing processing based on the multimedia information page, screen capturing information corresponding to the multimedia information page is obtained, and the corresponding landing page picture is obtained according to the screen capturing information.
It should be noted that the preset browser in this embodiment may include, but is not limited to, the Google Chrome browser; other browsers may also be used, which is not limited in this embodiment. In this embodiment, the preset browser is described as the Google Chrome browser by way of example. Accordingly, the preset interface in this embodiment may be an API of the Google Chrome browser that provides a screen capture function.
It can be understood that chromedriver is the driver of the Google Chrome browser. By installing the Selenium framework, the Google Chrome browser can be called from a program, so that the click link of an advertisement can be accessed automatically and the advertisement landing page corresponding to the advertisement can be entered. At the same time, the API of the Google Chrome browser that provides a screen capture function is called to capture the advertisement landing page, obtaining the screen capture information corresponding to the advertisement landing page, and the landing page picture corresponding to the advertisement is obtained from the screen capture information.
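The automated screen capture might look like the sketch below, which uses Selenium's Python bindings (`webdriver.Chrome`, `driver.get`, `driver.save_screenshot`, `driver.quit`). The function names, the headless flag, the window size, and the `<ad_id>.png` naming scheme are assumptions chosen for illustration, and running `make_chrome_driver` requires Selenium, Chrome, and a matching chromedriver to be installed.

```python
import os


def make_chrome_driver():
    """Create a headless Chrome driver (needs selenium and chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1280,2000")
    return webdriver.Chrome(options=options)


def capture_landing_pages(links, out_dir, driver_factory=make_chrome_driver):
    """Visit each click link and save a screenshot named after the ad ID."""
    driver = driver_factory()
    saved = {}
    try:
        for ad_id, url in links.items():
            driver.get(url)
            path = os.path.join(out_dir, f"{ad_id}.png")
            # save_screenshot returns False if the file could not be written.
            if driver.save_screenshot(path):
                saved[ad_id] = path
    finally:
        driver.quit()  # always release the browser process
    return saved
```

Note that `save_screenshot` captures the current browser window rather than the whole scrollable page, so the window size determines how much of the landing page reaches the OCR step. The `driver_factory` parameter lets a stand-in driver be supplied when no browser is available.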
In this embodiment, a plurality of multimedia information identifiers and the click link corresponding to each multimedia information identifier are determined according to the system log, and the corresponding landing page pictures are obtained according to the click links. The landing page picture of the multimedia information landing page corresponding to each multimedia information identifier can thus be obtained automatically, which improves the efficiency of data processing and, in turn, the efficiency of extracting the landing page features.
In an embodiment, as shown in fig. 4, a third embodiment of the landing page feature extraction method according to the present invention is provided based on the first embodiment or the second embodiment. In this embodiment, after the step S50, the method further includes:
And step S60, generating a training sample according to the landing page characteristics.
It should be understood that after the landing page features corresponding to the landing page of the multimedia information are obtained, a training sample may be generated according to the landing page features, and then a preset click through rate model may be trained according to the training sample, so as to obtain a target click through rate model that may be used for estimating the click rate of the advertisement.
Further, in order to achieve a better model training effect, the training sample may be generated by combining the landing page features with other features in the system log. Accordingly, generating the training sample according to the landing page features includes:
extracting multimedia information side characteristics and user side characteristics from the system log; generating an initial sample according to the multimedia information side characteristics and the user side characteristics; and generating a training sample according to the landing page characteristics and the initial sample.
It can be understood that the multimedia information side features in this embodiment may be advertisement side features, advertisement side features and user side features may be extracted from the advertisement system log, an initial sample is generated according to the advertisement side features and the user side features, and then the landing page features are added to the initial sample on the basis of the initial sample, so as to obtain a training sample.
It can be appreciated that after the landing page subject word of the advertisement landing page is obtained, the landing page subject word can be spliced into the initial sample by using the advertisement ID as a key, so as to obtain a training sample for model training.
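Splicing the landing page subject words into the initial sample with the advertisement ID as the key can be sketched as a simple lookup join. The sample field names (`ad_id`, `user_age`, `ad_category`, `clicked`, `landing_page_words`) and the empty-string default for advertisements without a landing page feature are assumptions made for this example.

```python
def add_landing_page_features(initial_samples, landing_page_features, default=""):
    """Attach the landing page feature to each initial sample, keyed by ad ID."""
    training_samples = []
    for sample in initial_samples:
        enriched = dict(sample)  # copy so the initial sample is left untouched
        enriched["landing_page_words"] = landing_page_features.get(
            sample["ad_id"], default
        )
        training_samples.append(enriched)
    return training_samples


initial_samples = [
    {"ad_id": "ad_1", "user_age": 30, "ad_category": "finance", "clicked": 1},
    {"ad_id": "ad_9", "user_age": 22, "ad_category": "games", "clicked": 0},
]
landing_page_features = {"ad_1": "loan credit"}
training_samples = add_landing_page_features(initial_samples, landing_page_features)
print(training_samples[0]["landing_page_words"])  # loan credit
```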
And step S70, training a preset click through rate model according to the training sample to obtain a target click through rate model.
It should be noted that, the preset click through rate model in this embodiment may be a preset CTR model, and after a training sample is obtained, the CTR model may be trained according to the training sample to obtain a target CTR model capable of accurately predicting the click through rate.
It should be appreciated that, because the landing page subject words may be a plurality of phrases, they form a multi-valued feature that can be aggregated in the embedding dimension during model training, ultimately optimizing the prediction effect of the model. Therefore, the training samples can be aggregated in the embedding dimension to obtain aggregated training samples, and the CTR model is then trained according to the aggregated training samples to obtain the target CTR model.
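One common way to aggregate a multi-valued feature in the embedding dimension is to pool the embedding vectors of its values into a single fixed-length vector. The sketch below uses mean pooling; this description does not state which aggregation is used, so the pooling choice, the function name, and the toy embedding table are all assumptions.

```python
def pool_subject_word_embeddings(words, embedding_table, dim):
    """Mean-pool the embeddings of a multi-valued feature into one vector."""
    vectors = [embedding_table[w] for w in words if w in embedding_table]
    if not vectors:
        return [0.0] * dim  # no known subject word: fall back to a zero vector
    # Average each embedding dimension across all subject words.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


embedding_table = {
    "loan": [1.0, 0.0],
    "credit": [0.0, 1.0],
}
pooled = pool_subject_word_embeddings(["loan", "credit"], embedding_table, dim=2)
print(pooled)  # [0.5, 0.5]
```

Pooling gives every advertisement a vector of the same length regardless of how many subject words its landing page produced, which is what allows the feature to be fed to the model alongside single-valued features.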
And S80, predicting the probability of clicking the multimedia information by a user through the target click through rate model.
It can be understood that after the target CTR model is obtained, the advertisement click rate can be predicted through the target CTR model to obtain a prediction result, and the probability of the user clicking the advertisement is determined according to the prediction result. In this solution, adding the landing page features in the model training process can improve the accuracy of the model's prediction result, so that more targeted advertisement recommendation can be performed according to the prediction result, improving the advertisement recommendation effect of the advertisement system.
In this embodiment, a training sample is generated according to the landing page feature; a preset click through rate model is trained according to the training sample to obtain a target click through rate model; and the probability of the user clicking the multimedia information is predicted through the target click through rate model. Introducing the landing page features in the model training process can therefore improve the accuracy of the model and of its prediction result, achieving a better multimedia information recommendation effect.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a landing page feature extraction program, and the landing page feature extraction program realizes the steps of the landing page feature extraction method when being executed by a processor.
Because the storage medium adopts all the technical schemes of all the embodiments, the storage medium has at least all the beneficial effects brought by the technical schemes of the embodiments, and the description is omitted here.
In addition, referring to fig. 5, an embodiment of the present invention further provides a landing page feature extraction device, where the landing page feature extraction device includes:
the system log module 10 is configured to obtain a system log corresponding to the multimedia information.
It should be noted that, the multimedia information in this embodiment may include, but is not limited to, advertisement, video, picture, audio, and other multimedia information, and this embodiment is not limited thereto, and in this embodiment, the multimedia information is taken as an advertisement for illustration. Accordingly, the system log corresponding to the multimedia information in the embodiment may be an advertisement system log corresponding to the advertisement. The advertisement may include, but is not limited to, various types of advertisements such as a picture advertisement, a text advertisement, a keyword advertisement, a ranking advertisement, and a video advertisement.
It should be appreciated that with the growth of the internet and the continued development of electronic commerce, commercial traffic has become an important business for many internet companies. Within this area, computational advertising is receiving widespread attention as an emerging interdisciplinary field. Computational advertising draws on a wide range of theories and techniques, including advertising, information retrieval, text analysis, statistical models, machine learning, and microeconomics. Computational advertising refers to using big data technology to deliver specific advertisements to specific internet audiences, and it has long been an important means of monetization for internet companies. CTR is a main research topic in the field of computational advertising: the click rate of online advertisements can be estimated through a CTR model, so the CTR model plays a vital role in internet advertisement services.
It should be appreciated that click rate estimation is typically used to determine the probability that a user clicks a recommended advertisement. The advertisement system typically recommends advertisements to the user based on the prediction results of the CTR model, together with factors such as advertiser bidding. A good CTR model can create maximum benefit for the advertising system and can also bring a good experience to users. Modern advertising systems are largely divided into search advertisements and display advertisements, of which search advertising is the largest and fastest growing form; in both forms, however, the CTR model is one of the most critical technologies.
It should be appreciated that the CTR model has developed from linear models to deep models, and both stages require model features to be extracted in advance. Existing model feature engineering extracts basic attributes of users and advertisements. Deep models have better feature crossing capability than linear models, but existing model feature engineering lacks features that specifically describe the advertisement landing page. Therefore, this solution mainly addresses how to extract the features of the advertisement landing page and obtain more accurate information to describe it, so that the conversion rate of the advertisement is improved.
It should be noted that the solution can be divided into five main parts: the advertisement system log, advertisement landing page picture capture, text recognition, subject word extraction, and feature construction with sample generation. The advertisement system log is a log obtained by recording the advertisement data recommended by the system each day, and it contains information in multiple dimensions such as display, click, and consumption. The initial sample of the CTR model is obtained by extracting the corresponding advertisement side features and user side features from the system log, and is then used for model training.
Therefore, by recording in advance the multimedia information recommended by the system every day, the system log corresponding to the multimedia information can be obtained and stored in a database, and the corresponding system log is retrieved from the database when landing page feature extraction is performed on the multimedia information.
And the landing page picture module 20 is configured to obtain a corresponding landing page picture according to the system log.
It should be noted that the landing page refers to the first page that a user enters by clicking an advertisement material or link. The vast majority of landing pages are used for marketing or advertising campaigns; for example, a landing page is the page displayed to a potential user after the user clicks an advertisement shown by a search engine.
It should be understood that, according to the system log, the advertisement landing page picture corresponding to each advertisement contained in the system log can be obtained. To make the advertisements easy to distinguish, a corresponding identifier, i.e. an advertisement identifier, can be set for each advertisement. The identifier in this embodiment may be an ID, or another identifier that achieves the same or a similar function, which is not limited in this embodiment; in this embodiment, the advertisement identifier is described as an advertisement ID by way of example.
It can be understood that the advertisements recommended by the system can be determined according to the system log, the advertisement IDs corresponding to the advertisements can be obtained, meanwhile, the clicking links corresponding to the advertisement IDs can be determined according to the system log, the advertisement landing page corresponding to the advertisements can be accessed by accessing the clicking links, and then the landing page pictures corresponding to the advertisement landing page can be intercepted.
And the character recognition module 30 is used for carrying out character recognition on the landing page picture to obtain character information.
It should be appreciated that after the landing page picture corresponding to the advertisement landing page of each advertisement is obtained, text recognition may be performed on the landing page picture, so that text information contained in the landing page picture is recognized.
It can be understood that the text recognition can be performed on the landing page picture through a preset text recognition tool, so that text information contained in the landing page picture is obtained. The preset character recognition tool may include, but is not limited to, optical character recognition (Optical Character Recognition, OCR) character recognition software, which is not limited in this embodiment.
It will be appreciated that after the landing page picture is obtained, the landing page picture may be passed into an OCR service, through which text information contained in the landing page picture is identified.
It should be understood that although the OCR service may accept various formats of pictures, because of different efficiency in performing text recognition on pictures of different formats by the OCR service, in order to achieve better text recognition efficiency, the landing page picture may be converted into a to-be-processed landing page picture of a preset format, and then text recognition is performed on the to-be-processed landing page picture by a preset text recognition tool, so as to obtain text information included in the to-be-processed landing page picture.
It should be noted that the preset format in this embodiment may include, but is not limited to, the Base64 encoding format, which is not limited in this embodiment. The landing page picture can be converted into a to-be-processed landing page picture in the Base64 encoding format, and the to-be-processed landing page picture is then passed to the OCR service, so that the text content information contained in the to-be-processed landing page picture is obtained and stored on a local disk, optimizing the efficiency of the whole flow.
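The Base64 conversion can be sketched with Python's standard `base64` module. The request layout returned by `build_ocr_request` is purely hypothetical, since this description does not specify the interface of the OCR service; only the encoding step itself reflects the text above.

```python
import base64


def encode_picture_base64(image_bytes):
    """Encode raw picture bytes as an ASCII Base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_ocr_request(ad_id, image_bytes):
    """Build a request body for an OCR service (hypothetical layout)."""
    return {"ad_id": ad_id, "image": encode_picture_base64(image_bytes)}


# The 8-byte PNG signature stands in for a real screenshot file here.
png_signature = b"\x89PNG\r\n\x1a\n"
payload = build_ocr_request("ad_1", png_signature)
print(payload["image"])  # iVBORw0KGgo=
```

In practice the bytes would come from reading the saved landing page picture in binary mode, for example `open(path, "rb").read()`.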
And the description information module 40 is configured to determine landing page description information corresponding to the landing page picture according to the text information.
It should be appreciated that after determining the text information contained in the landing page picture, landing page description information corresponding to the landing page picture may be determined according to the text information. The landing page description information may include, but is not limited to, a landing page subject word, which is described as an example in the present embodiment.
Further, in order to more accurately obtain the description information of the landing page corresponding to the landing page picture, the description information module 40 is further configured to extract candidate subject words from the text information through a preset natural language processing tool; determining a subject word of a landing page according to the candidate subject word; and taking the landing page subject words as landing page description information for describing the landing page pictures.
It should be noted that the preset natural language processing tool in this embodiment may be a natural language processing (Natural Language Processing, NLP) model, where the NLP model may include, but is not limited to, the TF-IDF algorithm, the word2vec algorithm, the LDA topic model, and the like, all of which can implement subject word extraction. Other models that implement the same or similar functions may also be used, which is not limited in this embodiment. The TF-IDF algorithm is preferred for extracting the subject words in this embodiment because it is simpler than the other models and algorithms.
In a specific implementation, after the text information of the landing page picture corresponding to all the advertisement IDs is extracted, candidate subject terms may be extracted from the text information by TF-IDF algorithm, and then the landing page subject terms corresponding to the landing page picture may be determined according to the corresponding relationship between the text information and the landing page picture and the candidate subject terms, where the number of subject terms in this embodiment may be one or more, which is not limited in this embodiment.
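Subject word extraction with TF-IDF can be sketched in plain Python as below. The sketch assumes the OCR text has already been split into tokens (Chinese text would need a word segmenter first, which is outside this example), uses the unsmoothed variant tf × log(N / df), and breaks score ties alphabetically; the function name and the `top_k` parameter are also assumptions.

```python
import math
from collections import Counter


def tfidf_subject_words(docs, top_k=3):
    """docs: {ad_id: list of tokens}. Returns {ad_id: top_k tokens by TF-IDF}."""
    n_docs = len(docs)
    # Document frequency: in how many landing pages each term appears.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    result = {}
    for ad_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[ad_id] = ranked[:top_k]
    return result


docs = {
    "ad_1": "apply loan online loan low rate".split(),
    "ad_2": "play game online free game".split(),
    "ad_3": "online course exam course".split(),
}
topics = tfidf_subject_words(docs, top_k=1)
print(topics)  # {'ad_1': ['loan'], 'ad_2': ['game'], 'ad_3': ['course']}
```

The word "online" appears on every page, so its inverse document frequency is zero and it is never chosen, which is the behavior that makes TF-IDF useful for picking out page-specific subject words.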
It is understood that after determining the landing page subject word corresponding to the landing page picture, the landing page subject word may be used as landing page description information for describing the landing page picture, so that a final landing page subject word may be generated for each advertisement ID to describe the content of the landing page thereof.
And the feature construction module 50 is used for generating landing page features according to the landing page description information.
It should be understood that, after the landing page description information of the landing page corresponding to each advertisement ID is determined, the advertisement IDs may be traversed. The traversed advertisement ID is taken as the current identifier, all landing page description information corresponding to the current identifier is acquired, and that description information is spliced to obtain the landing page feature corresponding to the current identifier. The next traversed advertisement ID is then taken as the current identifier and the above steps are repeated until all advertisement IDs have been traversed, so that each advertisement ID has a corresponding landing page feature.
It can be understood that, since the landing page description information in this embodiment may be landing page subject words, splicing the landing page description information essentially means splicing the landing page subject words.
In a specific implementation, assume there are three advertisement IDs, namely advertisement 1, advertisement 2 and advertisement 3. During traversal, advertisement 1 is first taken as the current identifier, and all landing page description information corresponding to advertisement 1 is spliced to obtain the landing page feature corresponding to advertisement 1. In the next traversal, advertisement 2 is taken as the current identifier, and all landing page description information corresponding to advertisement 2 is spliced to obtain the landing page feature corresponding to advertisement 2. The traversal then continues with advertisement 3 as the current identifier, and all landing page description information corresponding to advertisement 3 is spliced to obtain the landing page feature corresponding to advertisement 3.
It should be noted that the present solution is a combinational innovation: its beneficial effects are achieved by combining multiple technologies. Click log processing, automatic web page screenshots, real-time OCR recognition, and text subject word extraction are all core parts of the solution. In its details, the solution optimizes the efficiency and accuracy of automatic web page screenshots and OCR recognition, and combines these technologies to accurately extract and describe the features of the advertisement landing page. The solution can describe the landing page features of an advertisement more accurately, and these features characterize the probability that an effective conversion occurs after a user clicks a given advertisement. After the features are added to the model, the prediction metrics of the model improve to a certain extent, which improves the recommendation accuracy and revenue of the whole advertisement system.
In this embodiment, a system log corresponding to the multimedia information is obtained; a corresponding landing page picture is obtained according to the system log; character recognition is performed on the landing page picture to obtain text information; landing page description information corresponding to the landing page picture is determined according to the text information; and landing page features are generated according to the landing page description information. By combining these technologies, the landing page features of the multimedia information can be extracted and described more accurately, which helps improve the conversion rate of the multimedia information.
In an embodiment, the landing page picture module 20 is further configured to determine, according to the system log, a plurality of multimedia information identifiers and the click link corresponding to each multimedia information identifier; and obtain a corresponding landing page picture according to the click link.
In an embodiment, the landing page picture module 20 is further configured to access a corresponding multimedia information page according to the click link; and performing screen capturing processing based on the multimedia information page to obtain a corresponding landing page picture.
In an embodiment, the landing page picture module 20 is further configured to invoke a preset browser to access a multimedia information page corresponding to the click link; calling a preset interface of the preset browser to perform screen capturing processing based on a multimedia information page, and obtaining screen capturing information corresponding to the multimedia information page; and obtaining a corresponding landing page picture according to the screen capturing information.
In an embodiment, the landing page picture module 20 is further configured to determine a plurality of multimedia information identifiers according to the system log; detect whether duplicate identifiers exist among the multimedia information identifiers; and, if no duplicate identifier exists, determine the click link corresponding to each multimedia information identifier according to the system log.
In an embodiment, the landing page picture module 20 is further configured to, if duplicate identifiers exist among the multimedia information identifiers, perform de-duplication processing on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers; and determine the click link corresponding to each de-duplicated multimedia information identifier according to the system log.
In an embodiment, the text recognition module 30 is further configured to convert the landing page picture into a to-be-processed landing page picture in a preset format; and carrying out character recognition on the to-be-processed landing page picture through a preset character recognition tool to obtain character information contained in the to-be-processed landing page picture.
In one embodiment, the description information module 40 is further configured to extract candidate subject terms from the text information through a preset natural language processing tool; determining a subject word of a landing page according to the candidate subject word; and taking the landing page subject words as landing page description information for describing the landing page pictures.
In an embodiment, the feature construction module 50 is further configured to traverse the multimedia information identifier, and take the traversed multimedia information identifier as the current identifier; and splicing the landing page description information corresponding to the current identifier to obtain the landing page characteristics.
In an embodiment, the landing page feature extraction device further includes a click rate prediction module, configured to generate a training sample according to the landing page feature; train a preset click through rate model according to the training sample to obtain a target click through rate model; and predict the probability of the user clicking the multimedia information through the target click through rate model.
In an embodiment, the click rate prediction module is further configured to aggregate the training samples in an embedding dimension to obtain aggregated training samples; training the preset click through rate model according to the aggregated training samples to obtain a target click through rate model.
In an embodiment, the click rate prediction module is further configured to extract a multimedia information side feature and a user side feature from the system log; generating an initial sample according to the multimedia information side characteristics and the user side characteristics; and generating a training sample according to the landing page characteristics and the initial sample.
Other embodiments or specific implementation methods of the landing page feature extraction device of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a smart device (which may be a mobile phone, computer, landing page feature extraction device, or network landing page feature extraction device, etc.) to perform the methods according to the embodiments of the present invention.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit its scope. Any equivalent structure or equivalent process derived from the disclosure herein, and any direct or indirect application of it in other related technical fields, likewise falls within the scope of protection of the invention.
The invention discloses A1, a landing page feature extraction method, comprising the following steps:
acquiring a system log corresponding to multimedia information;
obtaining a corresponding landing page picture according to the system log;
performing character recognition on the landing page picture to obtain text information;
determining landing page description information corresponding to the landing page picture according to the text information;
and generating landing page features according to the landing page description information.
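The five steps of A1 read naturally as a linear pipeline. The following is a minimal Python sketch under stated assumptions: the callables `fetch_picture`, `recognize_text`, and `describe` are hypothetical stand-ins for the screen capture, character recognition, and description steps, and the log record layout (`ad_id`, `click_url`) is invented for illustration. None of these names come from the source.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical log record layout: {"ad_id": ..., "click_url": ...}.
LogRecord = Dict[str, str]


def extract_landing_page_features(
    system_log: Iterable[LogRecord],
    fetch_picture: Callable[[str], bytes],
    recognize_text: Callable[[bytes], str],
    describe: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Run the A1 steps in order and return one feature string per identifier."""
    features: Dict[str, str] = {}
    for record in system_log:                              # step 1: system log
        picture = fetch_picture(record["click_url"])       # step 2: landing page picture
        text = recognize_text(picture)                     # step 3: character recognition
        description = describe(text)                       # step 4: description information
        features[record["ad_id"]] = " ".join(description)  # step 5: landing page feature
    return features


# Toy stand-ins so the sketch runs without a browser or an OCR engine.
log = [{"ad_id": "ad-1", "click_url": "https://example.com/a"}]
result = extract_landing_page_features(
    log,
    fetch_picture=lambda url: url.encode("utf-8"),
    recognize_text=lambda picture: "summer shoes sale",
    describe=lambda text: text.split()[:2],
)
```

Passing the three stages in as callables keeps the sketch independent of any particular browser, recognition tool, or language processing tool, which the source leaves as "preset" components.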
A2. The landing page feature extraction method according to A1, wherein obtaining the corresponding landing page picture according to the system log comprises:
determining, according to the system log, a plurality of multimedia information identifiers and the click links corresponding to the multimedia information identifiers;
and obtaining the corresponding landing page picture according to the click links.
A3. The landing page feature extraction method according to A2, wherein obtaining the corresponding landing page picture according to the click links comprises:
accessing a corresponding multimedia information page according to the click link;
and performing screen capturing processing based on the multimedia information page to obtain the corresponding landing page picture.
A4. The landing page feature extraction method according to A3, wherein accessing the corresponding multimedia information page according to the click link comprises:
calling a preset browser to access the multimedia information page corresponding to the click link;
correspondingly, performing screen capturing processing based on the multimedia information page to obtain the corresponding landing page picture comprises:
calling a preset interface of the preset browser to perform screen capturing processing based on the multimedia information page, obtaining screen capturing information corresponding to the multimedia information page;
and obtaining the corresponding landing page picture according to the screen capturing information.
A5. The landing page feature extraction method according to A2, wherein determining, according to the system log, a plurality of multimedia information identifiers and the click links corresponding to the multimedia information identifiers comprises:
determining a plurality of multimedia information identifiers according to the system log;
detecting whether duplicate identifiers exist among the multimedia information identifiers;
and if no duplicate identifiers exist among the multimedia information identifiers, determining the click links corresponding to the multimedia information identifiers according to the system log.
A6. The landing page feature extraction method according to A5, wherein after detecting whether duplicate identifiers exist among the multimedia information identifiers, the method further comprises:
if duplicate identifiers exist among the multimedia information identifiers, performing de-duplication processing on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers;
and determining the click links corresponding to the de-duplicated multimedia information identifiers according to the system log.
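The duplicate check and de-duplication in A5 and A6 can be sketched in a few lines of Python. This is an illustration only: the `(identifier, click_link)` row layout is an assumption, as is the choice to keep the first click link logged for an identifier, since the source does not say which link applies when rows repeat.

```python
from typing import Dict, Iterable, List, Tuple


def click_links_for_identifiers(
    log_rows: Iterable[Tuple[str, str]],
) -> Dict[str, str]:
    """Map each distinct identifier to its click link.

    log_rows yields (identifier, click_link) pairs, an assumed layout.
    """
    rows = list(log_rows)
    identifiers: List[str] = [identifier for identifier, _ in rows]

    # Detect whether any identifier is repeated.
    has_duplicates = len(set(identifiers)) != len(identifiers)

    if has_duplicates:
        # De-duplicate while preserving first-seen order.
        identifiers = list(dict.fromkeys(identifiers))

    # Keep the first click link logged for each identifier (an assumption).
    first_link: Dict[str, str] = {}
    for identifier, link in rows:
        first_link.setdefault(identifier, link)

    return {identifier: first_link[identifier] for identifier in identifiers}


links = click_links_for_identifiers(
    [
        ("ad-1", "https://example.com/a"),
        ("ad-2", "https://example.com/b"),
        ("ad-1", "https://example.com/a"),
    ]
)
```

De-duplicating before the screen capture step means each landing page is visited once per identifier, however many times the identifier appears in the log.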
A7. The landing page feature extraction method according to any one of A1 to A6, wherein performing character recognition on the landing page picture to obtain text information comprises:
converting the landing page picture into a to-be-processed landing page picture in a preset format;
and performing character recognition on the to-be-processed landing page picture through a preset character recognition tool to obtain the text information contained in the to-be-processed landing page picture.
A8. The landing page feature extraction method according to any one of A1 to A6, wherein determining the landing page description information corresponding to the landing page picture according to the text information comprises:
extracting candidate subject words from the text information through a preset natural language processing tool;
determining landing page subject words according to the candidate subject words;
and taking the landing page subject words as the landing page description information describing the landing page picture.
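To make the A8 steps concrete, here is a small Python sketch that swaps in a word-frequency ranking for the "preset natural language processing tool", which the source does not name. The stop-word list, the English-only tokenizer, and the `top_k` cutoff are all assumptions made for illustration; a production system processing Chinese landing pages would need a proper word segmenter.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real tool would ship its own.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on", "now"}


def candidate_subject_words(text: str) -> List[str]:
    """Extract candidate subject words: lowercase tokens minus stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def landing_page_subject_words(text: str, top_k: int = 2) -> List[str]:
    """Pick the top_k most frequent candidates as landing page subject words."""
    counts = Counter(candidate_subject_words(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


subject_words = landing_page_subject_words(
    "Running shoes sale: buy running shoes now and save on shoes"
)
```

In this toy input "shoes" occurs three times and "running" twice, so those two words become the description of the landing page.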
A9. The landing page feature extraction method according to any one of A1 to A6, wherein generating the landing page features according to the landing page description information comprises:
traversing the multimedia information identifiers, and taking each traversed multimedia information identifier as the current identifier;
and splicing the landing page description information corresponding to the current identifier to obtain the landing page features.
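The traversal and splicing in A9 might look like the following Python sketch. The dictionary layout and the space separator are assumptions; the source only says that the description information for the current identifier is spliced together.

```python
from typing import Dict, Iterable, List


def build_landing_page_features(
    identifiers: Iterable[str],
    descriptions: Dict[str, List[str]],
    separator: str = " ",
) -> Dict[str, str]:
    """Splice each identifier's description words into one feature string."""
    features: Dict[str, str] = {}
    for current in identifiers:  # the traversed identifier is the current identifier
        words = descriptions.get(current, [])
        features[current] = separator.join(words)
    return features


page_features = build_landing_page_features(
    ["ad-1", "ad-2"],
    {"ad-1": ["shoes", "running"], "ad-2": ["phone"]},
)
```

Identifiers with no description information map to an empty string here, so downstream sample generation can still find a value for every identifier.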
A10. The landing page feature extraction method according to any one of A1 to A6, wherein after generating the landing page features according to the landing page description information, the method further comprises:
generating training samples according to the landing page features;
training a preset click-through rate model according to the training samples to obtain a target click-through rate model;
and predicting, through the target click-through rate model, the probability that a user clicks the multimedia information.
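The train-then-predict flow of A10 can be illustrated with a very small model. The sketch below uses logistic regression trained by stochastic gradient descent as a stand-in; the source does not say what the "preset click-through rate model" is, and the two-feature toy samples, learning rate, and epoch count are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, clicked label 0 or 1)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_click_probability(
    weights: Sequence[float], bias: float, x: Sequence[float]
) -> float:
    """Probability that the user clicks, under the trained model."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_ctr_model(
    samples: List[Sample], epochs: int = 200, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Gradient of the log loss with respect to the logit.
            error = predict_click_probability(weights, bias, x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy data: the first feature decides the click label.
samples = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]
weights, bias = train_ctr_model(samples)
p_click = predict_click_probability(weights, bias, [1.0, 0.0])
p_no_click = predict_click_probability(weights, bias, [0.0, 1.0])
```

After training, `p_click` is above 0.5 and `p_no_click` is below it, which is the "probability of the user clicking the multimedia information" that the last step of A10 refers to.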
A11. The landing page feature extraction method according to A10, wherein training the preset click-through rate model according to the training samples to obtain the target click-through rate model comprises:
aggregating the training samples in the embedding dimension to obtain aggregated training samples;
and training the preset click-through rate model according to the aggregated training samples to obtain the target click-through rate model.
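The source does not define how samples are "aggregated in the embedding dimension". One plausible reading, offered here only as an assumption, is that the embedding vectors of a landing page's subject words are pooled element-wise into a single fixed-length vector before training. The Python sketch below uses mean pooling; the embedding table is hypothetical.

```python
from typing import List, Sequence


def aggregate_embeddings(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool vectors element-wise along each embedding dimension."""
    if not vectors:
        return []
    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise ValueError("all embeddings must share one dimension")
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 2-dimensional embeddings for two subject words.
embedding_table = {"shoes": [1.0, 3.0], "sale": [3.0, 5.0]}
pooled = aggregate_embeddings([embedding_table[w] for w in ["shoes", "sale"]])
```

Pooling gives every sample the same vector length regardless of how many subject words its landing page produced, which is what a fixed-input model needs.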
A12. The landing page feature extraction method according to A10, wherein generating the training samples according to the landing page features comprises:
extracting multimedia-information-side features and user-side features from the system log;
generating initial samples according to the multimedia-information-side features and the user-side features;
and generating the training samples according to the landing page features and the initial samples.
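The sample construction in A12 amounts to joining three feature groups. A minimal Python sketch follows; the field names (`category`, `age_band`, `landing_page`, `label`) and the key prefixes are illustrative assumptions, not taken from the source.

```python
from typing import Any, Dict


def make_initial_sample(
    ad_side: Dict[str, Any], user_side: Dict[str, Any], clicked: bool
) -> Dict[str, Any]:
    """Combine multimedia-information-side and user-side features with a label."""
    sample: Dict[str, Any] = {f"ad_{key}": value for key, value in ad_side.items()}
    sample.update({f"user_{key}": value for key, value in user_side.items()})
    sample["label"] = int(clicked)
    return sample


def make_training_sample(
    initial_sample: Dict[str, Any], landing_page_feature: str
) -> Dict[str, Any]:
    """Attach the landing page feature without mutating the initial sample."""
    sample = dict(initial_sample)
    sample["landing_page"] = landing_page_feature
    return sample


initial = make_initial_sample({"category": "shoes"}, {"age_band": "25-34"}, True)
training = make_training_sample(initial, "shoes running")
```

Prefixing the keys keeps the two feature sides from colliding when they share a field name, and copying the initial sample lets the same initial sample be reused with or without the landing page feature.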
The invention also discloses B13, a landing page feature extraction device, comprising:
a system log module configured to acquire a system log corresponding to multimedia information;
a landing page picture module configured to obtain a corresponding landing page picture according to the system log;
a character recognition module configured to perform character recognition on the landing page picture to obtain text information;
a description information module configured to determine landing page description information corresponding to the landing page picture according to the text information;
and a feature construction module configured to generate landing page features according to the landing page description information.
B14. The landing page feature extraction device according to B13, wherein the landing page picture module is further configured to determine, according to the system log, a plurality of multimedia information identifiers and the click links corresponding to the multimedia information identifiers; and to obtain the corresponding landing page picture according to the click links.
B15. The landing page feature extraction device according to B14, wherein the landing page picture module is further configured to access a corresponding multimedia information page according to the click link; and to perform screen capturing processing based on the multimedia information page to obtain the corresponding landing page picture.
B16. The landing page feature extraction device according to B15, wherein the landing page picture module is further configured to call a preset browser to access the multimedia information page corresponding to the click link; call a preset interface of the preset browser to perform screen capturing processing based on the multimedia information page, obtaining screen capturing information corresponding to the multimedia information page; and obtain the corresponding landing page picture according to the screen capturing information.
B17. The landing page feature extraction device according to B14, wherein the landing page picture module is further configured to determine a plurality of multimedia information identifiers according to the system log; detect whether duplicate identifiers exist among the multimedia information identifiers; and, if no duplicate identifiers exist among the multimedia information identifiers, determine the click links corresponding to the multimedia information identifiers according to the system log.
B18. The landing page feature extraction device according to B17, wherein the landing page picture module is further configured to, if duplicate identifiers exist among the multimedia information identifiers, perform de-duplication processing on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers; and to determine the click links corresponding to the de-duplicated multimedia information identifiers according to the system log.
The invention also discloses C19, a landing page feature extraction apparatus, comprising: a memory, a processor, and a landing page feature extraction program that is stored in the memory and executable on the processor, wherein the landing page feature extraction program, when executed by the processor, implements the landing page feature extraction method described above.
The invention also discloses D20, a storage medium storing a landing page feature extraction program, wherein the landing page feature extraction program, when executed by a processor, implements the landing page feature extraction method described above.

Claims (10)

1. A landing page feature extraction method, characterized by comprising the following steps:
acquiring a system log corresponding to the multimedia information;
obtaining a corresponding landing page picture according to the system log;
performing character recognition on the landing page picture to obtain text information;
determining landing page description information corresponding to the landing page picture according to the text information;
and generating landing page features according to the landing page description information.
2. The landing page feature extraction method according to claim 1, wherein obtaining the corresponding landing page picture according to the system log comprises:
determining, according to the system log, a plurality of multimedia information identifiers and the click links corresponding to the multimedia information identifiers;
and obtaining a corresponding landing page picture according to the click link.
3. The method for extracting the landing page features of claim 2, wherein the obtaining the corresponding landing page picture according to the click link includes:
accessing a corresponding multimedia information page according to the click link;
and performing screen capturing processing based on the multimedia information page to obtain a corresponding landing page picture.
4. The landing page feature extraction method according to claim 3, wherein accessing the corresponding multimedia information page according to the click link comprises:
calling a preset browser to access the multimedia information page corresponding to the click link;
correspondingly, performing screen capturing processing based on the multimedia information page to obtain the corresponding landing page picture comprises:
calling a preset interface of the preset browser to perform screen capturing processing based on the multimedia information page, obtaining screen capturing information corresponding to the multimedia information page;
and obtaining the corresponding landing page picture according to the screen capturing information.
5. The landing page feature extraction method according to claim 2, wherein determining, according to the system log, a plurality of multimedia information identifiers and the click links corresponding to the multimedia information identifiers comprises:
determining a plurality of multimedia information identifiers according to the system log;
detecting whether duplicate identifiers exist among the multimedia information identifiers;
and if no duplicate identifiers exist among the multimedia information identifiers, determining the click links corresponding to the multimedia information identifiers according to the system log.
6. The landing page feature extraction method according to claim 5, wherein after detecting whether duplicate identifiers exist among the multimedia information identifiers, the method further comprises:
if duplicate identifiers exist among the multimedia information identifiers, performing de-duplication processing on the multimedia information identifiers to obtain de-duplicated multimedia information identifiers;
and determining the click links corresponding to the de-duplicated multimedia information identifiers according to the system log.
7. The landing page feature extraction method according to any one of claims 1 to 6, wherein performing character recognition on the landing page picture to obtain text information comprises:
converting the landing page picture into a to-be-processed landing page picture in a preset format;
and performing character recognition on the to-be-processed landing page picture through a preset character recognition tool to obtain the text information contained in the to-be-processed landing page picture.
8. A landing page feature extraction device, characterized in that the landing page feature extraction device comprises:
a system log module configured to acquire a system log corresponding to multimedia information;
a landing page picture module configured to obtain a corresponding landing page picture according to the system log;
a character recognition module configured to perform character recognition on the landing page picture to obtain text information;
a description information module configured to determine landing page description information corresponding to the landing page picture according to the text information;
and a feature construction module configured to generate landing page features according to the landing page description information.
9. A landing page feature extraction apparatus, characterized by comprising: a memory, a processor, and a landing page feature extraction program stored on the memory and executable on the processor, which when executed by the processor, implements the landing page feature extraction method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a landing page feature extraction program which, when executed by a processor, implements the landing page feature extraction method according to any one of claims 1 to 7.
CN202111519476.2A 2021-12-13 2021-12-13 Floor page feature extraction method, device, equipment and storage medium Pending CN116263990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111519476.2A CN116263990A (en) 2021-12-13 2021-12-13 Floor page feature extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519476.2A CN116263990A (en) 2021-12-13 2021-12-13 Floor page feature extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116263990A true CN116263990A (en) 2023-06-16

Family

ID=86723199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519476.2A Pending CN116263990A (en) 2021-12-13 2021-12-13 Floor page feature extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116263990A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination