CN116070052A - Interface data transmission method, device, terminal and storage medium - Google Patents


Info

Publication number
CN116070052A
Authority
CN
China
Prior art keywords
acquisition mode
target webpage
link
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310042516.1A
Other languages
Chinese (zh)
Inventor
王峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aijiwei Consulting Xiamen Co ltd
Original Assignee
Aijiwei Consulting Xiamen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aijiwei Consulting Xiamen Co ltd
Priority to CN202310042516.1A
Publication of CN116070052A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Abstract

The embodiment of the invention discloses an interface data transmission method, an interface data transmission device, a terminal and a storage medium. The scheme determines a target webpage and acquires element information in the target webpage; sets an acquisition mode according to the type of the target webpage and the element information, the acquisition mode comprising an html acquisition mode and a simulator acquisition mode; sets configuration parameters according to the target webpage; acquires link data in the target webpage based on the acquisition mode and the configuration parameters; and transmits the link data, together with the title information and detail page information corresponding to the link data, to a database for storage. Because the acquisition mode and the configuration parameters are selected according to the target webpage, webpage links can be extracted rapidly and the acquisition efficiency is effectively improved.

Description

Interface data transmission method, device, terminal and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an interface data transmission method, an apparatus, a terminal, and a storage medium.
Background
Most web pages on the internet contain hyperlinks. These hyperlinks link the individual web pages together to form a vast network, i.e., a hyperlinked network. A data acquisition system is a network program that starts from a set of web pages, stores their content, finds the hyperlinks in them, visits those hyperlinks, and repeats the process, which can continue indefinitely. With the arrival of the big data era, data acquisition systems have become increasingly important for anyone who wants to analyze data quantitatively.
In actual use, the applicant found that most data acquisition systems currently on the market are first-generation systems, which exploit a computer's suitability for repetitive work to collect and process data in batches using templates made by a data analyst. Consequently, if the original website is modified, the configured template becomes invalid and a data analyst must make the template again. In addition, duplicate website articles are not handled well. As a result, extensive intervention by data analysts is required, a great deal of time and effort is spent, and collection efficiency is low.
Disclosure of Invention
The embodiment of the invention provides an interface data transmission method, an interface data transmission device, a terminal and a storage medium, which can select a corresponding acquisition mode and configuration parameters according to a target webpage, so that webpage links can be rapidly extracted, and the acquisition efficiency is effectively improved.
The embodiment of the invention provides an interface data transmission method, which comprises the following steps:
determining a target webpage and acquiring element information in the target webpage;
setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode;
Setting configuration parameters according to the target webpage, and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
The embodiment of the invention also provides an interface data transmission device, which comprises:
the determining unit is used for determining a target webpage and acquiring element information in the target webpage;
the setting unit is used for setting a collection mode according to the type of the target webpage and the element information, wherein the collection mode comprises an html collection mode and a simulator collection mode;
the acquisition unit is used for setting configuration parameters according to the target webpage and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and the storage unit is used for transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
The embodiment of the invention also provides a terminal comprising a processor and an application program, wherein the steps of the interface data transmission method provided by any one of the embodiments of the invention are implemented when the application program is executed by the processor.
The embodiment of the invention also provides a storage medium which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor so as to execute any interface data transmission method provided by the embodiment of the invention.
According to the interface data transmission method provided by the embodiment of the invention, the target webpage can be determined, the element information in the target webpage can be obtained, the acquisition mode is set according to the type of the target webpage and the element information, the acquisition mode comprises an html acquisition mode and a simulator acquisition mode, the configuration parameters are set according to the target webpage, the link data in the target webpage are acquired based on the acquisition mode and the configuration parameters, and the link data, the title information and the detail page information corresponding to the link data are transmitted to the database and stored. The scheme provided by the embodiment of the application can select the corresponding acquisition mode and the configuration parameters according to the target webpage, so that the webpage links are extracted rapidly, and the acquisition efficiency is improved effectively.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a first schematic flow chart of an interface data transmission method according to an embodiment of the present invention;
Fig. 2 is a second schematic flow chart of an interface data transmission method according to an embodiment of the present invention;
Fig. 3 is a first schematic structural diagram of an interface data transmission device according to an embodiment of the present invention;
Fig. 4 is a second schematic structural diagram of an interface data transmission device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the element defined by the phrase "comprising one … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element, and furthermore, elements having the same name in different embodiments of the present application may have the same meaning or may have different meanings, a particular meaning of which is to be determined by its interpretation in this particular embodiment or by further combining the context of this particular embodiment.
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
It should be noted that, in this document, step numbers such as 101 and 102 are used for the purpose of describing the corresponding content more clearly and briefly, and not to constitute a substantial limitation on the sequence, and those skilled in the art may execute 102 first and then execute 101 when they are implemented, which is within the scope of protection of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The embodiment of the invention provides an interface data transmission method, and an execution main body of the interface data transmission method can be the interface data transmission device provided by the embodiment of the invention or an intelligent terminal and a server integrated with the interface data transmission device, wherein the interface data transmission device can be realized in a hardware or software mode.
Before describing the technical scheme of the invention, related technical terms are briefly explained:
url: Uniform Resource Locator, a method of specifying the location of information on internet web services.
href: the value of the href attribute may be the relative or absolute URL of any valid document, including a fragment identifier or a JavaScript code fragment.
head: the header, i.e., the string that the server sends before it transmits the HTML data to the browser over the HTTP protocol; it is separated from the HTML file by a blank line.
API: Application Programming Interface, a set of predefined functions designed to give applications and developers access to a set of routines of certain software or hardware without having to access the source code or understand the details of the internal operating mechanisms.
As shown in fig. 1, fig. 1 is a schematic flow chart of a first process of an interface data transmission method according to an embodiment of the present invention, where a specific flow of the interface data transmission method may be as follows:
101. Determining the target webpage and acquiring element information in the target webpage.
In an embodiment, the element information describes the constituent elements of a web page. The element information of the target webpage may include the website name, the website address, the root domain address, the region to which the website belongs, the html text of the webpage, and the like. It may further include text, images, animation, audio, video, hyperlinks, navigation, tables and other elements in the webpage.
In an html webpage or an xml webpage, the webpage element may include a plurality of child nodes, where each child node includes different information, so that the webpage element becomes a node with complete information. When the user clicks on a web page element, the web page element will be retrieved. And correspondingly displaying different function option groups according to the acquired webpage elements so that a user can select the function options in the function option groups. The result of the user selection corresponds to the specific data that the user needs to collect, such as text, notes, attribute values, etc. of the web page elements.
102. Setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode.
In the embodiment of the application, either of two acquisition modes can be used to acquire the link data of a webpage. In the html acquisition mode, the system directly pulls the HTML content of the website and parses it. In the simulator acquisition mode, the system opens a simulated browser and acquires the content only after the js, html and other resources have finished loading.
The two modes differ in that the html acquisition mode is faster, whereas the simulator acquisition mode captures richer content and is more compatible. The embodiment of the application can therefore select the acquisition mode according to the type of the target webpage and the element information. For example, the security level of the target webpage or the access frequency of the corresponding domain name can be acquired first, so that a webpage with a higher security level or access frequency is set to the simulator acquisition mode and a webpage with a lower security level or access frequency is set to the html acquisition mode. As another example, the number of links included in the element information of the target webpage may be acquired first, and the simulator acquisition mode may be used when the number of links is large.
In another embodiment, the security level of the target webpage, the access frequency of the domain name corresponding to the target webpage, and the number of links included in the element information of the target webpage may each be obtained, and the three factors are then combined to decide between the html acquisition mode and the simulator acquisition mode. For example, a weight is set for each of the three factors and the decision is calculated from the weighted factors. That is, the step of setting the acquisition mode according to the type of the target webpage and the element information may include: acquiring the security level of the target webpage and the access frequency of the domain name corresponding to the target webpage, counting the number of links contained in the element information of the target webpage, and determining a target acquisition mode from the html acquisition mode and the simulator acquisition mode according to the security level, the access frequency, the number of links and their respective weights.
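The weighted decision described above can be illustrated with a minimal Go sketch. The source does not give the weights, the scale of each factor, or the decision rule, so everything here is an assumption made for illustration: each factor is normalized to the range 0 to 1, the weighted sum is compared with a threshold, and the names `Factors`, `Weights` and `chooseMode` are hypothetical.

```go
package main

import "fmt"

// Mode is one of the two acquisition modes named in the text.
type Mode string

const (
	ModeHTML      Mode = "html"
	ModeSimulator Mode = "simulator"
)

// Factors holds the three inputs. Normalizing each to [0, 1] is an
// assumption of this sketch, not something the source specifies.
type Factors struct {
	SecurityLevel   float64
	AccessFrequency float64
	LinkCount       float64
}

// Weights holds one hypothetical weight per factor.
type Weights struct {
	Security  float64
	Frequency float64
	Links     float64
}

// chooseMode computes a weighted sum of the factors and selects the
// simulator mode when the sum reaches the threshold (assumed rule).
func chooseMode(f Factors, w Weights, threshold float64) Mode {
	score := w.Security*f.SecurityLevel +
		w.Frequency*f.AccessFrequency +
		w.Links*f.LinkCount
	if score >= threshold {
		return ModeSimulator
	}
	return ModeHTML
}

func main() {
	w := Weights{Security: 0.5, Frequency: 0.3, Links: 0.2}
	busy := Factors{SecurityLevel: 0.9, AccessFrequency: 0.8, LinkCount: 0.7}
	quiet := Factors{SecurityLevel: 0.1, AccessFrequency: 0.2, LinkCount: 0.1}
	fmt.Println(chooseMode(busy, w, 0.5))  // weighted sum 0.83
	fmt.Println(chooseMode(quiet, w, 0.5)) // weighted sum 0.13
}
```

A high score maps to the simulator mode here because the text assigns that mode to pages with a higher security level, a higher access frequency or more links.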
103. Setting configuration parameters according to the target webpage, and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters.
In the embodiment of the application, different configuration parameters can be set in advance for different crawling links. The task allocator generates a crawling flow based on the created crawling task, allocates the cooperative range of the crawling task among a plurality of different crawling terminals or user levels, and finally crawls the link data in the target webpage.
The configuration parameters may include the data source update frequency of the target webpage, the detail page access frequency of the target webpage, the link extraction rule and the position of at least one link of the target webpage, and the login information of the target webpage.
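As a rough illustration, the configuration parameters listed above could be grouped into a single Go struct. The field names, the use of `time.Duration` for the two frequencies and the `Validate` method are assumptions of this sketch; the source does not define a concrete data layout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LinkRule is a hypothetical container for one link extraction rule
// and the position of the link it targets.
type LinkRule struct {
	Kind     string // for example "xpath", "css" or "jspath"
	Expr     string // the rule expression itself
	Position string // optional relative link position
}

// CrawlConfig groups the configuration parameters named in the text.
type CrawlConfig struct {
	SourceUpdateInterval time.Duration // controls the data source update frequency
	DetailPageInterval   time.Duration // limits the detail page access frequency
	LinkRules            []LinkRule    // link extraction rules and link positions
	LoginName            string        // login information, if the page needs it
	LoginKey             string
}

// Validate performs the basic sanity checks assumed by this sketch.
func (c CrawlConfig) Validate() error {
	if c.SourceUpdateInterval <= 0 || c.DetailPageInterval <= 0 {
		return errors.New("intervals must be positive")
	}
	if len(c.LinkRules) == 0 {
		return errors.New("at least one link extraction rule is required")
	}
	return nil
}

func main() {
	cfg := CrawlConfig{
		SourceUpdateInterval: 30 * time.Minute,
		DetailPageInterval:   2 * time.Second,
		LinkRules:            []LinkRule{{Kind: "css", Expr: "ul.news > li > a"}},
	}
	fmt.Println(cfg.Validate() == nil)
}
```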
104. Transmitting the link data, and the title information and detail page information corresponding to the link data, to a database and storing them.
After the link data in the target webpage is acquired, information corresponding to the link data, such as the title, task ID, detail page address and website, can be further acquired. This information is stored as a struct, and the struct is stored in a slice for later use. Finally, the contents of the slice are written to the database in one batch, the next task flow is established, the time of the original task and the number of acquired links are updated, and the task is marked as processed.
In view of the foregoing, the interface data transmission method provided by the embodiment of the invention can determine the target webpage and acquire the element information in the target webpage, set the acquisition mode according to the type of the target webpage and the element information, set the configuration parameters according to the target webpage, acquire the link data in the target webpage based on the acquisition mode and the configuration parameters, and transmit the link data, the title information and the detail page information corresponding to the link data to the database and store the link data. The scheme provided by the embodiment of the application can select the corresponding acquisition mode and the configuration parameters according to the target webpage, so that the webpage links are extracted rapidly, and the acquisition efficiency is improved effectively.
The method according to the previous embodiments will be described in further detail below.
Referring to fig. 2, fig. 2 is a schematic flow chart of a second method for transmitting interface data according to an embodiment of the invention. The method comprises the following steps:
201. Determining the target webpage and acquiring element information in the target webpage.
In one embodiment, the element information of the target webpage may include a website name, a website address, a root domain address, a region to which the website belongs, html text of the webpage, and the like.
202. Setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode.
In an embodiment, the acquisition type may also be confirmed in advance, where the acquisition type may be a list mode or a single page mode. The acquisition mode is then set: the html acquisition mode or the simulator acquisition mode is selected according to the type of the target webpage and the element information, in the manner described in the previous embodiment, which is not repeated here.
Furthermore, before the link data of the target webpage is collected, the details of the webpage to be collected can be filled in advance: the interval for accessing the data source is set to control the update frequency, and the interval for collecting detail pages is set to limit the access frequency. When the paging address is filled in, if the acquisition mode is the html acquisition mode, the url of the next page is filled in, and the middle part of the url contains a replaceable parameter that determines the specific address of the next page. If the acquisition mode is the simulator acquisition mode, the next-page element, or the block containing several such elements, is filled in to identify the click position of the next page, and the simulated browser then waits to execute the click and continue the acquisition task.
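For the html acquisition mode, the replaceable parameter in the paging url can be pictured as a placeholder that is substituted with the page number. The `{page}` placeholder syntax and the `pageURL` helper below are assumptions made for this sketch; the source does not say how the replaceable parameter is written.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pagePlaceholder is a hypothetical marker for the replaceable part of the url.
const pagePlaceholder = "{page}"

// pageURL substitutes the page number into the paging url template.
// It returns an error when the template has no replaceable parameter.
func pageURL(template string, page int) (string, error) {
	if !strings.Contains(template, pagePlaceholder) {
		return "", fmt.Errorf("template %q has no %s placeholder", template, pagePlaceholder)
	}
	return strings.ReplaceAll(template, pagePlaceholder, strconv.Itoa(page)), nil
}

func main() {
	u, err := pageURL("https://example.com/news/list_{page}.html", 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u) // prints https://example.com/news/list_3.html
}
```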
In an embodiment, link extraction rules, the specific element positions of the links to be extracted, and title extraction rules may be further set. The link extraction rules specifically include rules based on XPath, a css selector, a js path, and the like. If some elements can only be obtained after logging in, the login name and login key also need to be filled in. The parameters and acquisition method of the API are also set, for example whether the request uses the get or post method, and the header of the API is matched with the webpage capturing mode; if the data is acquired in API mode, header parameters can optionally be filled in. If other special processing is required, specific parameters are selected or filled in accordingly, where the options can include: relative path, automatic identification of the next page according to the paging mode, browser, link block, deep search mode, relative link position, preprocessing, and the like.
203. When the acquisition mode is the html acquisition mode, setting the link extraction rule into a unified data format.
Specifically, when the html acquisition mode is used, the task can be distributed to a common task processor. The history record in the storage medium is first acquired according to the current configuration id, and the system then tries to access the link and starts a grabbing task, for example by starting go glue as the basic grabbing tool. It then judges whether the link extraction rule is set as a css selector or a js path and converts it into a unified data format.
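One way to picture the conversion into a unified data format is a small normalizer that maps each configured rule onto a common struct. The sketch below assumes that a js path has the form produced by a browser's "copy JS path" action, `document.querySelector("...")`, and that XPath rules begin with `/`, `./` or `(`. These heuristics, the `Rule` type and the `normalizeRule` function are illustrative assumptions, not the method defined in the source.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical unified representation of a link extraction rule.
type Rule struct {
	Kind string // "css" or "xpath"
	Expr string
}

// normalizeRule converts a configured rule string into a Rule.
// A js path of the form document.querySelector("sel") is reduced to the
// css selector it wraps, so both inputs end up in the same format.
func normalizeRule(raw string) Rule {
	s := strings.TrimSpace(raw)
	const prefix = "document.querySelector("
	if strings.HasPrefix(s, prefix) && strings.HasSuffix(s, ")") {
		inner := s[len(prefix) : len(s)-1]
		inner = strings.Trim(inner, "\"'`") // drop the surrounding quotes
		return Rule{Kind: "css", Expr: inner}
	}
	if strings.HasPrefix(s, "/") || strings.HasPrefix(s, "./") || strings.HasPrefix(s, "(") {
		return Rule{Kind: "xpath", Expr: s}
	}
	return Rule{Kind: "css", Expr: s}
}

func main() {
	fmt.Println(normalizeRule(`document.querySelector("#list > li > a")`))
	fmt.Println(normalizeRule("ul.news > li > a"))
	fmt.Println(normalizeRule("//ul[@class='news']/li/a"))
}
```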
204. Pulling data according to the data format, acquiring the encoding of the header string, and modifying the encoding of the header string to a preset encoding format.
The data is pulled and the encoding declared in the header is checked. If the encoding is utf-8, no special processing is performed; if the encoding is GBK or GB2312, the preprocessing header is set to GBK.
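The encoding check can be sketched with the Go standard library alone. The example assumes the encoding is read from the `charset` parameter of the HTTP `Content-Type` header and that a missing or unparsable value falls back to utf-8; both points are assumptions, and `headerCharset` is a hypothetical helper. Actually decoding GBK bytes would need an additional package such as `golang.org/x/text/encoding/simplifiedchinese`, which is left out here.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// headerCharset returns the encoding to use for a response, based on the
// charset parameter of its Content-Type header. GB2312 is mapped to GBK,
// mirroring the rule in the text that both are handled as GBK.
func headerCharset(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "utf-8" // assumed fallback when the header cannot be parsed
	}
	cs := strings.ToLower(params["charset"])
	switch cs {
	case "", "utf-8", "utf8":
		return "utf-8"
	case "gbk", "gb2312":
		return "gbk"
	default:
		return cs
	}
}

func main() {
	fmt.Println(headerCharset("text/html; charset=UTF-8"))
	fmt.Println(headerCharset("text/html; charset=GB2312"))
	fmt.Println(headerCharset("text/html"))
}
```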
205. Acquiring the corresponding html code block according to the link extraction rule of the target webpage, and searching for the href attribute value in the html code block to serve as link data.
In an embodiment, after searching the html code block for the href attribute value as the link data, the method may further include: judging whether the address of the link data is empty; if not, further judging whether the address of the link data starts with http; and if it does not start with http, splicing the detail page and the address of the link data according to the relative path configuration to form a new link data address.
Specifically, the number of retries for the website can be set first. The corresponding html code block is then pulled according to the configuration of the link extraction rule, and the multiple elements in the html code block are processed in a loop. For each, it is judged whether the final element is an a link. If it is, the href attribute of the a link is read directly to obtain the link address. If it is not, the relative link position configuration is checked to determine the element to grab; if no relative link position is configured, the a element is searched for directly and its href attribute is read to obtain the link address.
Further, the obtained link address is checked. If it is not empty, it is judged whether the address begins with http. A link beginning with http needs no further processing; otherwise, the relative path configuration is checked and the relative path is processed. Specifically, if the link address begins with ./, the address of the detail page can be parsed into a url structure using the Parse function of url in Go, the path portion of the detail page is split into a slice using / as the separator, the number of currently grabbed links is looked up to determine the level of the relative path, and the resulting slice is spliced into a new link address according to that number. If the link address does not begin with ./, the address of the detail page can be spliced directly with the retrieved link address to form a new link address.
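The empty check, the http check and the relative path handling can be sketched in Go as follows. Note that this example swaps in a different technique for the last step: instead of splitting the path into a slice and re-splicing it by hand as the text describes, it uses `ResolveReference` from the standard `net/url` package, which resolves `./`, `../` and root-relative links against the detail page address. The function name `resolveLink` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// resolveLink turns a grabbed href into an absolute link address.
// Empty addresses are rejected, addresses starting with http are returned
// unchanged, and everything else is resolved against the detail page.
func resolveLink(detailPage, href string) (string, error) {
	href = strings.TrimSpace(href)
	if href == "" {
		return "", errors.New("empty link address")
	}
	if strings.HasPrefix(href, "http") {
		return href, nil
	}
	base, err := url.Parse(detailPage)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// Standard-library resolution, used here in place of manual splicing.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/news/list/index.html"
	for _, href := range []string{"./a/1.html", "../b/2.html", "/c/3.html"} {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Resolving against the detail page in this way also covers links that do not begin with `./`, which the text handles by direct concatenation.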
Further, it is checked whether a title configuration exists. If none is configured, the title in the a link is used directly, or the text inside the a tag is grabbed, as the title field; if a title configuration exists, the text of the corresponding element within the a link is grabbed as the title field.
206. When the acquisition mode is the simulator acquisition mode, loading the target webpage through a simulator.
207. Collecting all link data, in units of element blocks, based on the position of at least one link in the target webpage.
Specifically, when the simulator acquisition mode is used, the task is distributed to an advanced task processor. go selenium can be used to process advanced tasks, opening the page and selecting the elements to collect in the simulation mode of a chrome or firefox browser. If the connection is a chrome connection, the corresponding request is processed through the ChromeDriver on the server; if it is a firefox connection, the request is processed through the geckodriver on the server. The system opens the corresponding link with the simulator and judges whether preprocessing is needed: if a preprocessing program is set, the preprocessing operation is parsed and the corresponding click operation is performed; otherwise the current flow continues. The corresponding old data is pulled, and the retry count, the comparison repetition rate and the loop stop flag are set. The corresponding browser is opened in simulation as a carrier to access the preconfigured detail page, and the loop starts: the system waits for the page to finish loading, acquires the elements on the page, determines how to pull links according to the preconfigured options, and acquires the corresponding links and titles. After the links are obtained, they are compared with the old data. If no link is obtained, a retry is attempted; if the number of retries exceeds the limit, the loop exits and the task is marked as failed.
During the comparison, it is judged whether the data is the same as that of the previous page. If it is, the retry count is increased and the data is loaded again. If the repetition rate is greater than or equal to the repetition rate threshold, the loop exits and the remaining data is stored in the slice. If the repetition rate is below the threshold, the data is stored in the data slice, the next-page button is located according to the next-page and paging configuration and clicked, and the loop continues. If the retry count exceeds the set value, or the button cannot be clicked or clicking it does not turn the page, the whole acquisition is considered complete: the loop ends and the data is stored.
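The repetition rate check in this loop can be sketched without a browser by treating each loaded page as a list of link addresses. The definition of the repetition rate used here (the share of links on a page that were already seen) and the names `repetitionRate` and `collect` are assumptions of this sketch; retries and the click on the next-page button are omitted.

```go
package main

import "fmt"

// repetitionRate returns the fraction of links on a page that were seen before.
func repetitionRate(page []string, seen map[string]bool) float64 {
	if len(page) == 0 {
		return 0
	}
	dup := 0
	for _, link := range page {
		if seen[link] {
			dup++
		}
	}
	return float64(dup) / float64(len(page))
}

// collect walks the pages in order, stores links that were not seen before
// in a slice, and stops once a page reaches the repetition rate threshold.
func collect(pages [][]string, old []string, threshold float64) []string {
	seen := make(map[string]bool, len(old))
	for _, link := range old {
		seen[link] = true
	}
	var out []string
	for _, page := range pages {
		rate := repetitionRate(page, seen)
		for _, link := range page {
			if !seen[link] {
				seen[link] = true
				out = append(out, link) // remaining new data is still stored
			}
		}
		if rate >= threshold {
			break // mostly old data: stop paging
		}
	}
	return out
}

func main() {
	pages := [][]string{{"a", "b"}, {"c", "d"}, {"e", "f"}}
	old := []string{"c", "d"}
	fmt.Println(collect(pages, old, 0.5)) // the second page is all old data, so paging stops there
}
```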
208. Transmitting the link data, and the title information and detail page information corresponding to the link data, to a database and storing them.
Specifically, the acquired title and address, the task id, the detail page address and the website may be assembled into a struct, and the struct may be stored in a slice for later use. Further, if paging is configured, the corresponding flow is also executed.
Finally, the contents of the slice are written to the database in one batch, the next task flow is established, the time of the original task and the number of acquired links are updated, and the task is marked as processed.
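The struct, the slice and the final batched write can be sketched in Go as below. The source does not name the database or its schema, so the write is hidden behind a hypothetical `Store` interface with an in-memory implementation; `LinkRecord`, `Task` and `flush` are likewise illustrative names, and creating the next task flow is not shown.

```go
package main

import (
	"fmt"
	"time"
)

// LinkRecord is the struct assembled for each acquired link.
type LinkRecord struct {
	Title      string
	Address    string
	TaskID     string
	DetailPage string
	Website    string
}

// Store abstracts the database write; a real implementation would issue
// a batched insert against whatever database is in use.
type Store interface {
	InsertAll(records []LinkRecord) error
}

// memoryStore is an in-memory stand-in used only for this sketch.
type memoryStore struct{ rows []LinkRecord }

func (m *memoryStore) InsertAll(records []LinkRecord) error {
	m.rows = append(m.rows, records...)
	return nil
}

// Task tracks the bookkeeping fields that the text says are updated.
type Task struct {
	ID        string
	UpdatedAt time.Time
	LinkCount int
	Done      bool
}

// flush writes the whole slice in one call, then updates the task time
// and link count and marks the task as processed.
func flush(store Store, task *Task, batch []LinkRecord) error {
	if err := store.InsertAll(batch); err != nil {
		return err
	}
	task.UpdatedAt = time.Now()
	task.LinkCount += len(batch)
	task.Done = true
	return nil
}

func main() {
	store := &memoryStore{}
	task := &Task{ID: "task-1"}
	batch := []LinkRecord{{
		Title:      "Example title",
		Address:    "https://example.com/a/1.html",
		TaskID:     task.ID,
		DetailPage: "https://example.com/list",
		Website:    "example.com",
	}}
	if err := flush(store, task, batch); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Println(len(store.rows), task.LinkCount, task.Done)
}
```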
In the traditional crawling approach, a separate set of code must be written for each website to achieve compatibility; the implementations are highly repetitive and hard to manage, the data is not visualized, and updating is extremely difficult. The invention aims to solve these problems and addresses the current difficulties with advantages such as overall visualization, configurability, broad compatibility, arbitrary capacity expansion, and support for both dynamic and static acquisition.
According to the interface data transmission method provided by the embodiment of the invention, a target webpage is determined and the element information in the target webpage is obtained, and an acquisition mode is set according to the type of the target webpage and the element information, the acquisition mode including an html acquisition mode and a simulator acquisition mode. When the acquisition mode is the html acquisition mode, the link extraction rule is set to a unified data format, data is pulled according to that format, the encoding of the header string is obtained and converted to a preset encoding format, the corresponding html code block is obtained according to the link extraction rule of the target webpage, and the href attribute value found in the html code block is used as the link data. When the acquisition mode is the simulator acquisition mode, the target webpage is loaded through the simulator and all link data in it are collected in units of element blocks based on the position of at least one link in the target webpage. The link data, together with the corresponding title information and detail page information, are then transmitted to the database and stored. The scheme provided by the embodiment of the application can select the corresponding acquisition mode and configuration parameters according to the target webpage, so that webpage links are extracted rapidly and acquisition efficiency is effectively improved.
In order to implement the above method, the embodiment of the invention also provides an interface data transmission device, which can be integrated in terminal equipment such as a mobile phone, a tablet personal computer and the like.
For example, fig. 3 is a first structural schematic diagram of an interface data transmission device according to an embodiment of the present invention. The interface data transmission device may include:
a determining unit 301, configured to determine a target web page and obtain element information in the target web page;
a setting unit 302, configured to set an acquisition mode according to the type of the target webpage and the element information, where the acquisition mode includes an html acquisition mode and a simulator acquisition mode;
an acquisition unit 303, configured to set a configuration parameter according to the target web page, and acquire link data in the target web page based on the acquisition mode and the configuration parameter;
and a storage unit 304, configured to transmit the link data and title information and detail page information corresponding to the link data to a database and store the same.
In an embodiment, referring to fig. 4, fig. 4 is a schematic diagram of a second structure of an interface data transmission device according to an embodiment of the present invention, where the setting unit 302 specifically includes:
an obtaining subunit 3021, configured to obtain a security level of the target web page and an access frequency of a domain name corresponding to the target web page;
a calculation subunit 3022 configured to calculate the number of links included in the element information in the target web page;
a determining subunit 3023, configured to determine a target acquisition mode from the html acquisition mode and the simulator acquisition mode according to the security level, the access frequency, the number of links, and the weights corresponding to the access frequency and the number of links.
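One way the determining subunit's decision could work is sketched below. The source names the inputs (security level, access frequency, number of links, and weights for the latter two) but does not give a formula in this section, so the scoring rule, the direction of each comparison, and every default value here are assumptions for illustration only.

```python
from enum import Enum


class Mode(Enum):
    HTML = "html"
    SIMULATOR = "simulator"


def choose_mode(
    security_level: int,
    access_frequency: float,
    link_count: int,
    *,
    w_frequency: float = 0.6,        # hypothetical weight for access frequency
    w_links: float = 0.4,            # hypothetical weight for link count
    max_security_for_html: int = 1,  # hypothetical cut-off
    frequency_cap: float = 100.0,    # hypothetical normalisation bound
    link_cap: int = 200,             # hypothetical normalisation bound
    score_threshold: float = 0.5,    # hypothetical decision threshold
) -> Mode:
    """Pick an acquisition mode from a weighted score (illustrative rule only)."""
    # Assumption: strongly protected pages always go through the simulator.
    if security_level > max_security_for_html:
        return Mode.SIMULATOR
    # Normalise both inputs to [0, 1] before weighting them.
    frequency = min(access_frequency / frequency_cap, 1.0)
    links = min(link_count / link_cap, 1.0)
    score = w_frequency * frequency + w_links * links
    return Mode.SIMULATOR if score >= score_threshold else Mode.HTML
```

For instance, under these assumed defaults `choose_mode(0, 10, 20)` yields a score of 0.1 and selects the html mode, while a security level of 3 selects the simulator regardless of the other inputs.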
In one embodiment, the configuration parameters may include a data source update frequency of the target web page, a detail page access frequency of the target web page, a link extraction rule and a location of at least one link of the target web page, and login information of the target web page.
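The configuration parameters listed above could be grouped into a single record, as in the following sketch. The field names, the choice to express frequencies as intervals in seconds, and the validation rules are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Credentials:
    """Login information for the target webpage (illustrative fields)."""
    username: str
    password: str


@dataclass
class CollectionConfig:
    """Per-site configuration; names and units are hypothetical."""
    source_update_interval_s: int    # how often the data source is re-checked
    detail_page_interval_s: float    # minimum delay between detail page requests
    link_extraction_rule: str        # rule locating the html code block
    link_positions: List[str] = field(default_factory=list)  # position of at least one link
    login: Optional[Credentials] = None

    def validate(self) -> None:
        """Reject configurations that could not drive a collection run."""
        if self.source_update_interval_s <= 0 or self.detail_page_interval_s <= 0:
            raise ValueError("intervals must be positive")
        if not self.link_extraction_rule and not self.link_positions:
            raise ValueError("an extraction rule or a link position is required")
```

Keeping these values in one validated record is what allows a new website to be added by configuration rather than by writing site-specific code.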
In an embodiment, with continued reference to fig. 4, the acquisition unit 303 may include:
a first acquisition unit 3031, configured to, when the acquisition mode is the html acquisition mode, acquire the corresponding html code block according to the link extraction rule of the target webpage and search the html code block for the href attribute value to be used as link data;
and a second acquisition unit 3032, configured to, when the acquisition mode is the simulator acquisition mode, load the target webpage through the simulator and collect all link data in the target webpage in units of element blocks based on the position of at least one link in the target webpage.
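The html acquisition path can be illustrated with Python's standard `html.parser`. In this sketch the link extraction rule is reduced to the `id` of the enclosing element, which is an assumption; the handling of empty and relative addresses follows the checks recited in the claims (skip empty addresses, keep those starting with http, splice the rest onto a base path).

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, block_id: str) -> None:
        super().__init__()
        self.block_id = block_id
        self.depth = 0  # greater than 0 while inside the target block
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.depth == 0:
            if attributes.get("id") == self.block_id and tag not in VOID_TAGS:
                self.depth = 1  # entered the target block
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def extract_links(html: str, block_id: str, base_url: str) -> List[str]:
    """Return absolute link addresses found inside the configured block."""
    parser = BlockLinkParser(block_id)
    parser.feed(html)
    result = []
    for href in parser.links:
        if not href:
            continue  # empty address: nothing to store
        if href.startswith("http"):
            result.append(href)
        else:
            result.append(urljoin(base_url, href))  # splice a relative path
    return result
```

A production collector would more likely accept a CSS selector or XPath expression as the extraction rule; the id-based lookup is used here only to keep the example within the standard library.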
The interface data transmission device provided by the embodiment of the invention can determine the target webpage and acquire the element information in the target webpage; set the acquisition mode according to the type of the target webpage and the element information, the acquisition mode including an html acquisition mode and a simulator acquisition mode; set the configuration parameters according to the target webpage and acquire the link data in the target webpage based on the acquisition mode and the configuration parameters; and transmit the link data, together with the corresponding title information and detail page information, to the database for storage. The scheme provided by the embodiment of the application can select the corresponding acquisition mode and configuration parameters according to the target webpage, so that webpage links are extracted rapidly and acquisition efficiency is effectively improved.
Embodiments of the present invention also provide a terminal. As shown in fig. 5, the terminal may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 5 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
The RF circuit 601 may be used for receiving and transmitting signals during messaging or a call. In particular, after downlink information is received from a base station, it is handed to one or more processors 608 for processing; in addition, uplink data is transmitted to the base station. Typically, the RF circuit 601 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 performs various functional applications and information processing by executing the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the terminal (such as audio data, a phonebook, etc.), and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 with access to the memory 602.
The input unit 603 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 603 may include a touch-sensitive surface, as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch-sensitive surface using any suitable object or accessory such as a finger or stylus), and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends them to the processor 608, and can receive and execute commands from the processor 608. In addition, touch-sensitive surfaces may be implemented in a variety of types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 603 may comprise other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, etc.
The display unit 604 may be used to display information input by a user or information provided to the user, as well as the various graphical user interfaces of the terminal, which may be composed of graphics, text, icons, video and any combination thereof. The display unit 604 may include a display panel, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel; upon detecting a touch operation on or near it, the touch-sensitive surface passes the operation to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel based on the type of touch event. Although in fig. 5 the touch-sensitive surface and the display panel are implemented as two separate components for the input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement the input and output functions.
The terminal may also include at least one sensor 605, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor, which may turn off the display panel and/or backlight when the terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when the mobile phone is stationary, the magnitude and direction of gravity; it can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration recognition functions (such as a pedometer and tap detection). Other sensors that the terminal may also be configured with, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
The audio circuit 606, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 606 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 606 and converted into audio data. The audio data is output to the processor 608 for processing and then sent via the RF circuit 601 to, for example, another terminal, or output to the memory 602 for further processing. The audio circuit 606 may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the terminal can help the user send and receive e-mail, browse web pages, access streaming media and the like, providing the user with wireless broadband internet access. Although fig. 5 shows the WiFi module 607, it is understood that it is not an essential part of the terminal and can be omitted as required without changing the essence of the invention.
The processor 608 is a control center of the terminal, and connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions of the terminal and processes data by running or executing software programs and/or modules stored in the memory 602, and calling data stored in the memory 602, thereby performing overall monitoring of the mobile phone. Optionally, the processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The terminal also includes a power supply 609 (e.g., a battery) for powering the various components, which may be logically connected to the processor 608 via a power management system so as to provide for managing charging, discharging, and power consumption by the power management system. The power supply 609 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 608 in the terminal loads executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 608 executes the application programs stored in the memory 602, so as to implement various functions:
determining a target webpage and acquiring element information in the target webpage;
setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode;
setting configuration parameters according to the target webpage, and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
In the foregoing embodiments, each description has its own emphasis; for portions of an embodiment that are not described in detail, refer to the detailed description of the interface data transmission method above, which is not repeated here.
As can be seen from the above, the terminal according to the embodiment of the present invention may determine the target web page and obtain the element information in the target web page; set the acquisition mode according to the type of the target web page and the element information, the acquisition mode including an html acquisition mode and a simulator acquisition mode; set the configuration parameters according to the target web page and acquire the link data in the target web page based on the acquisition mode and the configuration parameters; and transmit the link data, together with the corresponding title information and detail page information, to the database for storage. The scheme provided by the embodiment of the application can select the corresponding acquisition mode and configuration parameters according to the target webpage, so that webpage links are extracted rapidly and acquisition efficiency is effectively improved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the interface data transmission methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
determining a target webpage and acquiring element information in the target webpage;
setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode;
setting configuration parameters according to the target webpage, and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Because the instructions stored in the storage medium can execute the steps of any interface data transmission method provided by the embodiments of the present invention, they can achieve the beneficial effects of any such method; see the previous embodiments for details, which are not repeated here.
The foregoing describes in detail a method, apparatus, terminal and storage medium for transmitting interface data provided by the embodiments of the present invention, and specific examples are applied to describe the principles and implementations of the present invention, where the descriptions of the foregoing embodiments are only used to help understand the method and core idea of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present invention, the present description should not be construed as limiting the present invention.

Claims (10)

1. An interface data transmission method, comprising:
determining a target webpage and acquiring element information in the target webpage;
setting an acquisition mode according to the type of the target webpage and the element information, wherein the acquisition mode comprises an html acquisition mode and a simulator acquisition mode;
setting configuration parameters according to the target webpage, and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
2. The interface data transmission method as claimed in claim 1, wherein the step of setting the collection mode according to the type of the target web page and the element information comprises:
acquiring the security level of the target webpage and the access frequency of the domain name corresponding to the target webpage;
calculating the number of links contained in the element information in the target webpage;
and determining a target acquisition mode from the html acquisition mode and the simulator acquisition mode according to the security level, the access frequency, the number of links, and the weights corresponding to the access frequency and the number of links.
3. The interface data transmission method of claim 1, wherein the configuration parameters include a data source update frequency of the target web page, a detail page access frequency of the target web page, a link extraction rule and a location of at least one link of the target web page, and login information of the target web page.
4. The interface data transmission method of claim 3, wherein the step of collecting link data in the target web page based on the collection mode and the configuration parameter comprises:
when the acquisition mode is an html acquisition mode, acquiring a corresponding html code block according to a link extraction rule of the target webpage;
and searching for the href attribute value in the html code block as link data.
5. The interface data transmission method of claim 4, wherein before acquiring the corresponding html code block, the method further comprises:
setting the link extraction rule into a uniform data format;
and pulling data according to the data format, obtaining the encoding of the header string, and converting the encoding of the header string to a preset encoding format.
6. The interface data transmission method of claim 4, wherein after searching for an href attribute value in the html code block as link data, the method further comprises:
judging whether the address of the link data is empty or not;
if not, further judging whether the address of the link data starts with http;
if the address does not start with http, splicing the detail page with the address of the link data according to the relative path configuration to serve as a new link data address.
7. The interface data transmission method of claim 3, wherein the step of collecting link data in the target web page based on the collection mode and the configuration parameter comprises:
when the acquisition mode is a simulator acquisition mode, loading the target webpage through a simulator;
and collecting all link data in the unit of element blocks based on the position of at least one link in the target webpage.
8. An interface data transmission device, comprising:
the determining unit is used for determining a target webpage and acquiring element information in the target webpage;
the setting unit is used for setting a collection mode according to the type of the target webpage and the element information, wherein the collection mode comprises an html collection mode and a simulator collection mode;
the acquisition unit is used for setting configuration parameters according to the target webpage and acquiring link data in the target webpage based on the acquisition mode and the configuration parameters;
and the storage unit is used for transmitting the link data, and title information and detail page information corresponding to the link data to a database and storing the same.
9. A terminal, the terminal comprising: a memory, a processor, wherein the memory has stored thereon an application handler, which when executed by the processor, implements the steps of the interface data transmission method according to any one of claims 1 to 7.
10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the interface data transmission method of any one of claims 1 to 7.
CN202310042516.1A 2023-01-28 2023-01-28 Interface data transmission method, device, terminal and storage medium Pending CN116070052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310042516.1A CN116070052A (en) 2023-01-28 2023-01-28 Interface data transmission method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310042516.1A CN116070052A (en) 2023-01-28 2023-01-28 Interface data transmission method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN116070052A (en) 2023-05-05

Family

ID=86179711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310042516.1A Pending CN116070052A (en) 2023-01-28 2023-01-28 Interface data transmission method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN116070052A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574010A (en) * 2023-11-03 2024-02-20 中信建投证券股份有限公司 Data acquisition method, device, equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150255086A1 (en) * 2014-03-07 2015-09-10 Ebay Inc. Interactive voice response interface for webpage navigation
CN106484828A (en) * 2016-09-29 2017-03-08 西南科技大学 A kind of distributed interconnection data Fast Acquisition System and acquisition method
CN107784113A (en) * 2017-11-08 2018-03-09 深圳市科盾科技有限公司 Html web page collecting method, device and computer-readable recording medium
CN108304498A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Webpage data acquiring method, device, computer equipment and storage medium
CN109413050A (en) * 2018-10-05 2019-03-01 国网湖南省电力有限公司 A kind of internet vulnerability information acquisition method that access rate is adaptive and system
CN109857956A (en) * 2019-01-25 2019-06-07 四川大学 The automatic abstracting method of news web page key message based on label and blocking characteristic
CN110334259A (en) * 2019-04-22 2019-10-15 新分享科技服务(深圳)有限公司 Webpage data acquiring method, device and computer readable storage medium
CN110489626A (en) * 2019-08-05 2019-11-22 苏州闻道网络科技股份有限公司 A kind of information collecting method and device
CN110929184A (en) * 2018-09-19 2020-03-27 北京国双科技有限公司 Link display method, system, storage medium and processor
CN111291288A (en) * 2020-01-22 2020-06-16 奇安信科技集团股份有限公司 Webpage link extraction method and system
CN113849718A (en) * 2021-09-28 2021-12-28 上海烟草集团有限责任公司 Internet tobacco science and technology information automatic acquisition device, method and storage medium

Similar Documents

Publication Publication Date Title
CN106970790B (en) Application program creating method, related equipment and system
CN106708496B (en) Processing method and device for label page in graphical interface
CN111125269B (en) Data management method, blood relationship display method and related device
CN108156508B (en) Barrage information processing method and device, mobile terminal, server and system
US20150091935A1 (en) Method and device for browsing web under weak light with mobile terminal browser
CN104104711B (en) Reading histories treating method and apparatus
CN103279574A (en) Method, device and terminal device for loading explorer pictures
CN103513987A (en) Rendering treatment method, device and terminal device for browser web page
CN103699595A (en) Method and device for webpage caching of terminal browser and terminal
CN104182429A (en) Web page processing method and terminal
CN111078986B (en) Data retrieval method, device and computer readable storage medium
CN107247691A (en) A kind of display methods of text message, device, mobile terminal and storage medium
CN108073647B (en) Webpage display method and device
CN110032493A (en) Monitoring method, device, terminal and the readable storage medium storing program for executing of the page
CN105955597A (en) Method and device for displaying information
CN114357278B (en) Topic recommendation method, device and equipment
CN105868319B (en) Webpage loading method and device
CN110674444B (en) Method and terminal for downloading dynamic webpage
CN108984374B (en) Method and system for testing database performance
CN116070052A (en) Interface data transmission method, device, terminal and storage medium
CN108763297A (en) Web page resources processing method, device and mobile terminal
WO2015096660A1 (en) Methods and devices for displaying a webpage
US10140265B2 (en) Apparatuses and methods for phone number processing
CN108268232B (en) Picture display method, device, system and storage medium
CN105095161B (en) Method and device for displaying rich text information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230505
