CN108334560B

CN108334560B - Information acquisition method and related equipment

Info

Publication number: CN108334560B
Application number: CN201810009236.XA
Authority: CN
Inventors: 王策; 张锋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-01-03
Filing date: 2018-01-03
Publication date: 2022-04-15
Anticipated expiration: 2038-01-03
Also published as: CN108334560A

Abstract

The embodiment of the invention discloses an information acquisition method and related equipment, which comprises the following steps: firstly, acquiring a first traversal path of an attribute name and a second traversal path of an attribute value; then obtaining the attribute name from page information according to the first traversal path and obtaining the attribute value from the page information according to the second traversal path; and then establishing a mapping relation between the attribute name and the attribute value as an information acquisition result to be output. By adopting the embodiment of the invention, the accuracy of information acquisition can be improved.

Description

Information acquisition method and related equipment

Technical Field

The invention relates to the technical field of computers, in particular to an information acquisition method and related equipment.

Background

At present, information on the internet is carried mainly by text. Information acquisition can structure the information contained in text into an organized form such as a table: raw text, such as web page data or plain text content, is input into an information acquisition system, which outputs information points in a fixed format. Information points are obtained from a variety of documents and then integrated in a unified form, so that information can be obtained efficiently from a large number of documents. Information acquisition is generally implemented with the XML Path Language (XPath). In current information acquisition methods the attribute names are fixed: an XPath is configured only for the attribute value corresponding to each required attribute name, and the specific content of the attribute value is obtained through that XPath from the Document Object Model tree (DOM tree) corresponding to the text. For example, fig. 1 shows the infobox information of the encyclopedia entry "XXX", where "company name", "foreign language name" and the like are attribute names, and "Shenzhen XXX Co., Ltd." and "ABC" are the corresponding attribute values. When the information constituting the infobox is obtained, "company name" and "foreign language name" are fixed, and "Shenzhen XXX Co., Ltd." and "ABC" are obtained through their XPaths from the DOM tree corresponding to the HTML text content of the encyclopedia page of "XXX".

However, attribute values overlap heavily across pages while attribute names differ greatly. For example, the attribute name corresponding to the attribute value "internet" in fig. 1 is "business scope", but in the infobox information of the encyclopedia entry "internet", the attribute name of "internet" is "Chinese name". Therefore, fixing the attribute names and applying XPath only to the attribute values results in low accuracy of information acquisition.

Disclosure of Invention

The embodiment of the invention provides an information acquisition method and related equipment, which can improve the accuracy of information acquisition.

A first aspect of the present invention provides an information acquisition method, including:

acquiring a first traversal path of the attribute name and a second traversal path of the attribute value;

acquiring the attribute name from page information according to the first traversal path and acquiring the attribute value from the page information according to the second traversal path;

and establishing a mapping relation between the attribute name and the attribute value, and outputting the mapping relation as an information acquisition result.

Wherein the establishing of the mapping relationship between the attribute name and the attribute value comprises:

acquiring a first mapping label of the attribute name and a second mapping label of the attribute value;

and establishing a mapping relation between the attribute name and the attribute value according to the first mapping label and the second mapping label.

Wherein the establishing a mapping relationship between the attribute name and the attribute value according to the first mapping label and the second mapping label comprises:

and when the first mapping label is the same as the second mapping label, establishing the mapping relation between the attribute name and the attribute value.

Wherein the obtaining the attribute name from the page information according to the first traversal path and obtaining the attribute value from the page information according to the second traversal path includes:

creating a structure traversal tree according to the page information, wherein the structure traversal tree comprises a plurality of content nodes;

and traversing the plurality of content nodes on the structure traversal tree, acquiring the attribute name according to the first traversal path and acquiring the attribute value according to the second traversal path.

Wherein the obtaining of the first traversal path of the attribute name and the second traversal path of the attribute value includes:

acquiring the attribute name and the attribute identification of the attribute value;

and acquiring the first traversal path and the second traversal path from a configuration file according to the attribute identification, wherein the configuration file comprises the attribute identification and the corresponding relation between the attribute identification and the first traversal path and the second traversal path.

Wherein the obtaining of the attribute name and the attribute identifier of the attribute value includes:

acquiring a uniform resource locator of the page information;

and acquiring the attribute name and the attribute identifier of the attribute value according to the uniform resource locator.

Wherein the obtaining the attribute name from the page information according to the first traversal path includes:

determining the type of the attribute name;

and if the attribute name is the open attribute name, acquiring the attribute name according to the first traversal path of the attribute name.

Accordingly, a second aspect of the present invention provides an information acquisition apparatus comprising:

the path acquisition module is used for acquiring a first traversal path of the attribute name and a second traversal path of the attribute value;

the information acquisition module is used for acquiring the attribute name from the page information according to the first traversal path and acquiring the attribute value from the page information according to the second traversal path;

and the result output module is used for establishing the mapping relation between the attribute name and the attribute value and outputting the mapping relation as an information acquisition result.

Wherein the result output module is specifically configured to:

The information acquisition module is specifically configured to:

Wherein the path acquisition module is specifically configured to:

acquiring a uniform resource locator of the page information;

The information acquisition module is specifically configured to:

determining the type of the attribute name;

In a third aspect, the present invention provides an information acquisition apparatus, including a processor, a memory and a communication bus, where the communication bus is used to implement connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the information acquisition method provided by the first aspect.

In one possible design, the information acquisition device provided by the invention may include corresponding modules for executing the method. The modules may be software and/or hardware.

Yet another aspect of the present invention provides a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the method of the above-described aspects.

Yet another aspect of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above aspects.

In the embodiment of the invention, a first traversal path of an attribute name and a second traversal path of an attribute value are obtained first; the attribute name is then obtained from page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, a mapping relation between the attribute name and the attribute value is established and output as the information acquisition result. Because traversal paths are used to obtain both the attribute name and the attribute value, and the two are then mapped to each other, the accuracy of information acquisition is improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a diagram illustrating an information acquisition result provided by a prior art solution;

fig. 2 is a schematic structural diagram of an information acquisition system according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of an information obtaining method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a DOM tree according to an embodiment of the present invention;

fig. 5 is a schematic flowchart of another information acquisition method according to an embodiment of the present invention;

fig. 6 is a schematic flowchart of another information acquisition method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an information acquisition apparatus according to an embodiment of the present invention;

fig. 8 is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 2, fig. 2 is a schematic structural diagram of an information acquisition system according to an embodiment of the present invention, where the information acquisition system includes a user device 201 and a server 202. The server 202 may be a website (Web) server capable of providing Web browsing services. The user device 201 may be a device that provides voice and/or data connectivity to a user; it may be connected to a computing device such as a laptop or desktop computer, or it may be a standalone device such as a Personal Digital Assistant (PDA). The server is configured to receive a service request sent by the user device, where the service request is used to request to browse page information; parse the Uniform Resource Locator (URL) of the page information; obtain the traversal path of an attribute name and the traversal path of an attribute value from a configuration file; obtain the attribute name according to the traversal path of the attribute name and the attribute value according to the traversal path of the attribute value; and finally establish a correspondence between the attribute name and the attribute value as an information acquisition result and send the result to the user device. The user device is configured to send the service request to the server and to obtain the information acquisition result from the server for display.

Based on the information acquisition system, as shown in fig. 3, an information acquisition method provided by an embodiment of the present invention includes the following steps:

S301, the system loads the configuration files pattern.conf and xpath.conf. The pattern.conf file includes presentation patterns (pattern) and the ID number (pattern_id) corresponding to each presentation pattern; for example, as shown in table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The xpath.conf file includes the pattern_id, the attribute name, and the XPath of the attribute value. For example, as shown in table 2, pattern_id 0 covers two attribute names, "name" and "profile", each with the XPath of its attribute value, and pattern_id 1 covers one attribute name, "tag", with the XPaths of its two attribute values.

S302, the pattern_id is obtained from the pattern.conf file by parsing the URL of the page information, and the XPath of each attribute value is then obtained from the xpath.conf file according to the pattern_id.

S303, a DOM tree of the page information is created, the XPath of each attribute value under the pattern_id is traversed, and the node content obtained from the DOM tree according to each XPath is taken as the corresponding attribute value.

S304, according to the correspondence between attribute names and attribute values, the result is output in a form such as <attribute name, attribute value>. If one attribute name corresponds to M attribute values, the result may be output in a form such as <attribute name, attribute value 1, attribute value 2, …, attribute value M>; for example, the attribute name "tag" may correspond to the two attribute values "stock name" and "company", in which case <tag, stock name, company> may be output.

Table 1. pattern.conf file

pattern_id	pattern
0	^https://baike\.baidu\.com/item/.+/\d+$
1	^https://baike\.baidu\.com/subview/\d+/\d+\.htm$
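The lookup that Table 1 supports, from a page URL to a pattern_id, can be sketched roughly as follows. This is a minimal Python sketch, not the patented implementation: the two regular expressions are copied from Table 1, but holding them in a dict, the function name `match_pattern_ids`, and the sample URLs are assumptions made for illustration, since the on-disk layout of pattern.conf is not given.

```python
import re

# pattern_id -> presentation pattern, copied from Table 1.
# Keeping them in a dict is an assumption; the file layout of
# pattern.conf is not specified in the text.
PATTERNS = {
    0: r"^https://baike\.baidu\.com/item/.+/\d+$",
    1: r"^https://baike\.baidu\.com/subview/\d+/\d+\.htm$",
}


def match_pattern_ids(url):
    """Return the list of pattern_ids whose regular expression matches the URL."""
    return [pid for pid, pattern in PATTERNS.items() if re.match(pattern, url)]


# Hypothetical URLs, used only to exercise the two patterns.
print(match_pattern_ids("https://baike.baidu.com/item/XXX/12345"))       # [0]
print(match_pattern_ids("https://baike.baidu.com/subview/123/456.htm"))  # [1]
print(match_pattern_ids("https://example.com/"))                         # []
```

Returning a list rather than a single id mirrors the "pattern_id list" mentioned in the description; a URL that matches no pattern simply yields an empty list.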

Table 2. xpath.conf file

However, across different pages attribute values overlap heavily while attribute names differ greatly, so the attribute name that corresponds to the attribute value at a given XPath may be completely different from page to page. Fixing the attribute names and applying XPath only to the attribute values therefore results in low accuracy of information acquisition. To solve this problem, the present invention proposes the following solution.

Referring to fig. 4, fig. 4 is a schematic flow chart diagram of another information obtaining method according to an embodiment of the present invention, where the method includes, but is not limited to, the following steps:

S401, a first traversal path of the attribute name and a second traversal path of the attribute value are obtained.

In a specific implementation, a service request sent by the user equipment may be received first, where the service request is used to request page information, and the URL of the page information is then obtained. The attribute identification of the attribute name and the attribute value is obtained according to the URL. Finally, the first traversal path and the second traversal path are obtained from a configuration file according to the attribute identification, where the configuration file includes the attribute identification and its correspondence with the first traversal path and the second traversal path.

The system includes the configuration files pattern.conf and xpath.conf. The pattern.conf file includes patterns and the pattern_id corresponding to each pattern, for example as shown in table 1. The xpath.conf file includes the pattern_id, the attribute name, and the XPath of the attribute value. An attribute name in the xpath.conf file is either a specific attribute name or an attribute name in XPath form. An attribute name in XPath form is an open attribute name, and the attribute value corresponding to an open attribute name is an open attribute value; note that each open attribute name has exactly one corresponding open attribute value. For example, as shown in table 3, pattern_id 0 corresponds to two attribute names. The first, shown in the first row of table 3, is the specific attribute name "name", whose attribute value has the XPath "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second, shown in the second row of table 3, is an attribute name in XPath form, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]", whose attribute value has the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[1]".

Table 3. modified xpath.conf file

For example, after a service request is received, the configuration files pattern.conf and xpath.conf are loaded first, and the URL of the page information requested by the user equipment is obtained. The URL is matched against the regular expressions in pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific attribute names and the XPaths of their attribute values, together with the XPaths of the open attribute names and the XPaths of the corresponding open attribute values, are obtained from xpath.conf. For instance, a pattern_id is first obtained from the pattern.conf file shown in table 1, the attribute names and XPaths under that pattern_id are then obtained from the xpath.conf file shown in table 3, and the configuration information list shown in table 4 is generated. The configuration information list includes pattern_id 0, the specific attribute name "name" and the XPath "/html/body/div[4]/div[2]/div/div[2]/dd/h1" of its attribute value, the XPath "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" of the open attribute name, and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" of the corresponding open attribute value.

TABLE 4 configuration information List

pattern_id	Attribute name/attribute value	XPath
0	name	/html/body/div[4]/div[2]/div/div[2]/dd/h1
0	Open attribute name	/html/body/div[4]/div[2]/div/dl[1]/dt[1]
0	Open attribute value	/html/body/div[4]/div[2]/div/dl[1]/dd[1]
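One way to picture the configuration information list is as a list of small records filtered by the matched pattern_id list. The Python sketch below is illustrative only: the record type `ConfigRow`, the role labels, and the function `rows_for` are hypothetical names, and the rows follow the entries described in the text for pattern_id 0.

```python
from typing import NamedTuple


class ConfigRow(NamedTuple):
    pattern_id: int
    role: str   # "name", "open_name" or "open_value" (labels invented for this sketch)
    xpath: str


# Rows as described for pattern_id 0: one specific attribute name ("name")
# and one open attribute name/value pair.
XPATH_CONF = [
    ConfigRow(0, "name", "/html/body/div[4]/div[2]/div/div[2]/dd/h1"),
    ConfigRow(0, "open_name", "/html/body/div[4]/div[2]/div/dl[1]/dt[1]"),
    ConfigRow(0, "open_value", "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"),
]


def rows_for(pattern_ids, conf=XPATH_CONF):
    """Collect the configuration rows that belong to the matched pattern_ids."""
    wanted = set(pattern_ids)
    return [row for row in conf if row.pattern_id in wanted]


for row in rows_for([0]):
    print(row.pattern_id, row.role, row.xpath)
```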

S402, obtaining the attribute name from the page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path.

In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes a plurality of content nodes; and traversing the plurality of content nodes on the structure traversal tree, acquiring the attribute name according to the first traversal path and acquiring the attribute value according to the second traversal path.

Optionally, before the attribute name is obtained from the page information according to the first traversal path, the type of the attribute name may be determined. If the attribute name is determined to be an open attribute name (in XPath form), the attribute name is obtained from the page information according to the first traversal path. If the attribute name is a specific attribute name, such as "company business" or "development history", then "company business" or "development history" is used directly as the attribute name, and in this case it is not necessary to obtain the attribute name from the page information according to the first traversal path.
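The type check can be sketched as below. The text only says that an open attribute name is stored in XPath form; treating "starts with a slash" as the test for XPath form is an assumption of this sketch, and `evaluate_xpath` is a hypothetical callable standing in for the DOM traversal.

```python
def is_open_attribute_name(configured_name):
    """Assumed heuristic: an entry written as a path is an open attribute name."""
    return configured_name.startswith("/")


def resolve_attribute_name(configured_name, evaluate_xpath):
    """Return the attribute name to use for this configuration entry."""
    if is_open_attribute_name(configured_name):
        # Open attribute name: read the actual name from the page.
        return evaluate_xpath(configured_name)
    # Specific attribute name: use it as configured, no traversal needed.
    return configured_name


# Stand-in for a DOM lookup, used only to make the sketch runnable.
fake_page = {"/html/body/div[4]/div[2]/div/dl[1]/dt[1]": "foreign language name"}

print(resolve_attribute_name("company business", fake_page.get))
print(resolve_attribute_name("/html/body/div[4]/div[2]/div/dl[1]/dt[1]", fake_page.get))
```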

For example, as shown in fig. 5, the HTML page information is parsed to generate the corresponding DOM tree. The DOM tree contains a plurality of content nodes, each representing an HTML tag or the text content within an HTML tag. After the DOM tree is created, the content nodes are traversed according to each XPath in the configuration information list shown in table 4, and the content of the matching node is taken as the information value of that XPath. For example, when the XPath is html/head/title, the html node, the head node and the title node in the DOM tree shown in fig. 5 are traversed in turn, and the text content "My title" of the title node is obtained as the information value of the XPath. The information value of each XPath is obtained in this way along its own traversal path, and the attribute information list shown in table 5 is finally generated. The attribute information list includes the specific attribute name "name" with the XPath information value "XXX", the open attribute name with the XPath information value "foreign language name", and the open attribute value with the XPath information value "ABC".

TABLE 5 Attribute information List

Attribute name/attribute value	XPath information value
name	XXX
Open attribute name	Foreign language name
Open attribute value	ABC
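The traversal described for the html/head/title example can be reproduced with a small Python sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, so it is a stand-in for whatever DOM parser a real system would use. The sample page is hypothetical apart from the title text "My title", and `information_value` is a name chosen for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; only the title text comes from the description.
PAGE = (
    "<html><head><title>My title</title></head>"
    "<body><dl><dt>foreign language name</dt><dd>ABC</dd></dl></body></html>"
)


def information_value(root, xpath):
    """Return the text of the node found at `xpath`, or None if nothing matches."""
    # ElementTree evaluates paths relative to an element, so the root is placed
    # under a synthetic parent to let the path start at the html node.
    holder = ET.Element("document")
    holder.append(root)
    node = holder.find(xpath.strip("/"))
    if node is None:
        return None
    return "".join(node.itertext()).strip()


root = ET.fromstring(PAGE)
print(information_value(root, "html/head/title"))         # My title
print(information_value(root, "/html/body/dl[1]/dt[1]"))  # foreign language name
print(information_value(root, "/html/body/dl[1]/dd[1]"))  # ABC
```

The same function serves both traversal paths: the first path yields the attribute name and the second yields the attribute value.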

And S403, establishing a mapping relation between the attribute name and the attribute value, and outputting the mapping relation as an information acquisition result.

In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is used as its attribute value. If the attribute name is an open attribute name, the XPath information value of the open attribute name is mapped to the XPath information value of the open attribute value. In both cases the result is output in the format <attribute name, attribute value>.

For example, in the attribute information list shown in table 5, the attribute value corresponding to the specific attribute name "name" is "XXX", and the attribute value corresponding to the open attribute name "foreign language name" is the XPath information value "ABC" of the open attribute value. They can be output as <name, XXX> and <foreign language name, ABC>.

In the embodiment of the invention, a first traversal path of an attribute name and a second traversal path of an attribute value are obtained first; the attribute name is then obtained from page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, a mapping relation between the attribute name and the attribute value is established and output as the information acquisition result. Because traversal paths are used to obtain both the attribute name and the attribute value, and the two are then mapped to each other, the accuracy of information acquisition is improved.
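Step S403 for the single open pair can be sketched as follows. The input rows mirror the attribute information list of table 5; the function names `build_result` and `format_pair` and the use of plain tuples are assumptions of this sketch, not details given in the text.

```python
def build_result(attribute_info):
    """Turn (entry, information value) rows into (attribute name, attribute value) pairs."""
    info = dict(attribute_info)
    open_name = info.pop("open attribute name", None)
    open_value = info.pop("open attribute value", None)
    # Specific attribute names map directly to their information values.
    pairs = list(info.items())
    # The open attribute name read from the page is mapped to the open attribute value.
    if open_name is not None and open_value is not None:
        pairs.append((open_name, open_value))
    return pairs


def format_pair(name, value):
    """Render one pair in the <attribute name, attribute value> output form."""
    return f"<{name}, {value}>"


TABLE_5 = [
    ("name", "XXX"),
    ("open attribute name", "foreign language name"),
    ("open attribute value", "ABC"),
]

for name, value in build_result(TABLE_5):
    print(format_pair(name, value))
# <name, XXX>
# <foreign language name, ABC>
```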

Referring to fig. 6, fig. 6 is a schematic flow chart diagram of another information obtaining method according to an embodiment of the present invention, where the method includes, but is not limited to, the following steps:

S601, a first traversal path of the attribute name and a second traversal path of the attribute value are obtained.

The system includes the configuration files pattern.conf and xpath.conf. The pattern.conf file includes patterns and the pattern_id corresponding to each pattern, for example as shown in table 1. The xpath.conf file includes the pattern_id, the attribute name, and the XPath of the attribute value. An attribute name in the xpath.conf file is either a specific attribute name or an attribute name in XPath form. An attribute name in XPath form is an open attribute name, and the attribute value corresponding to an open attribute name is an open attribute value; note that each open attribute name has exactly one corresponding open attribute value. For example, as shown in table 6, pattern_id 0 corresponds to two kinds of attribute name. The first, shown in the first row of table 6, is the specific attribute name "name", whose attribute value has the XPath "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind, shown in the second and third rows of table 6, is the attribute name in XPath form: "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dt[2]", whose attribute values have the XPaths "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dd[2]" respectively.

Table 6. modified xpath.conf file

For example, after a service request is received, the configuration files pattern.conf and xpath.conf are loaded first, and the URL of the page information requested by the user device is obtained. The URL is matched against the regular expressions in pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific attribute names and the XPaths of their attribute values, together with the XPaths of the open attribute names and the XPaths of the corresponding open attribute values, are obtained from xpath.conf. If there are n open attribute names in total, they are named open attribute name_1, open attribute name_2, …, open attribute name_n, and the corresponding open attribute values are named open attribute value_1, open attribute value_2, …, open attribute value_n. For instance, a pattern_id is first obtained from the pattern.conf file shown in table 1, the attribute names and XPaths under that pattern_id are then obtained from the xpath.conf file shown in table 6, and the configuration information list shown in table 7 is generated. The configuration information list includes pattern_id 0; the specific attribute name "name" and the XPath "/html/body/div[4]/div[2]/div/div[2]/dd/h1" of its attribute value; the XPath "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" of open attribute name_1 and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" of the corresponding open attribute value_1; and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" of open attribute name_2 and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[2]" of the corresponding open attribute value_2.

TABLE 7 configuration information List

pattern_id	Attribute name/attribute value	XPath
0	name	/html/body/div[4]/div[2]/div/div[2]/dd/h1
0	Open attribute name_1	/html/body/div[4]/div[2]/div/dl[1]/dt[1]
0	Open attribute value_1	/html/body/div[4]/div[2]/div/dl[1]/dd[1]
0	Open attribute name_2	/html/body/div[4]/div[2]/div/dl[1]/dt[2]
0	Open attribute value_2	/html/body/div[4]/div[2]/div/dl[1]/dd[2]
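The numbering of n open attribute names and values can be sketched as below. The XPaths are those of table 7; the function name `label_open_entries` and the representation of each configured pair as a (name XPath, value XPath) tuple are assumptions made for this illustration.

```python
def label_open_entries(open_pairs):
    """Name each configured open pair open attribute name_n / open attribute value_n."""
    rows = []
    for n, (name_xpath, value_xpath) in enumerate(open_pairs, start=1):
        rows.append((f"open attribute name_{n}", name_xpath))
        rows.append((f"open attribute value_{n}", value_xpath))
    return rows


BASE = "/html/body/div[4]/div[2]/div/dl[1]"
OPEN_PAIRS = [
    (f"{BASE}/dt[1]", f"{BASE}/dd[1]"),
    (f"{BASE}/dt[2]", f"{BASE}/dd[2]"),
]

for entry, xpath in label_open_entries(OPEN_PAIRS):
    print(entry, xpath)
```

Because a name and its value receive the same index n at configuration time, that index can later serve as the shared label when the two are paired.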

S602, obtaining the attribute name from the page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path.

For example, as shown in fig. 5, the HTML page information is parsed to generate the corresponding DOM tree. The DOM tree contains a plurality of content nodes, each representing an HTML tag or the text content within an HTML tag. After the DOM tree is created, the content nodes are traversed according to each XPath in the configuration information list shown in table 7, and the content of the matching node is taken as the information value of that XPath. For example, when the XPath is html/head/title, the html node, the head node and the title node in the DOM tree shown in fig. 5 are traversed in turn, and the text content "My title" of the title node is obtained as the information value of the XPath. The information value of each XPath is obtained in this way along its own traversal path, and the attribute information list shown in table 8 is finally generated. The attribute information list includes the specific attribute name "name" with the XPath information value "XXX", open attribute name_1 with the XPath information value "foreign language name", open attribute value_1 with the XPath information value "ABC", open attribute name_2 with the XPath information value "headquarters location", and open attribute value_2 with the XPath information value "Shenzhen, China".

TABLE 8 Attribute information List

Attribute name/attribute value	XPath information value
name	XXX
Open attribute name_1	Foreign language name
Open attribute value_1	ABC
Open attribute name_2	Headquarters location
Open attribute value_2	Shenzhen, China

S603, acquiring a first mapping label of the attribute name and a second mapping label of the attribute value.

In a specific implementation, if the entry is open attribute name_n or open attribute value_n, the number "n" in open attribute name_n may be obtained as the first mapping label of the corresponding XPath information value, and the number "n" in open attribute value_n may be obtained as the second mapping label of the corresponding XPath information value, where n may be any positive integer such as 1, 2, 3, …. For example, in the attribute information list shown in table 8, the number "1" in open attribute name_1 is obtained as the first mapping label of the corresponding XPath information value "foreign language name", and the number "1" in open attribute value_1 is obtained as the second mapping label of the corresponding XPath information value "ABC".
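Reading the mapping label out of an entry name can be sketched with a regular expression. The entry spelling and the function name `mapping_label` are assumptions of this sketch; the pattern tolerates optional spaces around the underscore so that both "name_1" and "name _1" are accepted.

```python
import re

# Matches entries such as "open attribute name_1" or "open attribute value _2".
LABEL_RE = re.compile(r"^open attribute (name|value)\s*_\s*(\d+)$", re.IGNORECASE)


def mapping_label(entry):
    """Return (kind, n) for an open entry, or None for a specific attribute name."""
    match = LABEL_RE.match(entry.strip())
    if match is None:
        return None
    return match.group(1).lower(), int(match.group(2))


print(mapping_label("open attribute name_1"))   # ('name', 1)
print(mapping_label("Open attribute value _2")) # ('value', 2)
print(mapping_label("name"))                    # None
```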

S604, according to the first mapping label and the second mapping label, establishing a mapping relation between the attribute name and the attribute value, and outputting an information acquisition result.

In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is used as the attribute value corresponding to the specific attribute name, and they can be expressed as the following attribute values: attribute value >, for example, in the attribute information list shown in table 8, the attribute value corresponding to the specific attribute name "is" XXX ", and they are output: < name, XXX >.

If the attribute name/attribute value is an open attribute name _ n or an open attribute value _ n, storing an information value of XPath corresponding to the open attribute name _ n as an attribute name in the nth position of the open attribute name list; similarly, the information value of XPath corresponding to the open attribute value _ n is stored as an attribute value in the nth position of the open attribute value list, and the open attribute name _1 to the open attribute name _ n and the open attribute value _1 to the open attribute value _ n in the attribute information list are traversed. And finally, when the first mapping label is the same as the second mapping label, establishing a mapping relation between the attribute name corresponding to the first mapping label and the attribute value corresponding to the second mapping label, and outputting the attribute name and the attribute value according to a form of < attribute name, attribute value >.

For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "foreign language name" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is also 1. Because the two tags are the same, a mapping relation between "foreign language name" and "ABC" is established, and the pair is output in the form <foreign language name: ABC>. Similarly, the first mapping tag of the attribute name "headquarter place" is 2, and the second mapping tag of the attribute value "Shenzhen, China" is also 2, so a mapping relation between "headquarter place" and "Shenzhen, China" can be established, and <headquarter place: Shenzhen, China> is output.
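The pairing step can be illustrated with a short sketch. It assumes the open attribute name list and open attribute value list are represented as dictionaries keyed by mapping tag (a stand-in for "the nth position" of each list); the function name `pair_open_attributes` is hypothetical, and the sample data mirrors the example above.

```python
def pair_open_attributes(open_names, open_values):
    """Pair attribute names and values whose mapping tags are equal.

    Both arguments map a mapping tag n to the text stored at position n.
    Entries whose tag has no counterpart in the other mapping are skipped.
    """
    pairs = []
    for tag in sorted(open_names):
        if tag in open_values:
            pairs.append((open_names[tag], open_values[tag]))
    return pairs


# Sample data taken from the example in the text.
open_names = {1: "foreign language name", 2: "headquarter place"}
open_values = {1: "ABC", 2: "Shenzhen, China"}

for name, value in pair_open_attributes(open_names, open_values):
    print(f"<{name}: {value}>")
```

Keying both collections by tag means a name and a value are matched only when their tags agree, regardless of the order in which the entries were read from the attribute information list.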

Refer to FIG. 7, which is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present invention. The information acquisition apparatus may include:

A path obtaining module 701, configured to obtain a first traversal path of the attribute name and a second traversal path of the attribute value.

An information obtaining module 702, configured to obtain the attribute name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.

A result output module 703, configured to establish a mapping relation between the attribute name and the attribute value, and output the mapping relation as an information acquisition result.

If the attribute name or attribute value is identified as open attribute name_n or open attribute value_n, the number "n" in open attribute name_n is first obtained as the first mapping tag of the information value of the corresponding XPath, and the information value of the XPath corresponding to open attribute name_n is stored as an attribute name in the nth position of the open attribute name list. Similarly, the number "n" in open attribute value_n is obtained as the second mapping tag of the information value of the corresponding XPath, and the information value of the XPath corresponding to open attribute value_n is stored as an attribute value in the nth position of the open attribute value list, where n may be any integer such as 1, 2, 3, …. This is repeated while traversing open attribute name_1 to open attribute name_n and open attribute value_1 to open attribute value_n in the attribute information list. For example, in the attribute information list shown in Table 8, the number "1" in open attribute name_1 is obtained as the first mapping tag of the corresponding XPath information value "foreign language name", and "foreign language name" is stored as an attribute name in the 1st position of the open attribute name list; the number "1" in open attribute value_1 is obtained as the second mapping tag of the corresponding XPath information value "Shenzhen, China", and "Shenzhen, China" is stored as an attribute value in the 1st position of the open attribute value list.

Finally, when the first mapping label is the same as the second mapping label, a mapping relation is established between the attribute name corresponding to the first mapping label and the attribute value corresponding to the second mapping label, and the pair is output in the form <attribute name, attribute value>.

For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "foreign language name" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is also 1. Because the two tags are the same, a mapping relation between "foreign language name" and "ABC" is established, and the pair is output in the form <foreign language name: ABC>. Similarly, the first mapping tag of the attribute name "headquarter place" is 2, and the second mapping tag of the attribute value "Shenzhen, China" is also 2, so a mapping relation between "headquarter place" and "Shenzhen, China" can be established, and <headquarter place: Shenzhen, China> is output.
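The division of work among the three modules can be sketched end to end as below. This is an illustrative sketch under several assumptions, not the apparatus itself: the page fragment, the configuration dictionary, and the function names `get_paths`, `get_information`, and `output_result` are all hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so a real page would likely need an HTML parser and a full XPath engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for the page information.
PAGE = """
<div>
  <h1>XXX</h1>
  <table>
    <tr><th>foreign language name</th><td>ABC</td></tr>
    <tr><th>headquarter place</th><td>Shenzhen, China</td></tr>
  </table>
</div>
""".strip()

# Hypothetical configuration: attribute identifier -> traversal path.
CONFIG = {
    "name": "./h1",
    "open attribute name_1": "./table/tr[1]/th",
    "open attribute value_1": "./table/tr[1]/td",
    "open attribute name_2": "./table/tr[2]/th",
    "open attribute value_2": "./table/tr[2]/td",
}

OPEN_ID = re.compile(r"^open attribute (name|value)_(\d+)$")


def get_paths(config):
    """Path obtaining step: return the configured traversal paths."""
    return dict(config)


def get_information(root, paths):
    """Information obtaining step: evaluate each path against the tree."""
    return {ident: root.findtext(path) for ident, path in paths.items()}


def output_result(extracted):
    """Result output step: pair names and values that share a mapping tag."""
    result, names, values = {}, {}, {}
    for ident, text in extracted.items():
        match = OPEN_ID.match(ident)
        if match is None:
            result[ident] = text  # specific attribute name: use value directly
        elif match.group(1) == "name":
            names[int(match.group(2))] = text
        else:
            values[int(match.group(2))] = text
    for tag in sorted(names):
        if tag in values:
            result[names[tag]] = values[tag]
    return result


root = ET.fromstring(PAGE)
result = output_result(get_information(root, get_paths(CONFIG)))
```

In this sketch the attribute names "foreign language name" and "headquarter place" are never written into the configuration; they are read from the page through their own traversal paths, which is what distinguishes open attribute names from the specific attribute name "name".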

Refer to FIG. 8, which is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present invention. As shown in the figure, the information acquisition apparatus may include at least one processor 801, at least one communication interface 802, at least one memory 803, and at least one communication bus 804.

The processor 801 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, transistor logic, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 804 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 8, but this does not mean that there is only one bus or one type of bus. The communication bus 804 is used to enable communication among the components. The communication interface 802 of the device in the embodiment of the present invention is used for signaling or data communication with other node devices. The memory 803 may include random access memory, such as non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), or magnetoresistive random access memory (MRAM), and may further include other non-volatile memory, such as at least one magnetic disk storage device, an electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid-state disk (SSD). The memory 803 may optionally be at least one storage device located remotely from the processor 801.
A set of program codes is stored in the memory 803 and the processor 801 executes the programs in the memory 803:

Optionally, the processor 801 is further configured to perform the following operation steps:

acquiring a uniform resource locator of the page information;

determining the type of the attribute name;

Further, the processor may cooperate with the memory and the communication interface to perform the operations of the information acquisition apparatus in the above embodiments of the invention.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above-mentioned embodiments further explain the objects, technical solutions and advantages of the present invention in detail. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An information acquisition method, characterized in that the method comprises:

acquiring a first traversal path of an attribute name and a second traversal path of an attribute value from a configuration file, wherein the type of the attribute name is an open attribute name, the attribute name and the attribute value are respectively stored in the configuration file in the form of traversal paths, and the first traversal path and the second traversal path are used for traversing a structural traversal tree created according to page information;

acquiring a first mapping label of the attribute name and a second mapping label of the attribute value, wherein the first mapping label is used for representing the arrangement position of the attribute name in a corresponding attribute name list, and the second mapping label is used for representing the arrangement position of the attribute value in a corresponding attribute value list;

and when the first mapping label is the same as the second mapping label, establishing a mapping relation between the attribute name and the attribute value as an information acquisition result and outputting the information acquisition result.

2. The method of claim 1, wherein the obtaining the attribute name from page information according to the first traversal path and the attribute value from the page information according to the second traversal path comprises:

3. The method of claim 1, wherein the obtaining a first traversal path for an attribute name and a second traversal path for an attribute value comprises:

4. The method of claim 3, wherein said obtaining the attribute name and the attribute identification of the attribute value comprises:

acquiring a uniform resource locator of the page information;

5. The method of any of claims 1-4, wherein the obtaining the attribute name from page information according to the first traversal path comprises:

determining the type of the attribute name;

6. An information acquisition apparatus, characterized in that the apparatus comprises:

a path acquisition module, configured to acquire a first traversal path of an attribute name and a second traversal path of an attribute value from a configuration file, wherein the type of the attribute name is an open attribute name, the attribute name and the attribute value are respectively stored in the configuration file in the form of traversal paths, and the first traversal path and the second traversal path are used for traversing a structural traversal tree created according to page information;

and a result output module, configured to acquire a first mapping label of the attribute name and a second mapping label of the attribute value, wherein the first mapping label is used for representing the arrangement position of the attribute name in a corresponding attribute name list, and the second mapping label is used for representing the arrangement position of the attribute value in a corresponding attribute value list, and, when the first mapping label is the same as the second mapping label, to establish a mapping relation between the attribute name and the attribute value and output the mapping relation as an information acquisition result.

7. The apparatus of claim 6, wherein the information acquisition module is specifically configured to:

8. The apparatus of claim 6, wherein the path acquisition module is specifically configured to:

9. The apparatus of claim 8, wherein the path acquisition module is specifically configured to:

acquiring a uniform resource locator of the page information;

10. The apparatus according to any one of claims 6 to 9, wherein the information obtaining module is specifically configured to:

determining the type of the attribute name;

11. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 5.