CN102043808A

CN102043808A - Method and equipment for extracting bilingual terms using webpage structure

Info

Publication number: CN102043808A
Application number: CN2009102048042A
Authority: CN
Inventors: 刘秋阁; 方高林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Tencent Computer Systems Co Ltd
Priority date: 2009-10-14
Filing date: 2009-10-14
Publication date: 2011-05-04
Anticipated expiration: 2029-10-14
Also published as: CN102043808B

Abstract

The present invention discloses a method for extracting bilingual terms using the structure of the webpage, comprising: searching webpages in a search engine according to a predetermined seed term and saving the webpages; extracting the format of the seed term appearing on the said webpages and extracting other bilingual terms, which have the same format as the format of the seed term, from the webpages. In the present invention, relevant webpages are searched in a search engine and saved according to a predetermined seed term, and then the format of the seed term in the searched webpages is extracted, and other bilingual terms with the same format as that of the seed term from the searched webpages are also extracted, thereby increasing the extraction efficiency of bilingual terms from webpages.

Description

Method and apparatus for extracting bilingual terms using webpage structure

Technical field

The present invention relates to the field of communications, and in particular to a method and apparatus for extracting bilingual terms using webpage structure.

Background art

An electronic dictionary is a digital learning tool that converts a traditional printed dictionary into digital form to allow fast lookup. Computer dictionaries are lightweight and portable, quick to query, and feature-rich, and they are used more and more widely in people's study and daily life. However, the lexicons of existing bilingual electronic dictionaries generally depend on manual entry and editing: the workload is huge, the efficiency is low, the number of entries included is limited, and the entries are not kept up to date.

To address this problem, in recent years people have begun to explore how to extract bilingual dictionaries automatically from the web.

The various forms of bilingual documents found on the Internet can be grouped into three classes:

Paragraph-aligned type: a document of this form usually consists of a paragraph of source-language text interleaved with the corresponding paragraph of target-language text; such documents are mostly bilingual parallel documents;

Table type: such a document usually consists of many rows, each containing a pair of entries in the two languages; this form appears mostly in bilingual glossary files;

Plain-text type: such a document generally consists of bilingual text mixed together, with no particular regularity.

For the first class, paragraph-aligned bilingual documents, the prior art mainly uses information such as the co-occurrence frequency, position, and length of source-language and target-language words to determine the probability that two terms are translations of each other. Extracting bilingual dictionaries from parallel corpora has been studied extensively both abroad and domestically, and there are many methods to draw on. Extracting a bilingual dictionary from a non-parallel corpus is different: the statistical information between words described above cannot be used directly, because a non-parallel corpus has no aligned units and therefore no complete, reliable co-occurrence frequencies or similar association information. The methods for extracting bilingual terms from non-parallel corpora are roughly the following: the "context heterogeneity" method, the word relation matrix method, the partially parallel document processing method, and the context feature distance method.

This class of techniques needs to collect parallel corpora from the Internet and extracts bilingual terms with statistical and linguistic techniques; the extraction efficiency is not high, and noise is easily introduced.

At present there are many bilingual word pairs on the web in documents of the second class (table type) and the third class (plain-text type). These word pairs have certain features. For example, column-type bilingual word pairs are listed as a block on the page, with English or Chinese on the left and the corresponding Chinese or English translation on the right; bracket-type bilingual word pairs use brackets to indicate the translation relationship, with the English inside the brackets being the translation of the Chinese immediately before the brackets. Such entries are numerous, occur frequently, are of high quality, and follow fixed patterns, so they are easy to extract and process.

Existing techniques mostly hard-code extraction according to these distribution features of the word pairs, supplemented by verification against a local dictionary, to extract such entries from the Internet. These techniques can extract bilingual terms from webpages automatically, but the formats they can handle are fairly simple, so their extraction capability is limited.

Summary of the invention

The present invention provides a method and apparatus for extracting bilingual terms using webpage structure, so as to extract bilingual terms from webpages efficiently.

The present invention provides a method for extracting bilingual terms using webpage structure, comprising:

searching for related webpages in a search engine according to preset seed terms, and saving the webpages;

extracting the format in which the seed terms appear in the webpages, and extracting from the webpages other bilingual terms that have the same format as the seed terms.

Searching for related webpages in a search engine according to the preset seed terms and saving them comprises:

using a preset bilingual glossary as the initial seed word list, sending the seed terms in it to a search engine, obtaining the webpages found by the search engine, and saving the corresponding links of the webpages;

downloading documents according to the saved links and saving them as local HTML files.

Saving the corresponding links of the webpages comprises: deduplicating repeated links and saving the deduplicated links.

After saving as local HTML files, the method further comprises:

saving the correspondence between each seed word and the documents downloaded for it.

Extracting the format in which the seed terms appear in the webpages, and extracting from the webpages other bilingual terms that have the same format as the seed terms, comprises:

building a corresponding tag tree from the webpage;

traversing the tag tree, constructing a seed node pair array, and obtaining the nearest common parent node of each seed node pair in the array; a seed node pair comprises the node containing a first seed term and the node containing a second seed term, the first seed term and the second seed term being seed terms in different languages;

finding other nodes parallel to the nearest common parent node, and obtaining the child lists of those other nodes;

traversing each node in the child lists, and extracting and storing the bilingual terms therein.
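As an illustration of the "parallel node" step, here is a minimal Python sketch. It assumes the reading given later in the embodiment, namely that two nodes are parallel when the tag names on their root-to-node paths are identical. The `Node` class and the function names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical tag-tree node; `tag` is the tag name or "#text"."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def tag_path(node):
    """Return the tag names from the root down to `node`."""
    names = []
    while node is not None:
        names.append(node.tag)
        node = node.parent
    return tuple(reversed(names))


def walk(node):
    """Yield `node` and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def parallel_nodes(root, target):
    """Return every other node whose root-to-node tag path equals the target's."""
    wanted = tag_path(target)
    return [n for n in walk(root) if n is not target and tag_path(n) == wanted]
```

For a table with three `tr` rows under `html > body > table`, calling `parallel_nodes(root, first_row)` under these assumptions returns the other two rows, whose child lists would then be scanned for bilingual terms.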

Building a corresponding tag tree from the webpage comprises:

parsing the HTML file and converting it into a corresponding tree structure, in which the <HTML> tag corresponds to the root node of the tree, and other tags and text are arranged as child nodes of the tree according to their nesting relationship in the HTML file;

the tree structure includes the parallel structure of different subtrees and the parallel structure of the same subtree.

Constructing the seed node pair array comprises:

taking the node containing the first seed term in the tag tree, together with the node containing the second seed term that is nearest to it, as a seed word node pair, and storing the pair in the seed node pair array.

Before finding other nodes parallel to the nearest common parent node, the method further comprises:

extracting and storing the positions at which the first seed term and the second seed term appear in their nodes, and establishing the correspondence between the positions of the first seed term and the second seed term;

judging whether the correspondence between the positions of the first seed term and the second seed term satisfies a preset condition, and if so, performing the step of finding other nodes parallel to the nearest common parent node.

Traversing each node in the child lists, and extracting and storing the bilingual terms therein, comprises:

when the node is a non-text node, processing the next child node;

judging whether the length of the text node is less than a preset multiple of the total length of the seed terms; if not, processing the next child node;

judging whether the text node matches a bilingual pattern; if it matches, extracting the first-language part and the second-language part from it;

judging whether the first-language part and the second-language part contain the required characters; if so, saving the first-language part and the second-language part as a pair of candidate bilingual terms, together with the position at which the pair appears in the webpage, and processing the next child node;

after all child nodes have been processed, if the number of nodes yielding candidate bilingual terms is less than a preset number, clearing all candidate bilingual terms collected under that node, and processing the other nodes.

The present invention provides an apparatus for extracting bilingual terms using webpage structure, comprising:

a webpage search unit, configured to search for related webpages in a search engine according to preset seed terms and save them;

a term extraction unit, configured to extract the format in which the seed terms appear in the webpages, and to extract from the webpages other bilingual terms that have the same format as the seed terms.

The webpage search unit is specifically configured to:

The webpage search unit is further configured to:

deduplicate repeated links and save the deduplicated links;

The term extraction unit is specifically configured to:

build a corresponding tag tree from the webpage;

The term extraction unit is specifically configured to:

The term extraction unit is further configured to:

judge whether the correspondence between the positions of the first seed term and the second seed term satisfies a preset condition.

The term extraction unit is further configured to:

when the node is a non-text node, process the next child node;

Compared with the prior art, the present invention has at least the following advantages:

In the present invention, related webpages are searched for in a search engine using preset seed terms and saved; the format in which the seed terms appear in the retrieved webpages is then extracted, and other bilingual terms having the same format as the seed terms are extracted from the retrieved webpages, thereby improving the efficiency of extracting bilingual terms from webpages.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for extracting bilingual terms using webpage structure provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of corpus collection by the corpus collection module in an embodiment of the present invention;

Fig. 3 is the HTML tag tree corresponding to the HTML document in Table 3 in an embodiment of the present invention;

Fig. 4 is the HTML tag tree corresponding to the HTML document in Table 4 in an embodiment of the present invention;

Fig. 5 is the HTML tag tree corresponding to the HTML document in Table 5 in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the process by which the bilingual term extraction module analyzes a webpage and extracts bilingual entries from it in an embodiment of the present invention;

Fig. 7 is a schematic diagram of the process, within the process shown in Fig. 6, of analyzing a webpage using the bilingual terms and the webpage structure and extracting the bilingual entries in it;

Fig. 8 is a schematic diagram of the extraction of bilingual terms in the same-subtree parallel structure in an embodiment of the present invention;

Fig. 9 is a schematic diagram of the extraction of bilingual terms in the different-subtree parallel structure in an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of the apparatus for extracting bilingual terms using webpage structure in an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention use a bilingual glossary as the initial seed word list, send the seed words in it to a search engine, and fetch and save the webpages the search engine returns. Then, for the candidate webpages retrieved for a given seed word pair, the format in which the seed word pair appears in the webpage is extracted, and other bilingual word pairs with the same format as the seed word pair are extracted from the webpage. The other bilingual word pairs extracted from the webpages are added to the seed word list as new seed word pairs, and further new bilingual word pairs are extracted from the Internet, forming an iterative extraction.

An embodiment of the present invention provides a method for extracting bilingual terms using webpage structure which, as shown in Fig. 1, comprises:

Step 101: search for related webpages in a search engine according to preset seed terms, and save them;

Step 102: extract the format in which the seed terms appear in the webpages, and extract from the webpages other bilingual terms that have the same format as the seed terms.

In embodiments of the present invention, the system that extracts bilingual terms from webpages consists mainly of two functional modules: a corpus collection module and a bilingual term extraction module. The corpus collection module collects the corpus needed for bilingual term extraction, working from preset seed terms. The bilingual term extraction module analyzes the corpus collected by the corpus collection module and extracts the bilingual terms in it. The corpus collection module then takes the extracted bilingual terms as new seed terms and collects further corpus according to them, so that bilingual terms can be obtained iteratively.
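The interplay of the two modules can be sketched as a simple loop. This is only an illustration under stated assumptions: `collect` and `extract` are hypothetical callables standing in for the corpus collection module and the bilingual term extraction module, and the `max_rounds` stop condition is an assumption, since the text does not say when iteration ends.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source-language term, target-language term)


def bootstrap(
    seeds: Iterable[Pair],
    collect: Callable[[Pair], Iterable[str]],        # seed pair -> saved pages
    extract: Callable[[Pair, str], Iterable[Pair]],  # seed pair, page -> pairs
    max_rounds: int = 3,                             # assumed stop condition
) -> Set[Pair]:
    """Iteratively grow a set of bilingual pairs from an initial seed list."""
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        if not frontier:
            break
        found = set()
        for seed in frontier:
            for page in collect(seed):
                found.update(extract(seed, page))
        # Only pairs not seen before become seeds for the next round.
        frontier = found - known
        known |= frontier
    return known
```

Keeping a separate `frontier` means each pair is searched for at most once, which avoids resubmitting the same seed to the search engine in later rounds.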

The functions of the corpus collection module and the bilingual term extraction module in the method provided by the embodiments are described below, taking the extraction of English-Chinese and Chinese-English bilingual terms as an example. The bilingual term extraction technique provided by the embodiments is, of course, equally applicable to extracting bilingual terms in other language pairs.

The corpus collection module is introduced first.

To extract bilingual terms from the Internet, webpages that may contain bilingual terms must first be collected from the Internet. The corpus collection module collects such webpages by searching for seed words in a search engine. Specifically, as shown in Fig. 2, this comprises the following steps:

Step 201: obtain a number of bilingual word pairs as the seed word list for corpus collection.

In the embodiment, a seed word list is built in advance according to actual needs. The seed word list contains several seed word pairs chosen by the user, as shown in Table 1:

Table 1

Each row corresponds to one English-Chinese entry pair; for example, "Base rate" corresponds to "benchmark interest rate". Each row contains one English term (such as Affiliated company or Base rate) and one or more corresponding Chinese terms (such as "affiliated company; associated company"). The source-language term and the target-language terms are separated by the character "|", and multiple target-language terms are separated by the character ";".
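A row in this format could be parsed as in the following sketch. The function name is hypothetical, and the Chinese strings in the usage example are made-up sample values, since the contents of Table 1 are not reproduced here.

```python
def parse_seed_line(line):
    """Split one seed-list row into (source term, [target terms]).

    Assumes the row format described above: "source|target1;target2".
    """
    source, sep, targets = line.strip().partition("|")
    if not sep or not source.strip():
        raise ValueError("expected 'source|target[;target...]': %r" % line)
    target_terms = [t.strip() for t in targets.split(";") if t.strip()]
    return source.strip(), target_terms
```

For example, `parse_seed_line("Base rate|基准利率")` yields `("Base rate", ["基准利率"])`, and a row with two targets separated by ";" yields a two-element list.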

Step 202: the corpus collection module searches for the seed terms in the seed word list using a search engine, and obtains the document links of the webpages found.

Specifically, the corpus collection module constructs a search engine query from each seed term in the seed word list, sends it to the search engine, downloads the search results page, and extracts and stores the document links from the search results page, obtaining a document link set.

Step 203: deduplicate the document link set and store it.

Because a search engine may return the same search results for different seed words, the corpus collection module may obtain duplicate document links in step 202; the duplicate document links therefore need to be removed.
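Deduplication of the link set can be sketched in a few lines. This assumes that two links are duplicates only when their strings are identical; the text does not say whether any URL normalization is applied, and the function name is hypothetical.

```python
def dedupe_links(links):
    """Remove repeated links while keeping first-seen order.

    Assumption: duplicates are exact string matches (no URL normalization).
    """
    return list(dict.fromkeys(links))
```

Preserving order keeps the search engine's ranking of the first occurrence, which may matter if only the top few documents per seed are downloaded.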

Step 204: download the documents according to the extracted document links and save them as local HTML files.

Step 205: save the correspondence between seed words and HTML files in a record file, which serves as the input of the bilingual term extraction module.

For several of the seed words in Table 1, after the corpus collection program has processed them and downloaded the related webpages, the resulting record file contains the information shown in Table 2:

Table 2

In the record file, each row corresponds to one pair of bilingual seed words and the local path of an HTML file found by searching for that pair. In Table 2 the HTML file name is used as an example of this path information. The bilingual seed words have the same format as in the initial seed word file, and the seed words and the corresponding HTML file name are separated by the character "|". Note in particular that one seed word pair may have several corresponding HTML files; for example, the seed word pair "Alternative investment|alternative investment" in Table 2 has two corresponding HTML files, because a search engine may return several search results for one seed word pair. Those of ordinary skill in the art will understand that Table 2 uses HTML file names only as an example of the correspondence between bilingual seed words and the local storage paths of the corresponding HTML files; the correspondence between bilingual seed words and HTML files may also be established in other ways.

The bilingual term extraction module is introduced next.

After the corpus collection module has used the search engine to collect to local storage one or more webpages associated with the seed words, the bilingual term extraction module analyzes these webpages, extracts bilingual entries from them, and builds a bilingual dictionary.

As shown in Fig. 6, in the embodiment the process by which the bilingual term extraction module analyzes webpages and extracts bilingual entries from them comprises:

Step 601: the bilingual term extraction module is initialized.

Specifically, initialization includes loading the traditional-simplified Chinese conversion table, the full-width to half-width conversion table, the Greek alphabet, the table of display-related HTML tags, the English part-of-speech table, and so on.

Step 602: process the record file generated by the corpus collection module line by line. From each line of the record file, extract the Chinese entry (where there are several Chinese entries, only the entry before the first semicolon is taken), the English entry, and the local relative path of the webpage file in which the bilingual term appears.

Step 603: obtain the webpage file according to the local relative path of the webpage file in which the bilingual term appears, analyze the webpage using the bilingual term and the webpage structure, and extract the bilingual entries in it.

Step 604: output the bilingual terms and their position information in the webpage.

Specifically, the process in step 603 of analyzing the webpage using the bilingual term and the webpage structure and extracting the bilingual entries in it is shown in Fig. 7 and comprises:

Step 701: preprocess the HTML text and the seed words.

Specifically, the preprocessing applies the following to the HTML text and the seed words, using the tables loaded in step 601:

1) Replace HTML escape sequences with the corresponding characters. For example, the HTML escape sequences "&lt;" and "&nbsp;" represent special symbols in HTML; when extracting bilingual terms, "&lt;" and "&nbsp;" need to be replaced with the corresponding characters "<" and " ";

2) Full-width to half-width conversion (for example, "ａ" is replaced with "a", "１" is replaced with "1", and so on);

3) Handle non-Chinese double-byte characters (replace them with a special symbol);

4) Convert traditional Chinese characters to simplified Chinese characters;

5) Collapse consecutive spaces;

6) Delete useless HTML tags. These tags usually determine only how text is displayed, such as <b> and <small>. Some of the useless HTML tags that can be deleted are as follows:

b	i	u	s	strike
strong	em	big	small	font
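Several of these preprocessing steps can be sketched with the Python standard library. This is a partial illustration only: steps 3) and 4) are omitted because they depend on conversion tables not given in the text, all function names are hypothetical, and the sketch strips display tags before unescaping (a deliberate deviation from the listed order, so that an escaped "&lt;b&gt;" in the text is not mistaken for a real tag).

```python
import html
import re

# Display-only tags listed in the text above.
DISPLAY_TAGS = ("b", "i", "u", "s", "strike", "strong", "em", "big", "small", "font")
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(DISPLAY_TAGS), re.IGNORECASE)


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(markup):
    """Apply steps 6), 1), 2) and 5) of the preprocessing, in that order."""
    markup = _TAG_RE.sub("", markup)        # 6) drop display tags, keep their text
    markup = html.unescape(markup)          # 1) &lt; -> "<", &nbsp; -> U+00A0
    markup = markup.replace("\u00a0", " ")  #    treat a no-break space as a space
    markup = to_half_width(markup)          # 2) full-width -> half-width
    return re.sub(r" {2,}", " ", markup)    # 5) collapse runs of spaces
```

The `\b` in the tag pattern keeps structural tags such as `<br>` or `<body>` from being matched by the shorter names `b` or `s`.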

Step 702: build the tag tree of the HTML file in memory.

In the embodiment, a saved HTML file can be converted into a corresponding tree structure by parsing, in which the <HTML> tag corresponds to the root node of the tree, and other tags and text are arranged as child nodes of the tree according to their nesting relationship in the HTML file.
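A tag tree of this kind can be built with Python's standard `html.parser` module, as in the sketch below. The `Node` and `TreeBuilder` classes are hypothetical; the sketch wraps the document in a synthetic `#document` node whose first child is the `html` element, and it makes no attempt to repair badly nested markup (for example, unclosed `<p>` tags will nest).

```python
from html.parser import HTMLParser

# Elements that never have children; a small assumed subset.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, text="", parent=None):
        self.tag = tag          # tag name, "#text", or "#document"
        self.text = text        # text content for "#text" nodes
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", data.strip(), self.current))


def build_tag_tree(markup):
    """Parse HTML text and return the synthetic root of the tag tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

Text becomes `#text` leaf nodes, so a bilingual entry such as an English term followed by its Chinese translation inside one `<p>` ends up in a single text leaf under that `p` node.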

In the embodiment, HTML files containing bilingual entries mainly exhibit two types of tree structure: the parallel structure of different subtrees and the parallel structure of the same subtree. Webpage corpora of these two types are well formatted and suitable for automatic analysis by a program, and the bilingual entries obtained are also relatively accurate.

The parallel structure of different subtrees means that the bilingual entries are distributed across different, parallel subtrees of the HTML tag tree. In this structure, the subtree containing the group of nodes corresponding to the seed words (the node containing the English entry and the node containing the Chinese entry are collectively called seed nodes) and the subtrees containing the other bilingual terms in the webpage are mutually independent subtrees with the same structure.

Table 3 shows an example of the different-subtree parallel structure:

Table 3

Fig. 3 is the HTML tag tree corresponding to the HTML document in Table 3. In this example, each bilingual entry pair lies in a single text leaf node of the HTML tree; the <font> tag marks the font, and the HTML tag <p> separates the paragraphs.

Table 4 shows another example of the different-subtree parallel structure:

Table 4

Fig. 4 is the HTML tag tree corresponding to the HTML document in Table 4, in which each bilingual entry pair is spread across different text leaf nodes of the HTML tree and organized as a table.

These two examples have one thing in common: the bilingual terms in the webpage are distributed across different subtrees of the HTML tag tree. Such a structure is called the parallel structure of different subtrees.

The parallel structure of the same subtree is introduced next. Unlike the parallel structure of different subtrees, the parallel structure of the same subtree means that the bilingual entries are distributed within the same subtree of the HTML tag tree. In this structure, the subtree containing the group of nodes corresponding to the seed words and the subtree containing the other bilingual terms in the webpage are the same subtree. Table 5 is an example of how a bilingual term list with the same-subtree parallel structure is represented in an HTML document:

Table 5

Fig. 5 is the HTML tag tree corresponding to the HTML document in Table 5. As this example shows, in the same-subtree parallel structure each bilingual entry pair lies in a single text leaf node of the HTML tree, and the other bilingual entry pairs lie in different text nodes under the same subtree. Such a structure is the parallel structure of the same subtree.

Step 703: obtain the encoding of the document from the charset attribute in the HTML (if this attribute is absent, GBK encoding is assumed by default), and convert the encoding of the HTML document uniformly to GBK.

Step 704: traverse the HTML tag tree, find the nodes containing the seed terms, and store them in the corresponding seed term node arrays. If no node containing a seed term is found in the HTML tag tree, extraction fails and the next webpage is processed.

Because a seed term may appear several times in an HTML file (for example, the English and Chinese seed terms may each appear several times in the HTML tag tree, but not together), the nodes containing the English seed term and the nodes containing the Chinese seed term need to be stored in separate arrays. A node containing an English or Chinese seed term is called a seed node. The arrays also record the position of each seed node in the tag tree. Specifically, with reference to Fig. 5, a distinct position coordinate can be assigned to the root node and to each other node of the HTML tag tree, so that the position of a seed node in the HTML tag tree can be recorded by recording its coordinate. Alternatively, the position of a seed node in the HTML tag tree can be recorded by recording the name of each node on the path from the root node to the seed node. The embodiment places no limitation on how the position of a seed node in the tag tree is recorded.

Step 705: for each English seed node in the English seed node array, find the node in the Chinese seed node array whose position in the HTML tag tree is nearest to it, form a seed node pair, and store the pair in the seed word node pair array.

In the embodiment, because the same seed term may appear several times in an HTML file, the Chinese seed term nearest in position to the English seed term in the HTML tag tree is preferably taken as its corresponding translation. The Chinese node nearest in position to an English seed node in the HTML tag tree is therefore paired with that English seed node to form a seed node pair, which is used as the material for extracting bilingual terms. When several Chinese seed terms are equally near to the English seed term in the HTML tag tree, the English seed node is paired with each of these nearest Chinese nodes to form seed node pairs. The seed node pairs are stored in the seed word node pair array.
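The pairing in step 705 can be sketched as follows. The text does not define how "nearest" is measured, so this sketch makes two assumptions: each node position is recorded as a coordinate tuple of child indices on the path from the root, and distance is the number of tree edges between the two nodes. The function names are hypothetical.

```python
def tree_distance(a, b):
    """Number of edges between two nodes given as root-to-node index paths."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def pair_seed_nodes(english_nodes, chinese_nodes):
    """Pair each English seed node with its nearest Chinese seed node(s).

    Ties are all kept, following the text: when several Chinese nodes are
    equally near, the English node is paired with each of them.
    """
    pairs = []
    if not chinese_nodes:
        return pairs
    for en in english_nodes:
        best = min(tree_distance(en, zh) for zh in chinese_nodes)
        pairs.extend((en, zh) for zh in chinese_nodes
                     if tree_distance(en, zh) == best)
    return pairs
```

When the English and Chinese seed terms sit in the same text node, both coordinates are equal and the distance is 0, which is the same-node case treated separately in the steps that follow.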

Step 706: process each seed word node pair in the seed word node pair array in turn and extract bilingual terms.

Specifically, extracting bilingual terms from each seed word node pair in the seed word node pair array comprises:

1) Extract the nearest common parent node of the English seed node and the Chinese seed node.

If both point to the same node, that node is taken as the nearest common parent node. If extracting the nearest common parent node fails, return. If the nearest common parent node is an anchor tag (i.e. <a>), return.

2) Extract the positions at which the English (Chinese) seed term appears in the English (Chinese) seed node. Because a seed term may appear several times, an array (called the position array) is used to store the positions of the English (Chinese) seed word in the subtree rooted at the English (Chinese) seed node.

3) Because the English and Chinese seed words may appear several times in a seed word node, the English and Chinese seed word position arrays obtained in the previous step need to be paired before further processing, yielding an array of position pairs (each comprising an English seed word position and a Chinese seed word position). If the English and Chinese seed word nodes are the same node, then for each position in the English seed word position array, the nearest position in the Chinese seed word position array is found and the two form a position pair. If the English and Chinese seed word nodes are different, then to guarantee a one-to-one correspondence between English and Chinese, both the English and the Chinese position arrays are required to have size 1, and one position pair is constructed from their elements;

4) If the English and Chinese seed word nodes are the same node, process bilingual parallel entries of both types in turn: the same-subtree parallel structure and the different-subtree parallel structure. If the English and Chinese seed word nodes are different, extract only bilingual parallel entries of the different-subtree parallel structure type.
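Item 1) of the list above can be sketched with parent pointers. The `Node` class and function names are hypothetical; returning `None` stands in for the "return" cases in the text (no common parent found, or the common parent is an `<a>` tag).

```python
class Node:
    """Hypothetical tag-tree node with a parent pointer."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def ancestors_inclusive(node):
    """Return the node followed by its ancestors, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def nearest_common_parent(a, b):
    """Nearest node that is an ancestor-or-self of both a and b.

    If a and b are the same node, that node itself is returned, as in the
    text. Returns None when there is no common node or when the common
    node is an anchor tag.
    """
    seen = {id(n) for n in ancestors_inclusive(a)}
    for n in ancestors_inclusive(b):
        if id(n) in seen:
            return None if n.tag == "a" else n
    return None
```

For two `td` cells in the same `tr`, the result under these assumptions is the `tr` node, which is the node whose parallel siblings are searched in the different-subtree case.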

Introduce the extraction of the identical subtree parallel construction and the bilingual term of different subtree parallel construction types below respectively.

The extraction of identical subtree parallel construction bilingual term is input as the nearest father node of kind of child node and the position of seed entry in the node subtree to array, is output as the bilingual term that extracts from html file.As shown in Figure 8, may further comprise the steps:

Step 801: for each position pair in the position-pair array, check in turn whether the preset conditions are satisfied: whether the two positions are identical (the same-subtree parallel structure requires the English and Chinese entries to lie in the same text node); whether the length is less than 5 times the total length of the seed words (it must not be too long); and whether the English-Chinese or Chinese-English pattern is matched. If no position pair satisfies the above conditions, return and repeat step 801;

Step 802: compare the nearest parent node with the same-subtree parallel structure nodes already processed; if this node has already been processed, return to step 801 to avoid repeated extraction;

Step 803: find the other nodes parallel to the nearest parent node. Here, two nodes are parallel if the nodes on the paths from the root node to each of the two nodes have the same tag names;

Step 804: process the nearest parent node and its parallel nodes in the web page in turn.

Specifically, this processing comprises:

A. Compare this node with the same-subtree parallel structure nodes already processed; if this node has already been processed, return, to avoid repeated extraction;

B. Obtain the child list of this node; if the number of children is less than 16, return (the list must not be too small);

C. Traverse the child list and, for each node in the list:

D. If the node is a tag node, move on to the next child node (that is, only text nodes are processed);

E. The length of the text node must be less than 6 times the total length of the seed words (it must not be too long); otherwise the invalid-node count is incremented by 1 and the next child node is processed;

F. Determine whether the text node matches the English-Chinese or Chinese-English pattern; if so, extract its English part and its Chinese part;

G. Check the extracted English part and Chinese part: if the English part contains no English characters, the Chinese part contains no Chinese characters, or either contains a URL or an e-mail address, the invalid-node count is incremented by 1 and the next child node is processed; otherwise the part-of-speech labels are removed from the English part, and if the English part is then empty, the invalid-node count is incremented by 1 and the next child node is processed;

H. Save the English and Chinese parts as an extracted pair of candidate bilingual terms, together with the position at which the pair occurs in the web page; the valid-node count is incremented by 1 and the next child node is processed;

I. After all child nodes have been processed, if the valid-node count is less than 8, or the invalid-node count is greater than the valid-node count, all candidate bilingual terms collected under this node are cleared and the list is judged not to be a valid list;

J. Add this node to the list of processed nodes to prevent repeated extraction.

Step 805: return the extracted bilingual terms.
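Sub-steps B to I of step 804 can be sketched as below. This is only an illustration under stated assumptions: the `Node` class, the function names, the regular expressions for the English-Chinese and Chinese-English patterns, the part-of-speech label list and the URL/e-mail test are all hypothetical, since the description does not define them. Counting a text node that matches neither pattern as invalid is also an assumption. Only the thresholds (16 children, 6 times the seed length, 8 valid nodes) come from the text.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: Optional[str] = None          # None marks a text node
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Assumed patterns: an English run followed by a Chinese run, or the reverse.
EN_ZH = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 .'\-]*?)\s*[:\uff1a]?\s*([\u4e00-\u9fff]+)\s*$")
ZH_EN = re.compile(r"^\s*([\u4e00-\u9fff]+)\s*[:\uff1a]?\s*([A-Za-z][A-Za-z0-9 .'\-]*?)\s*$")
POS_LABEL = re.compile(r"\b(?:n|v|vt|vi|adj|adv|prep|pron|conj)\.")  # assumed label set
URL_OR_MAIL = re.compile(r"https?://|www\.|@")                      # assumed test


def split_pair(text: str) -> Optional[Tuple[str, str]]:
    """Return (english, chinese) if the text matches either pattern."""
    m = EN_ZH.match(text)
    if m:
        return m.group(1), m.group(2)
    m = ZH_EN.match(text)
    if m:
        return m.group(2), m.group(1)
    return None


def extract_same_subtree(node: Node, seed_total_len: int) -> List[Tuple[str, str, int]]:
    """Collect (english, chinese, child index) triples from one list-like node."""
    if len(node.children) < 16:                        # B: list too small
        return []
    pairs, valid, invalid = [], 0, 0
    for position, child in enumerate(node.children):   # C
        if child.tag is not None:                      # D: skip tag nodes
            continue
        if len(child.text) >= 6 * seed_total_len:      # E: too long
            invalid += 1
            continue
        parts = split_pair(child.text)                 # F
        if parts is None:                              # assumption: counts as invalid
            invalid += 1
            continue
        english, chinese = parts
        if (not re.search(r"[A-Za-z]", english)        # G: content checks
                or not re.search(r"[\u4e00-\u9fff]", chinese)
                or URL_OR_MAIL.search(child.text)):
            invalid += 1
            continue
        english = POS_LABEL.sub("", english).strip()   # G: drop part-of-speech labels
        if not english:
            invalid += 1
            continue
        pairs.append((english, chinese, position))     # H
        valid += 1
    if valid < 8 or invalid > valid:                   # I: not a valid list
        return []
    return pairs
```

In this sketch a child such as "computer n. 计算机" yields the pair ("computer", "计算机"), while a child consisting only of a part-of-speech label and a Chinese word is counted as invalid.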

For different-subtree parallel structure extraction, the input is the nearest parent node of the seed nodes together with the array of position pairs of the seed entries within that node's subtree, and the output is the bilingual terms extracted from the HTML file. Figure 9 shows the detailed process of different-subtree parallel structure extraction, which comprises:

Step 901: for each position pair in the position-pair array, determine whether the English and Chinese seed-word positions are identical:

A) If they are identical, the English and Chinese seed words lie in the same text node; check whether the length is less than 5 times the total length of the seed words (it must not be too long) and whether the English-Chinese or Chinese-English pattern is matched;

B) If they are different, the English and Chinese seed words lie in different text nodes; check separately whether each text is too long (greater than 5 times the length of the English or Chinese seed word, respectively) and whether it matches the English or the Chinese pattern, respectively;

C) If no position pair satisfies the above conditions, return an empty list;

Step 902: compare the nearest parent node with the different-subtree parallel structure nodes already processed; if a node of the same type exists, return (to avoid repeated extraction). Here, two nodes are of the same type if the nodes on the paths from the root node to each of the two nodes have the same tag names and the two nodes have the same subtree structure (when subtree nodes are compared, only the tag names are required to agree; attribute values and the content of text nodes are ignored);

Step 903: find the other nodes of the same type as the nearest parent node, where "same type" has the meaning defined in step 902;

Step 904: if the number of same-type nodes is less than 8, return;

Step 905: process the nearest parent node and the same-type nodes in the web page in turn. Two cases are distinguished here: the English and Chinese seed words lie in the same text node, or the English and Chinese seed words lie in different text nodes.

When the English and Chinese seed words lie in the same text node, processing the nearest parent node and the same-type nodes in the web page in turn comprises:

A. According to the relative path of the English and Chinese seed words within the nearest parent node, extract the node at that path in the subtree rooted at the current node;

B. Determine whether this node matches the English-Chinese or Chinese-English pattern; the length of this node must also be less than 6 times the total length of the seed words (it must not be too long). Otherwise the invalid-node count is incremented by 1 and the next same-type node is processed;

C. Extract the English part and the Chinese part;

D. Check the extracted English part and Chinese part: if the English part contains no English characters, the Chinese part contains no Chinese characters, or either contains a URL or an e-mail address, the invalid-node count is incremented by 1 and the next same-type node is processed;

E. Remove the part-of-speech labels from the English part; if the English part is then empty, the invalid-node count is incremented by 1 and the next same-type node is processed;

F. Save the English and Chinese parts as an extracted pair of candidate bilingual terms, together with the position at which the pair occurs in the web page; the valid-node count is incremented by 1 and the next same-type node is processed.

When the English and Chinese seed words lie in different text nodes, processing the nearest parent node and the same-type nodes in the web page in turn comprises:

G. According to the relative paths of the English and Chinese seed words within the nearest parent node, extract the nodes at those paths (an English node and a Chinese node) in the subtree rooted at the current node;

H. The length of the English (Chinese) text node must be less than 6 times the total length of the English (Chinese) seed word (it must not be too long); otherwise the invalid-node count is incremented by 1 and the next same-type node is processed;

I. Check the English node and the Chinese node: if the English part contains no English characters, the Chinese part contains no Chinese characters, or either contains a URL or an e-mail address, the invalid-node count is incremented by 1 and the next same-type node is processed;

J. Remove the part-of-speech labels from the English part; if the English part is then empty, the invalid-node count is incremented by 1 and the next same-type node is processed;

K. Save the English and Chinese nodes as an extracted pair of candidate bilingual terms, together with the position at which the pair occurs in the web page; the valid-node count is incremented by 1 and the next same-type node is processed.

Step 906: after all same-type nodes have been processed, if the valid-node count is less than 8, or the invalid-node count is greater than the valid-node count, all candidate bilingual terms collected for the different-subtree parallel structure are cleared;

Step 907: add this node to the list of processed nodes to prevent repeated extraction;

Step 908: return the extracted bilingual terms.
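The "same type" test used in steps 902 to 904 can be sketched as below. This is a minimal illustration under assumptions: the `Node` class and the function names `shape` and `same_type_nodes` are hypothetical, and representing every text node by a fixed marker is one way of ignoring text content while still comparing structure. Attributes are not modelled at all, which matches the rule that attribute values are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    tag: Optional[str] = None          # None marks a text node
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def shape(node: Node) -> tuple:
    """Subtree structure as nested tag names; text content is ignored."""
    if node.tag is None:
        return ("#text",)
    return (node.tag, tuple(shape(child) for child in node.children))


def same_type_nodes(root: Node, target: Node) -> List[Node]:
    """Other nodes with the same root-to-node tag path and subtree structure."""
    groups: Dict[Tuple[tuple, tuple], List[Node]] = {}
    target_key = None
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if node.tag is None:
            continue
        path = path + (node.tag,)
        key = (path, shape(node))
        groups.setdefault(key, []).append(node)
        if node is target:
            target_key = key
        # Push children in reverse so nodes are visited in document order.
        for child in reversed(node.children):
            stack.append((child, path))
    if target_key is None:
        return []
    return [n for n in groups[target_key] if n is not target]
```

Under step 904, a caller would stop when `len(same_type_nodes(root, nearest_parent)) < 8`. A typical case is a table of `tr` rows that each hold an English cell and a Chinese cell: the rows share both the tag path and the subtree structure, whereas a header row with a different cell layout does not.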

In the automatic extraction process provided by the embodiment of the invention, the word pairs to be extracted are identified well by analysing the web page structure and by pattern matching. However, because the embodiment adopts a loose matching strategy with no grammatical or semantic analysis, a certain number of pseudo-matches occur that are not genuine word pairs.

The entries extracted from web pages therefore need to be checked, and cases that are not word pairs are excluded according to certain rules (length and content). Although some word pairs are not precisely aligned, they are very convenient for manual proofreading, and a small amount of work yields final, usable, high-quality word pairs.

With the method provided by the invention, related web pages are searched for in a search engine using a preset seed entry and saved; the format in which the seed entry occurs in the retrieved web pages is then extracted, and other bilingual terms having the same format as the seed entry are extracted from those web pages, thereby improving the efficiency of extracting bilingual terms from web pages.

The embodiment of the invention also provides a device for extracting bilingual terms using web page structure which, as shown in Figure 10, comprises:

A web page search unit 10, configured to search for related web pages in a search engine according to a preset seed entry and to save them;

An entry extraction unit 20, configured to extract the format in which the seed entry occurs in the web page, and to extract from the web page other bilingual terms having the same format as the seed entry.

The web page search unit 10 is specifically configured to:

The web page search unit 10 is further configured to:

The entry extraction unit 20 is specifically configured to:

Build a corresponding tag tree from the web page;

The entry extraction unit 20 is specifically further configured to:

The entry extraction unit 20 is further configured to:

When the node is a non-text node, process the next child node;

In the embodiment of the invention, preferably, the entry extraction unit 20 is specifically configured to:

Pre-process the HTML text and the seed words, including replacement of HTML escape characters, conversion of full-width characters, conversion of traditional Chinese characters, conversion of the web page encoding, and so on;

Build the tag tree of the HTML file in memory; traverse the HTML tag tree to obtain the nodes containing the English (Chinese) seed entry and store them in the corresponding English (Chinese) node array; construct the seed-node pair array;

Process each seed-word node pair in the array in turn (each pair comprising one English seed node and one Chinese seed node), extracting bilingual parallel entries of the two types: same-subtree parallel structure and different-subtree parallel structure;

Output the bilingual entries together with their position information in the web page.
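The pre-processing step listed above can be sketched as follows. This is an illustrative sketch, not the embodiment's code: the function names, the default source encoding `gb18030` and the optional `trad_to_simp` character table are assumptions. The description does not say how traditional characters are converted, so the sketch only applies a caller-supplied mapping and does nothing when none is given.

```python
import html
from typing import Mapping, Optional


def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                     # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:         # full-width '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(raw: bytes,
               encoding: str = "gb18030",
               trad_to_simp: Optional[Mapping[str, str]] = None) -> str:
    """Decode a page, replace HTML escapes, and normalise character forms."""
    text = raw.decode(encoding, errors="replace")   # web page encoding conversion
    text = html.unescape(text)                      # HTML escape replacement
    text = to_halfwidth(text)                       # full-width conversion
    if trad_to_simp:                                # traditional-character conversion
        text = "".join(trad_to_simp.get(ch, ch) for ch in text)
    return text
```

Applying the same normalisation to the seed words and to the page text keeps the later string matching consistent, which is presumably why the description pre-processes both.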

With the device provided by the invention, related web pages are searched for in a search engine using a preset seed entry and saved; the format in which the seed entry occurs in the retrieved web pages is then extracted, and other bilingual terms having the same format as the seed entry are extracted from those web pages, thereby improving the efficiency of extracting bilingual terms from web pages.

From the above description of the embodiments, those skilled in the art will clearly understand that the invention can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. This computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the invention.

Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required to implement the invention.

Those skilled in the art will appreciate that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the present embodiment. The modules of the above embodiments may be combined into one module or further split into a plurality of sub-modules.

The sequence numbers of the above embodiments of the invention are for description only and do not indicate the relative merits of the embodiments.

The above discloses only several specific embodiments of the invention; the invention is not, however, limited to them, and any variation that those skilled in the art can conceive shall fall within the scope of protection of the invention.

Claims

1. A method for extracting bilingual terms using web page structure, characterized by comprising:

2. The method of claim 1, characterized in that searching for related web pages in a search engine according to the preset seed entry and saving them comprises:

3. The method of claim 2, characterized in that,

After the saving as a local HTML file, the method further comprises:

4. The method of claim 1, characterized in that extracting the format in which the seed entry occurs in the web page, and extracting from the web page other bilingual terms having the same format as the seed entry, comprises:

Building a corresponding tag tree from the web page;

5. The method of claim 4, characterized in that building the corresponding tag tree from the web page comprises:

6. The method of claim 4, characterized in that constructing the seed-node pair array comprises:

7. The method of claim 4, characterized in that, before searching for other nodes parallel to the nearest common parent node, the method further comprises:

8. The method of claim 4, characterized in that traversing each node in the child list and extracting and storing the bilingual terms therein comprises:

When the node is a non-text node, processing the next child node;

9. A device for extracting bilingual terms using web page structure, characterized by comprising:

10. The device of claim 9, characterized in that the web page search unit is specifically configured to:

11. The device of claim 10, characterized in that the web page search unit is further configured to:

12. The device of claim 9, characterized in that the entry extraction unit is specifically configured to:

Build a corresponding tag tree from the web page;

13. The device of claim 12, characterized in that the entry extraction unit is specifically configured to:

14. The device of claim 12, characterized in that the entry extraction unit is further configured to:

15. The device of claim 12, characterized in that the entry extraction unit is further configured to:

16. The device of claim 12, characterized in that the entry extraction unit is further configured to:

When the node is a non-text node, process the next child node;