CN101441648B

CN101441648B - Method and system for extracting text based on web page characteristics

Info

Publication number: CN101441648B
Application number: CN2008101770713A
Authority: CN
Inventors: 李允炫; 金圭一; 朴振洙
Original assignee: NHN Corp
Current assignee: NHN Corp
Priority date: 2007-11-21
Filing date: 2008-11-19
Publication date: 2011-12-14
Anticipated expiration: 2028-11-19
Also published as: KR20090052757A; JP4907635B2; CN101441648A; JP2009129456A; KR100958934B1

Abstract

The invention provides a method and a system for extracting text based on the characteristics of a web page, and a computer-readable recording medium for executing the method. In particular, one embodiment of the invention relates to a text extraction method comprising: recognizing a text indication point on a web page; confirming stored information on a text extraction range that corresponds to at least a part of an identifier of the web page; and determining the text extraction range based on the indication point information and the confirmed information on the text extraction range.

Description

Method and system for extracting text based on web page characteristics

Technical field

The present invention relates to a method and system for extracting text based on web page characteristics. More particularly, it relates to a method, a system and a computer-readable recording medium that, when text extracted from a web page is supplied to a text-based service such as speech conversion or translation, can extract text of different ranges, such as a word, a sentence, a paragraph or the full text, according to the characteristics of the web page.

Background technology

In recent years, as Internet use has become widespread, a wide variety of information has become available through the Internet. To satisfy the differing needs of users, companies that provide Internet services offer diverse services through their websites, and the number of service types keeps growing.

Internet users access the services these companies provide in various forms, and in particular obtain diverse Internet content through websites, such as news information, dictionary information, specialized information, domain information and shopping information.

To obtain the content they need, these users search for it, reach a particular web page, and then generally read with their own eyes content that consists mainly of text. From the user's standpoint, however, content offered mainly as text no longer satisfies the demands of the multimedia era. In practice, as the amount of information contained in web pages keeps growing, a user who reads all of the text content by eye cannot look away from a display device such as a computer monitor. There are also multitasking users who want to obtain the information they need from content while doing other work, and their needs are likewise not being met.

In addition, in recent years, CTI (Computer Telephony Integration) technologies such as VoIP (Voice over IP), speech recognition, speech conversion, speech synthesis and automatic response systems have attracted much attention, and these technologies are expected to make possible more advanced Internet services in which users issue commands by voice, obtain information by voice and communicate by voice.

Accordingly, TTS (Text To Speech) technology has been developed to solve the problems of providing text-based content and to apply CTI technology more broadly. TTS technology can be used more widely than speech recognition technology; it is a user-centered interface technology that converts various kinds of text information into speech. On a web page, TTS is mainly implemented by extracting text from the page, converting it into speech and providing the speech to the user. For example, in response to a mouse-over event that occurs when the user holds the mouse over some position on the web page for a certain time, the word at the position of the mouse pointer is extracted and converted into speech; alternatively, a portion of text on the web page that the user has dragged over is converted into speech.

However, the TTS services currently implemented and provided through web pages cannot be called a perfected user-centered interface technology. Specifically, present TTS services can only convert into speech the word at the position identified by the user's mouse-over operation, or can only have the user drag the mouse directly to specify the text to be converted into speech. In the former case, the word under the mouse pointer is converted into speech indiscriminately, regardless of the user's intention. In the latter case, to have text of the desired range converted into speech, the user must first skim the text by eye and then specify the range of text to be converted. This not only runs counter to the original purpose of TTS technology, which is to spare the user from reading the text directly as far as possible, but the specifying operation also takes extra time.

Therefore, a technique is needed that extracts text of a particular range (for example a word, a sentence, a paragraph or the full text) according to the user's intention and the characteristics of the web page, so that various text-based services can be provided and user convenience improved.

Summary of the invention

An object of the present invention is to provide a method and system that actively extract text according to the characteristics of a web page.

Another object of the present invention is to actively extract text of different ranges according to the characteristics of a web page, so that web page users can easily obtain data converted from the text.

A further object of the present invention is, when a user extracts a relatively wide range of text from a web page, to relieve the inconvenience of dragging the mouse each time by automatically extracting text of the required range according to the characteristics of the web page, thereby reducing unnecessary user operations.

To achieve the above objects, representative configurations of the present invention are as follows.

One aspect of the present invention is a method for extracting text based on web page characteristics, comprising the steps of: recognizing a text indication point on a web page; confirming information on a text extraction range stored in correspondence with at least a part of an identifier of the web page; determining a text extraction range based on the text indication point information and the confirmed information on the text extraction range; and extracting the text of the determined range.

Another aspect of the present invention is a method for extracting text based on web page characteristics, comprising the steps of: recognizing a text indication point on a web page; confirming whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in a text extraction information database; receiving the information on the text extraction range; determining a text extraction range based on the text indication point information and the received information on the text extraction range; and extracting the text of the determined range.

A further aspect of the present invention is a method for converting text into speech, which further comprises the step of generating voice data associated with the text extracted by the above method.

A system for extracting text based on web page characteristics according to the present invention comprises: a text indication point recognition unit that recognizes a text indication point on a web page; a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a part of an identifier of the web page; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the confirmed information on the text extraction range; and a text extraction unit that extracts the text of the determined range.

Another aspect of the present invention is a system for extracting text based on web page characteristics, comprising: a text extraction information database; a text indication point recognition unit that recognizes a text indication point on a web page; a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in the text extraction information database and, if it is not stored, receives the information on the text extraction range; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the received information on the text extraction range; and a text extraction unit that extracts the text of the determined range.

A further aspect of the present invention is a system for converting text into speech, comprising: a text indication point recognition unit that recognizes a text indication point on a web page; a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a part of an identifier of the web page; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the confirmed information on the text extraction range; a text extraction unit that extracts the text of the determined range; and a voice data generation unit that generates voice data associated with the extracted text.

A further aspect of the present invention is a system for converting text into speech, comprising: a text indication point recognition unit that recognizes a text indication point on a web page; a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in a text extraction information database and, if it is not stored, receives the information on the text extraction range; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the received information on the text extraction range; a text extraction unit that extracts the text of the determined range; and a voice data generation unit that generates voice data associated with the extracted text.

In addition, the present invention also provides other methods and systems for extracting text based on web page characteristics, and a computer-readable recording medium on which a computer program for executing the above methods is recorded.

According to the present invention, text is actively extracted according to web page characteristics and text-based services such as a speech conversion service or a translation service are provided on that basis, so that the user can obtain text-based data that meets the user's needs without performing many operations.

Also, according to the present invention, even when a user uses a web page without knowing its characteristics, text of a range suited to those characteristics is extracted automatically, so that the user can grasp the content presented in the web page efficiently.

In addition, according to the present invention, when a user wants to extract a wide range of text from a web page, the inconvenience of having to drag over the full text is eliminated, and text extraction errors caused by mistakes in dragging the mouse are prevented.

Description of drawings

Fig. 1 is a diagram showing the overall configuration of a text extraction system according to an embodiment of the present invention;

Fig. 2a is a diagram showing the detailed structure of the user computer in the text extraction system shown in Fig. 1;

Fig. 2b is a diagram showing the detailed structure of the TTS server in the text extraction system shown in Fig. 1;

Fig. 3 is a flowchart showing a process of extracting text and converting the extracted text into speech according to an embodiment of the present invention.

Embodiment

Below, various embodiments of the present invention are described in detail with reference to the accompanying drawings.

Configuration of the overall system:

Fig. 1 is a diagram showing the overall configuration of a text extraction system according to an embodiment of the present invention.

As shown in Fig. 1, the text extraction system according to an embodiment of the present invention may include a user computer 100 and a TTS server 300. The user computer 100 and the TTS server 300 communicate through any of various network environments, such as a local area network (LAN) or a wide area network (WAN) using a dedicated line. Such a network environment may be the well-known World Wide Web (WWW). The TTS server 300 can communicate bidirectionally with one or more user computers 100 over the Internet protocol in a known network environment. In response to a request from a user computer 100, the TTS server 300 can perform processing with reference to the latest extraction range information database 500 and the speech conversion database 700.

Configuration of the user computer:

Fig. 2a is a diagram showing the detailed structure of the user computer 100 in the text extraction system shown in Fig. 1. Fig. 2b is a diagram showing the detailed structure of the TTS server 300.

As shown in Fig. 2a, the user computer 100 may include an operation unit 110, a text extraction range information database 130, a program storage unit 150, a user input unit 170, an output unit 180 and a communication unit 190.

The operation unit 110 may include a mouse-over recognition unit 111, an extraction range information confirmation unit 112, an extraction range information request unit 113, a latest extraction range information request unit 115, an extraction mode determination unit 117, a text extraction unit 118 and a voice data providing unit 119. According to an embodiment of the present invention, at least one of these units may be a program module that is included in the operation unit 110 or that communicates with the operation unit 110. Such program modules may be included in the operation unit 110 in the form of an operating system, application program modules and other program modules, and may be physically stored in various known storage devices. Such program modules may also be stored in a remote storage device that can communicate with the operation unit 110. Such program modules include, but are not limited to, routines, subroutines, programs, objects, components and data structures that perform the particular tasks described later or implement particular abstract data types according to the present invention.

In addition, the operation unit 110 may, as needed, refer to the text extraction range information database 130, which stores, in correspondence with a web page identifier such as a URL (Uniform Resource Locator), information on the extraction range of text in the web page (for example, information on which range among a word, a sentence, a paragraph and the full text is to be extracted according to the characteristics of the web page). The text extraction range information database 130 may also be included as a component of the operation unit 110.

In addition, the operation unit 110 may further include a program driving unit (not shown) so that, when the user runs a browser, a program stored in the program storage unit 150 is run at the same time, namely a program that extracts text or provides a text-based service using the extracted text according to the present invention. The program storage unit 150 need not be included as a component of the user computer 100; a known computer-readable recording medium, such as a hard disk, floppy disk, flexible disk, magnetic tape, CD-ROM or DVD, may be used in its place.

The user input unit 170 may be an ordinary computer input device such as a keyboard or a mouse, and the output unit 180 may be implemented as a computer monitor that visually displays the browser and/or the web page, or as a speaker that outputs text as speech.

Configuration of the server:

The TTS server 300 shown in Fig. 2b may be a server that provides a TTS service, that is, a service that converts at least part of the text in a web page into speech and provides it to the user. Such a TTS server 300 may be a web server of an Internet portal site, or an operating server of a company that specializes in providing TTS services. According to another embodiment of the present invention, the TTS server 300 may be replaced by a general network server not directly related to the TTS service.

The TTS server 300 according to an embodiment of the present invention may include a latest extraction range information judging unit 310, a latest extraction range information acquisition unit 330 and a TTS conversion unit 370. According to an embodiment of the present invention, at least some of these units may be program modules that are included in the TTS server 300 or that communicate with the TTS server 300. Such program modules may be included in the TTS server 300 in the form of an operating system, application program modules and other program modules, and may be physically stored in various known storage devices. Such program modules may also be stored in a remote storage device that can communicate with the TTS server 300. Such program modules include, but are not limited to, routines, subroutines, programs, objects, components and data structures that perform the particular tasks described later or implement particular abstract data types according to the present invention.

For reference, the components shown in Fig. 1, Fig. 2a and Fig. 2b should be understood to exchange signals with one another as needed; however, the known communication mechanisms used to exchange such signals in implementing the present invention are not described in detail here.

Text extraction and speech conversion:

Fig. 3 is a flowchart showing a process of extracting text and converting the extracted text into speech according to an embodiment of the present invention. The process of extracting text from a web page, converting the extracted text into speech and outputting it, according to an embodiment of the present invention, is described in detail below with reference to Fig. 2a, Fig. 2b and Fig. 3.

When the user runs a browser on the user computer 100, a program that extracts text and converts the extracted text into speech for output according to an embodiment of the present invention is run at the same time. As described above, this program may be stored in the program storage unit 150 inside the user computer 100, or in another recording medium.

The user can then connect to the Internet and access a web page having a given URL through the running browser. Many servers provide content that can be read through a browser, and URLs are generally used to indicate its location. A URL expresses the location of a file on a server on the Internet, but because a URL can be composed fairly freely, it can also contain other information that indicates the characteristics of the web page (for example, information on the text extraction range according to an embodiment of the present invention). In any case, a URL or a part of a URL can correspond to information on the text extraction range according to the present invention.
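As a non-limiting illustration, the correspondence between a part of a URL and stored extraction range information might be sketched as follows. The table contents, the example hosts and the function name `lookup_extraction_range` are assumptions made for this sketch and do not appear in the specification; longest-prefix matching on host and path is only one possible reading of "a part of a URL".

```python
from urllib.parse import urlsplit

# Hypothetical table: URL prefix (host + path prefix) -> extraction range.
RANGE_BY_URL_PREFIX = {
    "news.example.com/": "paragraph",
    "dic.example.com/": "word",
    "dic.example.com/examples/": "sentence",
}


def lookup_extraction_range(url, table=RANGE_BY_URL_PREFIX):
    """Return the range stored for the longest matching URL prefix, or None."""
    parts = urlsplit(url)
    # Compare on host + path only; the scheme and query string are ignored.
    key = parts.netloc.lower() + (parts.path or "/")
    best = None
    for prefix, text_range in table.items():
        if key.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, text_range)
    return best[1] if best else None
```

With this table, a page under `dic.example.com/examples/` resolves to the more specific "sentence" entry rather than the host-wide "word" entry, and a URL with no matching prefix yields `None`, which corresponds to the case where no information is stored for the page.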

With reference to Fig. 3, the process according to an embodiment of the present invention of extracting text from a web page and outputting data obtained by converting it into speech is described.

First, when the user places the mouse pointer over text contained in a web page displayed by the browser of the user computer 100, the mouse-over recognition unit 111 of the operation unit 110 determines in step S310 whether a mouse-over event has occurred.
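The determination in step S310 could, for instance, be sketched as a dwell check over timestamped pointer samples. The specification says only that the pointer rests at a position for a certain time; the 0.5 second threshold, the 3 pixel tolerance and the names `PointerSample` and `detect_mouse_over` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointerSample:
    t: float  # time of the sample, in seconds
    x: int    # pointer position in page coordinates
    y: int


def detect_mouse_over(samples, dwell_seconds=0.5, tolerance_px=3):
    """Return the (x, y) where the pointer first rested for dwell_seconds, else None."""
    anchor = None
    for s in samples:
        moved = (
            anchor is None
            or abs(s.x - anchor.x) > tolerance_px
            or abs(s.y - anchor.y) > tolerance_px
        )
        if moved:
            anchor = s  # pointer moved: restart the dwell timer at this sample
        elif s.t - anchor.t >= dwell_seconds:
            return (anchor.x, anchor.y)
    return None
```

The returned position would serve as the text indication point used in the later steps; in a browser the same decision is usually delegated to the platform's own mouse-over events.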

In step S330, the extraction range information confirmation unit 112 determines whether the text extraction range information database 130 contains information on a text extraction range stored in correspondence with the URL of the current web page. As briefly mentioned above, the text extraction range information database 130 stores information on text extraction ranges in correspondence with the URLs of web pages. Such information may be stored separately for each URL, or may be stored collectively after being classified into several types of web page. This is described in more detail below.

If, in step S330, the extraction range information confirmation unit 112 determines that information on the text extraction range corresponding to the URL of the current web page is not present in the text extraction range information database 130, then in step S331 the extraction range information request unit 113 requests that information from the TTS server 300. According to an embodiment of the present invention, the TTS server 300 refers to the latest extraction range information database 500, which is periodically updated with the various information needed to provide the TTS service, namely information on the text extraction range for each URL and information on the type of the web page provided at each URL, so that it holds the latest information on text extraction ranges. When the extraction range information request unit 113 requests information on the text extraction range corresponding to the URL of the current web page, the latest extraction range information acquisition unit 330 of the TTS server 300 refers to the latest extraction range information database 500 and sends the latest information to the operation unit 110 of the user computer 100.

If, in step S330, the extraction range information confirmation unit 112 determines that information on the text extraction range corresponding to the URL is present in the text extraction range information database 130, then in step S333 the latest extraction range information request unit 115 of the operation unit 110 checks whether the information in the text extraction range information database 130 is the latest and, if it is not, sends a request to obtain the latest information from the TTS server 300. In response to the request from the latest extraction range information request unit 115, the latest extraction range information judging unit 310 of the TTS server 300 refers to the information stored in the latest extraction range information database 500 and determines whether the information in the text extraction range information database 130 is the latest. If it is, a predetermined signal is sent to the user computer 100. If it is not, the latest extraction range information acquisition unit 330 can send the latest text extraction range information stored in the latest extraction range information database 500 to the operation unit 110.

In step S340, the operation unit 110 receives the text extraction range information sent by the TTS server 300 in response to the request issued in step S331 or step S333. That is, the operation unit 110 receives information on whether, of the web page text currently displayed in the browser, only the word at the mouse-over position is to be extracted, or a sentence or paragraph, or the full text contained in the web page. Because the latest extraction range information database 500 referred to by the TTS server 300 is updated with, and stores, the latest information on the text extraction range for each URL, the information on the text extraction range received from the TTS server 300 in step S340 is always the latest.

The operation unit 110 also stores the latest information on the text extraction range received from the TTS server 300 in the text extraction range information database 130. If information on the text extraction range of the current web page is present in the text extraction range information database 130 but the TTS server 300 determines that it is not the latest, the information stored in the text extraction range information database 130 is updated with the latest information sent by the TTS server 300; otherwise this update can be omitted. When the determination in step S330 is "No", the received latest text extraction range information is newly stored in the text extraction range information database 130.
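Steps S330 to S340 and the subsequent storage amount to a local lookup with a server-side freshness check. A minimal sketch, under stated assumptions, follows: the specification does not say how "latest" is determined, so an integer `version` is assumed here, and `StubTtsServer` is a hypothetical in-memory stand-in for the TTS server 300 and the latest extraction range information database 500, not a real network client.

```python
from dataclasses import dataclass


@dataclass
class RangeInfo:
    text_range: str  # "word", "sentence", "paragraph" or "full_text"
    version: int     # assumed freshness marker (not defined in the specification)


class StubTtsServer:
    """In-memory stand-in for the TTS server 300 and database 500 (hypothetical)."""

    def __init__(self, latest):
        self.latest = dict(latest)

    def is_latest(self, url, version):
        # Role of the judging unit 310: compare against the server-side record.
        return self.latest[url].version == version

    def fetch_latest(self, url):
        # Role of the acquisition unit 330: return the server-side record.
        return self.latest[url]


def resolve_range_info(url, local_db, server):
    """Return current range info for url, refreshing local_db when needed."""
    cached = local_db.get(url)                        # S330: stored locally?
    if cached is not None and server.is_latest(url, cached.version):  # S333
        return cached                                 # already current: no update
    fresh = server.fetch_latest(url)                  # S331 or S333 request, S340 receipt
    local_db[url] = fresh                             # store new or overwrite stale entry
    return fresh
```

Here `local_db` plays the role of the text extraction range information database 130 and is an ordinary dictionary. The sketch assumes the server holds an entry for every requested URL; a missing entry raises `KeyError`.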

In step S350, based on the text extraction range information received from the TTS server 300 or on the latest text extraction range information stored in the text extraction range information database 130, the extraction mode needed to extract a word, a sentence, a paragraph or the full text is determined. Examples of text extraction modes according to the present invention are described later.

In step S360, text is extracted based on the text extraction range and text extraction mode determined in the preceding steps. At this point, the text in the extraction range can be visually distinguished from other, unextracted text, for example by displaying it in inverted colors. The user can thus see which part of the web page's text has been extracted and thereby indirectly confirm what characteristics the web page is treated as having. Furthermore, when the user considers that the text extraction range associated with the web page is incorrect, this can be reported to the TTS server 300 as user feedback.
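One way to picture steps S350 and S360 is as the selection of the text unit that contains the indication point. The sketch below is illustrative only: it treats the page as plain text with an integer character offset for the pointer, and the regular expressions and the name `extract_text` are assumptions; an actual implementation would work on the page's document structure and need language-aware segmentation.

```python
import re

# Naive patterns for each extraction range (assumed for this sketch).
PATTERNS = {
    "word": r"\w+",
    "sentence": r"[^.!?\n]+[.!?]?",
    "paragraph": r"[^\n]+(?:\n[^\n]+)*",  # paragraphs are separated by blank lines
    "full_text": r"[\s\S]+",
}


def extract_text(page_text, offset, text_range):
    """Return the unit of the given range that contains offset, or None."""
    for match in re.finditer(PATTERNS[text_range], page_text):
        if match.start() <= offset < match.end():
            return match.group().strip()
    return None
```

The same pointer position therefore yields a word, a sentence, a paragraph or the whole text depending only on the range stored for the page. The `match.span()` of the selected unit could also drive the highlighted display mentioned above.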

In step S370, the text extracted in step S360 is sent to the TTS server 300. The TTS conversion unit 370 of the TTS server 300 converts the received text into voice data with reference to the speech conversion database 700, which stores the information needed to convert text into speech, and sends the voice data back to the user computer 100. The speech conversion database 700 may store voice data for each encoded text, or for each word, sentence or paragraph.
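The two storage granularities mentioned for the speech conversion database 700 might be combined as in the following sketch, which first looks for voice data stored for the whole extracted unit and otherwise falls back to per-word entries. The specification does not define the encoding of the text or the format of the voice data; the SHA-256 key scheme, the byte-string clips and the names `text_key` and `lookup_voice_data` are all assumptions, and no actual speech synthesis is performed.

```python
import hashlib


def text_key(text):
    """Encode text as a database key (assumed scheme: SHA-256 of normalized text)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def lookup_voice_data(text, voice_db):
    """Return a list of stored clips covering text, or None if any part is missing."""
    whole = voice_db.get(text_key(text))
    if whole is not None:
        return [whole]              # a clip is stored for the whole unit
    clips = []
    for word in text.split():       # naive split; punctuation is not handled
        clip = voice_db.get(text_key(word))
        if clip is None:
            return None             # real synthesis would be needed here
        clips.append(clip)
    return clips
```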

In step S380, the user computer 100 receives the voice data sent from the TTS server 300.

In step S390, the received voice data is provided by the voice data providing unit 119 of the operation unit 110, and can be output through an output unit 180 such as a speaker.

In the embodiment described in this specification, the user computer 100 has a text extraction range information database 130 separate from the TTS server 300, and text is basically extracted on the basis of the text extraction range information stored there. However, this component may be omitted and the reference database used to decide which range of text to extract unified into the latest extraction range information database 500. It should be understood that, unlike the above description, speech conversion may also be implemented in the user computer 100, without going through the TTS server 300 and its reference to the speech conversion database 700. It should likewise be understood that, in variations of the above embodiment, the text extraction referred to in the present invention may be implemented not only in the user computer 100 but also on the TTS server 300.

Use of text extraction range information:

As will be appreciated, according to an embodiment of the present invention, text of different ranges can be extracted and used according to the characteristics of the web page. Examples of web page characteristics that serve as criteria for distinguishing text extraction ranges are described below.

Every web page that a user can access by running a browser on the user computer 100 and connecting to the Internet has its own URL, and every web page has certain characteristics. Web pages can be classified by the nature of their content into news article pages, lifestyle information pages, shopping information pages, encyclopedia pages, language dictionary pages, specialized information pages, blog pages and so on. If the content of a web page is a news article, users of the page generally care less about particular words or sentences and more about the content of the whole article or of a given paragraph. When a user uses a web page that gathers specialized information, such as the applicant's well-known knowledge service "knowledge iN", the user is interested only in the question asked and the content of the answers. Users of an encyclopedia page or language dictionary page are generally more interested in the definition of a particular word and the example sentences that illustrate it. Therefore, the range of text extracted as the basis for a text-based service should differ according to the nature or type of the content of the web page. For a web page containing a news article, for example, it is generally desirable to extract the text of the page in units of a paragraph or the full text; for a dictionary page, it is generally desirable to give priority to extracting a word and the text of the explanation associated with it.
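Where range information is stored collectively by page type rather than per URL, the association could be as simple as the following table. The type names, the chosen defaults and the fallback to a single word are assumptions made for illustration; in particular, the dictionary case is simplified to a word and does not model the accompanying explanation text.

```python
from enum import Enum


class TextRange(Enum):
    WORD = "word"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    FULL_TEXT = "full_text"


# Hypothetical page types and defaults, loosely following the examples above.
DEFAULT_RANGE_BY_PAGE_TYPE = {
    "news": TextRange.PARAGRAPH,
    "dictionary": TextRange.WORD,
    "encyclopedia": TextRange.WORD,
}


def range_for_page_type(page_type, fallback=TextRange.WORD):
    """Return the default extraction range for a page type (fallback if unknown)."""
    return DEFAULT_RANGE_BY_PAGE_TYPE.get(page_type, fallback)
```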

To this end, the text extraction range information database 130 according to one embodiment of the invention can store information on text extraction ranges that differ according to the characteristics of each web page. As described above, the text extraction range information database 130 can store the URL or the like of a web page in association with the information on its text extraction range.
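The mapping described above can be sketched as a small lookup table. This is only an illustrative sketch, not the patent's implementation: the example URLs, the `ExtractionRange` names, and the longest-prefix matching rule are assumptions introduced here (the text only states that a URL or the like is stored in association with a text extraction range).

```python
from enum import Enum
from urllib.parse import urlsplit


class ExtractionRange(Enum):
    WORD = "word"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    FULL_TEXT = "full_text"


# Hypothetical contents of database 130: URL prefix -> extraction range.
# The hosts and paths are placeholders, not real services.
EXTRACTION_RANGE_DB = {
    "news.example.com/": ExtractionRange.PARAGRAPH,
    "dict.example.com/": ExtractionRange.WORD,
    "kin.example.com/qna/": ExtractionRange.FULL_TEXT,
}


def lookup_extraction_range(url, db=EXTRACTION_RANGE_DB):
    """Return the range stored for the longest matching URL prefix, or None."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path  # ignore scheme, query, and fragment
    best_prefix, best_range = None, None
    for prefix, extraction_range in db.items():
        if key.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_range = prefix, extraction_range
    return best_range
```

Matching on a prefix rather than the whole URL is one way to let a single entry cover every page of a site section; an exact-match table would work equally well under the description above.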

As required, the information in the extraction range information database 130 can be changed or deleted through an online or offline request from the user. Preferably, however, only the company that mainly provides the TTS service has access rights to the information in the extraction range information database 130. As described above, the information in the extraction range information database 130 can be updated to the latest information through communication with TTS server 300. For this purpose, the up-to-date extraction range information database 500, which is included in or communicates with TTS server 300, can be used.

Obtaining the up-to-date extraction range information:

The following describes, according to one embodiment of the invention, the process in which the extraction range information confirmation unit 112 of operational part 110 confirms whether text extraction range information exists and, depending on the result, the up-to-date extraction range information is obtained from TTS server 300.

As described above, the extraction range information confirmation unit 112 confirms whether information on the text extraction range corresponding to the current web page exists in the text extraction range information database 130 of subscriber computer 100.

If it is judged that the text extraction range information database 130 holds no information on the text extraction range corresponding to the URL of the current web page, the extraction range information request portion 113 of operational part 110 requests that information from TTS server 300.

The up-to-date extraction range information obtaining section 330 of TTS server 300 then refers to the up-to-date extraction range information database 500, obtains the information on the text extraction range requested by the extraction range information request portion 113, and sends it to operational part 110 of subscriber computer 100. Operational part 110 receives the information, stores it in the text extraction range information database 130, and then extracts the text of the current web page based on it.
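The lookup-then-request flow above can be sketched as follows. This is a minimal sketch under stated assumptions: the local database is modeled as a plain dictionary, and `fetch_from_server` is a hypothetical callable standing in for the request to TTS server 300; neither name comes from the patent.

```python
def get_extraction_range(url, local_db, fetch_from_server):
    """Return the extraction range info for url, asking the server on a local miss.

    local_db          -- dict standing in for database 130 on the subscriber computer
    fetch_from_server -- hypothetical callable standing in for the request to
                         TTS server 300 (which consults database 500)
    """
    info = local_db.get(url)            # role of confirmation unit 112
    if info is None:
        info = fetch_from_server(url)   # role of request portion 113
        if info is not None:
            local_db[url] = info        # store the received info in database 130
    return info
```

Storing the received information locally means that a second visit to the same page can be served without another request to the server.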

In addition, according to one embodiment of the invention, if the extraction range information confirmation unit 112 judges that information on the text extraction range corresponding to the current web page already exists in the text extraction range information database 130, the up-to-date extraction range information request portion 115 of operational part 110 can ask TTS server 300 to judge whether that information is the latest information.

The up-to-date extraction range information judging part 310 of TTS server 300 then refers to the up-to-date extraction range information database 500 and judges whether the information on the text extraction range currently held in the text extraction range information database 130 is the latest information.

If, as a result of this judgment, the information in the text extraction range information database 130 is identical to the information in the up-to-date extraction range information database 500, TTS server 300 can transmit to operational part 110 of subscriber computer 100 a prescribed signal confirming that the information in the text extraction range information database 130 is the latest information.

If, on the other hand, the information in the text extraction range information database 130 is judged to differ from the information in the up-to-date extraction range information database 500, TTS server 300 can send the information in the up-to-date extraction range information database 500 to operational part 110 of subscriber computer 100. Operational part 110 can then replace the information stored in the text extraction range information database 130 with the information received.
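The freshness check above can be sketched as a pair of functions, one per side. This is an illustrative sketch only: both databases are modeled as dictionaries, the reply format (`"status"` and `"info"` keys) is invented here, and the sketch assumes the server always holds an entry for the URL being checked, a case the description does not address.

```python
def judge_latest(url, client_info, latest_db):
    """Server side (role of judging part 310): compare the client's info with database 500."""
    latest = latest_db[url]  # assumption: the server has an entry for this URL
    if latest == client_info:
        # Corresponds to the "prescribed signal" confirming the info is current.
        return {"status": "up_to_date"}
    # Otherwise the server sends its own, newer information.
    return {"status": "outdated", "info": latest}


def refresh_local_entry(url, local_db, latest_db):
    """Client side (role of operational part 110): replace the stored info if outdated."""
    reply = judge_latest(url, local_db[url], latest_db)
    if reply["status"] == "outdated":
        local_db[url] = reply["info"]
    return local_db[url]
```

In a real deployment the call to `judge_latest` would be a network request; it is a direct function call here only to keep the sketch self-contained.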

Text extraction mode:

According to one embodiment of the invention, when the text contained in a web page is extracted in units of words, sentences, paragraphs, or the full text, it can be extracted using the MSAA (Microsoft Active Accessibility) mode or the IHTML (Inner HTML) mode. According to one embodiment of the invention, the extraction mode can also be decided as required based on the characteristics of the web page. Here, the MSAA mode extracts the text of a specified range in a web page by using prescribed functions provided together with the commonly used Internet Explorer™ browser; the IHTML mode extracts text in units of tags from a web page written in HTML format (for example, by extracting the text between prescribed tags agreed on in advance). The text extraction mode according to the invention can be decided by the extraction mode determination section 117 shown in Fig. 2a.

For example, suppose the user accesses a web page written with the following HTML source code.

<div class='knCnt' style='overflow:hidden; word-wrap:break-word; word-break:break-all;'>

<P>Mathematics is&nbsp;closely related to science, and is an important subject that many disciplines need</P>

<P>Why is there no Nobel Prize for it?</P>

<P>Please explain the Fields Medal in detail</P>

<P>I hear it is the Nobel Prize of the mathematics world ...</P>

</div>

When the mouse-over identification part 111 of operational part 110 recognizes that a mouse-over event has occurred at the position of the word "science" in "closely related to science", the MSAA mode can extract the text between the nearest tags before and after that text (that is, <P> and </P> in the example): the sentence "Mathematics is closely related to science, and is an important subject that many disciplines need". With the IHTML mode, text can be extracted in units of an HTML tag such as <P>, but it is also possible to obtain the full HTML and extract text with the <div> tag as the reference. If text is extracted with the <div> tag as the reference in this way, the full text of the example above can be extracted.
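Tag-based extraction of the kind described for the IHTML mode can be sketched with Python's standard `html.parser` module. This is a stand-in, not the MSAA or Inner HTML mechanism itself: the class and function names are invented for illustration, the sample markup is a shortened version of the example above, and locating the paragraph by searching for the indicated word is an assumption that replaces the real mouse-over position.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every element with the given (lowercase) tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0        # nesting depth inside the target tag
        self.chunks = []      # one string per outermost target element
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:   # HTMLParser reports tag names in lowercase
            if self.depth == 0:
                self._current = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace (including &nbsp;) to single spaces.
                self.chunks.append(" ".join("".join(self._current).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._current.append(data)


def extract_by_tag(html, tag):
    """Return the text of each element named tag, in document order."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.chunks


def paragraph_containing(html, word):
    """Return the first <p> text containing word, or None (stand-in for the hover position)."""
    for text in extract_by_tag(html, "p"):
        if word in text:
            return text
    return None


SAMPLE_HTML = """<div class='knCnt'>
<P>Mathematics is&nbsp;closely related to science</P>
<P>Why is there no Nobel Prize for it?</P>
</div>"""
```

Calling `extract_by_tag(SAMPLE_HTML, "p")` yields one string per paragraph, while `extract_by_tag(SAMPLE_HTML, "div")` yields the full text as a single string, mirroring the paragraph-unit and full-text cases in the description.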

That is, when the text at the mouse-over position in a web page is extracted based on the information in the text extraction range information database 130 or the up-to-date extraction range information database 500, if the text is to be extracted in units of sentences, it is more convenient for the extraction mode determination section 117 of operational part 110 to select the MSAA mode. If, on the other hand, text of paragraph or full-text range is to be extracted according to the web page characteristics, it is preferable to select the IHTML mode, which extracts text based on prescribed HTML tags.
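The selection rule above can be sketched as a small function. The string labels and the function name are illustrative assumptions, and the `default` used for units the description does not cover (such as a single word) is a placeholder choice, not something stated in the text.

```python
def choose_extraction_mode(range_unit, default="MSAA"):
    """Pick an extraction mode for the given range unit (role of determination section 117).

    range_unit -- "word", "sentence", "paragraph", or "full_text" (labels assumed here)
    default    -- mode for units the description does not cover; an assumption
    """
    if range_unit == "sentence":
        return "MSAA"    # sentence units: MSAA is described as more convenient
    if range_unit in ("paragraph", "full_text"):
        return "IHTML"   # paragraph or full text: tag-based extraction is preferred
    return default
```

For example, `choose_extraction_mode("paragraph")` returns `"IHTML"`, matching the preference stated for paragraph and full-text ranges.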

The embodiments of the invention described above can be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium can include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and constructed for the invention, or may be known to and usable by those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code produced by a compiler as well as high-level language code that can be executed by a computer using an interpreter or the like. To carry out the operations of the invention, the above hardware devices can be configured to operate as one or more software modules, and vice versa.

As described above, the invention has been explained with reference to specific matters such as concrete components, limited embodiments, and drawings, but these are provided only to aid a fuller understanding of the invention, and the invention is not limited to the embodiments above. Those of ordinary skill in the art to which the invention belongs can make numerous changes and modifications based on this description.

Therefore, the technical scheme of the invention is not limited to the embodiments described above, and the scope of the inventive idea includes not only the scope of the claims but also everything equal or equivalent to the claims.

Claims

1. A method of extracting text based on web page characteristics, characterized in that the method comprises:

a step of recognizing a text indication point on a web page;

a step of confirming information on a text extraction range stored in correspondence with at least a part of an identifier of the web page;

a step of determining a text extraction range based on the position of the text indication point and the confirmed information on the text extraction range;

a step of extracting the text within the determined text extraction range,

wherein the information on the text extraction range includes information for determining, according to the characteristics of the web page, which of a word, a sentence, a paragraph, and the full text is to be extracted.

2. A method of extracting text based on web page characteristics, characterized in that the method comprises:

a step of recognizing a text indication point on a web page;

a step of confirming whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in a text extraction information database;

a step of receiving the information on the text extraction range if it is confirmed that the information is not stored in the text extraction information database;

a step of determining a text extraction range based on the position of the text indication point and the received information on the text extraction range;

3. method according to claim 1 and 2, wherein,

The position of above-mentioned text indication point is generated by the mouse-over incident.

4. method according to claim 3, wherein,

Above-mentioned mouse-over incident is that the mouse indication point takes place when the regulation zone stop certain hour of above-mentioned webpage is above.

5. method according to claim 1 and 2, wherein,

The identifier of above-mentioned webpage is URL.

6. method according to claim 2, wherein,

Only store the up-to-date information of the scope of extracting out about above-mentioned text in the above-mentioned text extract information database.

7. method according to claim 1 and 2, wherein,

Above-mentioned definite text is extracted the step of scope out, comprises and determines to be to use the MSAA mode also to be to use the IHTML mode to extract the step of the text of above-mentioned webpage out.

8. A method of converting text into sound, wherein,

a step of recognizing a text indication point on a web page;

a step of extracting the text within the determined text extraction range;

a step of generating voice data associated with the extracted text,

9. A method of converting text into sound,

a step of recognizing a text indication point on a web page;

a step of extracting the text within the determined text extraction range;

10. The method according to claim 8 or 9, wherein

the generated voice data is voice data corresponding to the extracted text.

11. The method according to claim 8 or 9, wherein

the generated voice data is voice data corresponding to a translation of the extracted text.

12. A system for extracting text based on web page characteristics, characterized in that the system comprises:

a text indication point identification unit that recognizes a text indication point on a web page;

a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a part of an identifier of the web page;

a text extraction range determination unit that determines a text extraction range based on the position of the text indication point and the confirmed information on the text extraction range;

a text extraction unit that extracts the text within the determined text extraction range,

13. A system for extracting text based on web page characteristics, characterized in that the system comprises:

a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in a text extraction information database and, if it is not stored, receives the information on the text extraction range;

a text extraction range determination unit that determines a text extraction range based on the position of the text indication point and the received information on the text extraction range;

14. The system according to claim 12 or 13, wherein,

15. The system according to claim 14, wherein,

16. The system according to claim 12 or 13, wherein

the identifier of the web page is a URL.

17. The system according to claim 13, wherein,

18. The system according to claim 12 or 13, wherein

the text extraction range determination unit decides whether to extract the text of the web page using the MSAA mode or the IHTML mode.

19. A system for converting text into sound, characterized in that the system comprises:

a text extraction unit that extracts the text within the determined text extraction range;

a voice data generating unit that generates voice data associated with the extracted text,

20. A system for converting text into sound, characterized in that the system comprises:

a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the web page is stored in a text extraction information database and, if it is not stored, receives the information on the text extraction range;

a text extraction range determination unit that determines a text extraction range based on the position of the text indication point and the received information on the text extraction range;

a text extraction unit that extracts the text within the determined text extraction range;

21. The system according to claim 19 or 20, wherein

the voice data generated by the voice data generating unit corresponds to the extracted text.

22. The system according to claim 19 or 20, wherein

the voice data generated by the voice data generating unit corresponds to a translation of the extracted text.