CN101441648B - Method and system based on webpage characteristic abstraction text - Google Patents
Method and system for extracting text based on webpage characteristics
- Publication number
- CN101441648B; application CN200810177071A (also listed as CN2008101770713A)
- Authority
- CN
- China
- Prior art keywords
- text
- mentioned
- scope
- extracted
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Tourism & Hospitality (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Marketing (AREA)
- Acoustics & Sound (AREA)
- Strategic Management (AREA)
- Primary Health Care (AREA)
- Human Resources & Organizations (AREA)
- Economics (AREA)
- Human Computer Interaction (AREA)
- General Business, Economics & Management (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method and a system for extracting text based on the characteristics of a web page, and a computer-readable recording medium for executing the method. In particular, one embodiment of the invention relates to a method for extracting text comprising: recognizing a text indication point on a web page; confirming information on a text extraction range stored in correspondence with at least a part of an identifier of the web page; and determining the text extraction range based on the text indication point information and the confirmed information on the text extraction range.
Description
Technical field
The present invention relates to a method and system for extracting text based on webpage characteristics. More particularly, it relates to a method, a system, and a computer-readable recording medium that, when text extracted from a webpage is supplied to a text-based service such as speech conversion or translation, can extract text of different ranges, such as a word, a sentence, a paragraph, or the full text, according to the characteristics of the webpage.
Background technology
In recent years, as Internet use has become widespread, a great variety of information has become available online. To meet the differing needs of users, companies that provide Internet services offer diverse services through their websites, and the number of service types keeps growing.
Internet users encounter these services in many forms, and in particular obtain diverse Internet content through websites, such as news, dictionary information, specialized information, regional information, and shopping information.
To obtain the content they need, users typically search for a particular webpage and then read its content, which consists mainly of text, with their own eyes. From the user's standpoint, however, content provided mainly as text no longer satisfies the needs of the multimedia era. As the amount of information contained in webpages keeps growing, a user who reads all of the text-based content by eye cannot look away from a display device such as a computer monitor. Some users also want to multitask, obtaining the information they need from the content while doing other work, and this need is likewise unmet.
In addition, computer telephony integration (CTI) technologies such as Voice over IP (VoIP), speech recognition, speech conversion, speech synthesis, and automatic answering systems have attracted considerable attention in recent years. These technologies raise the expectation of more advanced Internet services in which users issue instructions by voice, obtain information by voice, and communicate by voice in an Internet environment.
Accordingly, text-to-speech (TTS) technology has been developed to solve the problems of text-based content and to make wider use of CTI technology. TTS can be applied more broadly than speech recognition, and is a user-centered interface technology that converts various kinds of text information into speech. TTS on a webpage is mainly implemented by extracting text from the webpage, converting it into speech, and providing the speech to the user. For example, in response to a mouse-over event, which occurs when the user holds the mouse pointer at a position on the webpage for a certain time, the word at the pointer position is extracted and converted into speech; alternatively, a portion of text that the user selects by dragging is converted into speech.
However, the TTS services currently provided through webpages cannot be called a complete user-centered interface technology. Specifically, current TTS services either convert only the word at the position identified by the user's mouse-over operation, or require the user to drag the mouse to specify the text to be converted. In the former case, the word under the pointer is always converted into speech regardless of the user's intention. In the latter case, to convert text of a desired range the user must first skim the text by eye and then specify the range to be converted. This runs counter to the purpose of TTS, which is to spare the user from reading the text directly, and the selection itself takes extra time.
Therefore, a technique is needed that extracts text of a particular range (for example, a word, a sentence, a paragraph, or the full text) in line with the user's intention and the characteristics of the webpage, so that various text-based services can be provided more conveniently.
Summary of the invention
An object of the present invention is to provide a method and system that actively extract text according to the characteristics of a webpage.
Another object of the present invention is to actively extract text of different ranges according to the characteristics of the webpage, so that webpage users can easily obtain data converted from the text.
A further object of the present invention is to spare the user the inconvenience of dragging the mouse across a wide range of text, by automatically extracting text of the required range according to the characteristics of the webpage and thereby reducing unnecessary user operations.
To achieve these objects, representative configurations of the present invention are as follows.
One aspect of the present invention is a method for extracting text based on webpage characteristics, comprising the steps of: recognizing a text indication point on a webpage; confirming information on a text extraction range stored in correspondence with at least a part of an identifier of the webpage; determining a text extraction range based on the text indication point information and the confirmed information on the text extraction range; and extracting the text of the determined range.
Another aspect of the present invention is a method for extracting text based on webpage characteristics, comprising the steps of: recognizing a text indication point on a webpage; confirming whether information on a text extraction range corresponding to at least a part of an identifier of the webpage is stored in a text extraction information database; receiving the information on the text extraction range; determining a text extraction range based on the text indication point information and the received information on the text extraction range; and extracting the text of the determined range.
Another aspect of the present invention is a method for converting text into speech, which further comprises the step of generating voice data associated with the text extracted by the above method.
A system for extracting text based on webpage characteristics according to the present invention comprises: a text indication point recognition unit that recognizes a text indication point on a webpage; a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a part of an identifier of the webpage; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the confirmed information on the text extraction range; and a text extraction unit that extracts the text of the determined range.
Another aspect of the present invention is a system for extracting text based on webpage characteristics, comprising: a text extraction information database; a text indication point recognition unit that recognizes a text indication point on a webpage; a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the webpage is stored in the text extraction information database and, if it is not stored, receives the information on the text extraction range; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the received information on the text extraction range; and a text extraction unit that extracts the text of the determined range.
Another aspect of the present invention is a system for converting text into speech, comprising: a text indication point recognition unit that recognizes a text indication point on a webpage; a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a part of an identifier of the webpage; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the confirmed information on the text extraction range; a text extraction unit that extracts the text of the determined range; and a voice data generation unit that generates voice data associated with the extracted text.
Another aspect of the present invention is a system for converting text into speech, comprising: a text indication point recognition unit that recognizes a text indication point on a webpage; a text extraction range information receiving unit that confirms whether information on a text extraction range corresponding to at least a part of an identifier of the webpage is stored in a text extraction information database and, if it is not stored, receives the information on the text extraction range; a text extraction range determination unit that determines a text extraction range based on the text indication point information and the received information on the text extraction range; a text extraction unit that extracts the text of the determined range; and a voice data generation unit that generates voice data associated with the extracted text.
In addition, the present invention provides other methods and systems for extracting text based on webpage characteristics, and a computer-readable recording medium on which a computer program for executing the above methods is recorded.
According to the present invention, text is actively extracted according to webpage characteristics and text-based services such as a speech conversion service or a translation service are provided on that basis, so that the user can obtain text-based data that meets the user's needs without many operations.
Also, according to the present invention, even when a user uses a webpage without knowing its characteristics, text of a range suited to those characteristics is extracted automatically, so that the user can grasp the content of the webpage efficiently.
In addition, according to the present invention, when the user wants to extract a wide range of text from a webpage, the inconvenience of dragging across the whole text is eliminated, and text extraction errors caused by imprecise mouse dragging can be prevented.
Description of drawings
Fig. 1 is a diagram showing the general configuration of a text extraction system according to one embodiment of the present invention;
Fig. 2a is a diagram showing the detailed configuration of the user computer in the text extraction system shown in Fig. 1;
Fig. 2b is a diagram showing the detailed configuration of the TTS server in the text extraction system shown in Fig. 1;
Fig. 3 is a flowchart showing the process of extracting text and converting the extracted text into speech according to one embodiment of the present invention.
Embodiment
Various embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Configuration of the overall system:
Fig. 1 is a diagram showing the general configuration of a text extraction system according to one embodiment of the present invention.
As shown in Fig. 1, the text extraction system according to one embodiment of the present invention may include a user computer 100 and a TTS server 300. The user computer 100 and the TTS server 300 communicate over a network environment such as a local area network (LAN) or a wide area network (WAN) using a dedicated line. Such a network environment may be the well-known World Wide Web (WWW). The TTS server 300 can communicate bidirectionally with one or more user computers 100 through the Internet protocol in a known network environment. In response to a request from the user computer 100, the TTS server 300 can carry out processing with reference to the latest extraction range information database 500 and the speech conversion database 700.
Configuration of the user computer:
Fig. 2a is a diagram showing the detailed configuration of the user computer 100 in the text extraction system shown in Fig. 1. Fig. 2b is a diagram showing the detailed configuration of the TTS server 300.
As shown in Fig. 2a, the user computer 100 may include a computing unit 110, a text extraction range information database 130, a program storage unit 150, a user input unit 170, an output unit 180, and a communication unit 190.
The computing unit 110 can, as needed, refer to the text extraction range information database 130, which stores, in correspondence with a webpage identifier such as a URL (Uniform Resource Locator), information on the text extraction range for the webpage (for example, which range of text, among a word, a sentence, a paragraph, and the full text, is to be extracted according to the characteristics of the webpage). The text extraction range information database 130 may also be included as a component of the computing unit 110.
The computing unit 110 may additionally include a program driving unit (not shown) so that, when the user runs a browser, the program stored in the program storage unit 150, namely the program that extracts text according to the present invention or provides a text-based service using the extracted text, is driven at the same time. The program storage unit 150 need not be included as a component of the user computer 100; it may be replaced by a known computer-readable recording medium such as a hard disk, floppy disk, magnetic tape, CD-ROM, or DVD.
The user input unit 170 may be an ordinary computer input device such as a keyboard or a mouse. The output unit 180 may be implemented as a computer monitor that visually displays the browser and/or the webpage, or as a speaker that outputs text as speech.
Configuration of the server:
The TTS server 300 shown in Fig. 2b may be a server that provides a TTS service, that is, a service that converts at least part of the text in a webpage into speech and provides it to the user. The TTS server 300 may be the web server of an Internet portal site, or a management server of a company that provides only TTS services. According to another embodiment of the present invention, the TTS server 300 may be replaced by a general network server not directly related to TTS services.
The TTS server 300 according to one embodiment of the present invention may include a latest extraction range information judging unit 310, a latest extraction range information obtaining unit 330, and a TTS conversion unit 370. According to one embodiment of the present invention, at least some of the latest extraction range information judging unit 310, the latest extraction range information obtaining unit 330, and the TTS conversion unit 370 may be included in the TTS server 300 or may be program modules that communicate with the TTS server 300. Such program modules may be included in the TTS server 300 in the form of an operating system, application program modules, or other program modules, and may be physically stored in various known storage devices. They may also be stored in a remote storage device capable of communicating with the TTS server 300. Such program modules include, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform the particular tasks described later or implement particular abstract data types according to the present invention.
For reference, the elements shown in Fig. 1, Fig. 2a, and Fig. 2b should be understood to exchange signals with one another as needed. The known communication mechanisms used to exchange such signals are not described in detail here.
Text extraction and speech conversion:
Fig. 3 is a flowchart showing the process of extracting text and converting the extracted text into speech according to one embodiment of the present invention. The process of extracting text from a webpage, converting the extracted text into speech, and outputting it according to one embodiment of the present invention is described in detail here with reference to Fig. 2a, Fig. 2b, and Fig. 3.
When the user runs a browser on the user computer 100, the program that extracts text and converts the extracted text into speech output according to one embodiment of the present invention is driven at the same time. As described above, this program may be stored in the program storage unit 150 inside the user computer 100, or in another recording medium.
The user can then connect to the Internet and access a webpage with a given URL through the running browser. Many servers provide content that can be read in a browser, and URLs are commonly used to indicate where that content is located. A URL expresses the location of a document on a server on the Internet, but because a URL can be set fairly freely, it may also contain other information that indicates the characteristics of the webpage (for example, information on the text extraction range according to one embodiment of the present invention). In any case, a URL, or a part of a URL, can be associated with information on the text extraction range according to the present invention.
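As an illustration of how a URL, or only a part of it, might be associated with an extraction range, the following minimal sketch looks up the longest stored prefix of a URL. The table contents, the host names, and the function name lookup_extraction_range are assumptions made for this example; the patent does not specify a matching rule.

```python
from urllib.parse import urlsplit

# Hypothetical table: keys are URL prefixes (host plus optional path prefix),
# values are extraction ranges. None of these hosts come from the patent.
RANGE_BY_URL_PREFIX = {
    "news.example.com": "paragraph",
    "dict.example.com": "word",
    "example.com/blog": "full_text",
}


def lookup_extraction_range(url, table=RANGE_BY_URL_PREFIX):
    """Return the range stored for the longest matching URL prefix, or None."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path.rstrip("/")
    # Try progressively shorter prefixes by dropping one path segment at a time.
    while key:
        if key in table:
            return table[key]
        if "/" not in key:
            break
        key = key.rsplit("/", 1)[0]
    return None
```

Under this assumed rule, every page below news.example.com shares one entry, while a page with no matching prefix yields None, which corresponds to the case in which no information is stored for the URL.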
With reference to Fig. 3, the process of extracting text from a webpage and outputting it as converted voice data according to one embodiment of the present invention is described below.
First, when the user places the mouse pointer on text of the webpage displayed by the browser of the user computer 100, the mouse-over recognition unit 111 of the computing unit 110 determines, in step S310, whether a mouse-over event has occurred.
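The mouse-over decision in step S310 can be pictured as a dwell test: the pointer must rest near one position for a certain time. The sketch below is only one possible reading; the sample format, the 0.5 second threshold, the 3 pixel tolerance, and the name find_hover are all assumptions, not values taken from the patent.

```python
def find_hover(samples, dwell_seconds=0.5, tolerance=3):
    """Return the (x, y) where the pointer first rests for dwell_seconds, else None.

    samples is a time-ordered list of (timestamp, x, y) tuples.
    """
    anchor = None  # first sample of the current resting period
    for t, x, y in samples:
        moved = (
            anchor is None
            or abs(x - anchor[1]) > tolerance
            or abs(y - anchor[2]) > tolerance
        )
        if moved:
            anchor = (t, x, y)  # pointer moved: restart the dwell timer
        elif t - anchor[0] >= dwell_seconds:
            return (anchor[1], anchor[2])
    return None
```

The returned position plays the role of the text indication point that the later steps use to decide which word, sentence, or paragraph the user is pointing at.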
In step S330, the extraction range information confirmation unit 112 determines whether the text extraction range information database 130 holds information on the text extraction range stored in correspondence with the URL of the current webpage. As mentioned briefly above, the text extraction range information database 130 stores information on the text extraction range in correspondence with webpage URLs. This information may be stored separately for each URL, or may be grouped and stored by webpage type. This is described in more detail below.
If, in step S330, the extraction range information confirmation unit 112 determines that no information on the text extraction range corresponding to the URL of the current webpage exists in the text extraction range information database 130, then in step S331 the extraction range information request unit 113 requests that information from the TTS server 300. According to one embodiment of the present invention, the TTS server 300 refers to the latest extraction range information database 500 and periodically updates the information needed to provide the TTS service, namely the text extraction range for each URL and the type of webpage provided at each URL, so that up-to-date information on text extraction ranges is stored. When the extraction range information request unit 113 requests the information on the text extraction range corresponding to the URL of the current webpage, the latest extraction range information obtaining unit 330 of the TTS server 300 refers to the latest extraction range information database 500 and sends the latest information to the computing unit 110 of the user computer 100.
If, in step S330, the extraction range information confirmation unit 112 determines that information on the text extraction range corresponding to the URL exists in the text extraction range information database 130, then in step S333 the latest extraction range information request unit 115 of the computing unit 110 checks whether the information in the text extraction range information database 130 is the latest and, if it is not, sends a request to obtain the latest information from the TTS server 300. In response to the request from the latest extraction range information request unit 115, the latest extraction range information judging unit 310 of the TTS server 300 refers to the information stored in the latest extraction range information database 500 and judges whether the information in the text extraction range information database 130 is the latest. If it is, the TTS server 300 sends a predetermined signal to the user computer 100. If it is not, the latest extraction range information obtaining unit 330 can send the latest text extraction range information stored in the latest extraction range information database 500 to the computing unit 110.
In step S340, in response to the request sent in step S331 or step S333, the computing unit 110 receives the text extraction range information sent by the TTS server 300. That is, the computing unit 110 receives information indicating whether to extract only the word at the mouse-over position in the webpage text currently displayed in the browser, the sentence or paragraph containing it, or the full text of the webpage. Because the latest extraction range information database 500 referred to by the TTS server 300 is updated with the latest text extraction range information for each URL, the information received from the TTS server 300 in step S340 is always up to date.
The computing unit 110 stores the latest text extraction range information received from the TTS server 300 in the text extraction range information database 130. If information for the current webpage exists in the text extraction range information database 130 but the TTS server 300 determines that it is not the latest, the stored information is updated with the latest information sent by the TTS server 300; otherwise the update can be omitted. When the determination in step S330 is "No", the newly received latest text extraction range information is stored as a new entry in the text extraction range information database 130.
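Steps S330 through S340 amount to a local cache that is checked against the server. The following sketch illustrates that flow under stated assumptions: the integer version field, the RangeRecord and FakeTtsServer classes, and the method names is_latest and fetch_latest are all hypothetical, since the patent does not say how freshness is judged or how the request is encoded.

```python
from dataclasses import dataclass


@dataclass
class RangeRecord:
    extraction_range: str  # "word", "sentence", "paragraph" or "full_text"
    version: int           # hypothetical freshness marker


class FakeTtsServer:
    """Stand-in for TTS server 300 and its latest extraction range database 500."""

    def __init__(self, latest):
        self.latest = latest  # maps a URL key to a RangeRecord

    def is_latest(self, key, version):
        return self.latest[key].version == version

    def fetch_latest(self, key):
        return self.latest[key]


def resolve_range(key, local_db, server):
    """Return the extraction range for key, refreshing local_db when needed."""
    record = local_db.get(key)                        # step S330
    if record is None:
        record = server.fetch_latest(key)             # steps S331 and S340
        local_db[key] = record                        # stored as a new entry
    elif not server.is_latest(key, record.version):   # step S333
        record = server.fetch_latest(key)             # step S340
        local_db[key] = record                        # stale entry updated
    return record.extraction_range
```

Here local_db stands for the text extraction range information database 130 and is modeled as a plain dictionary; when the stored entry is already the latest, no update is performed, matching the case in which the update can be omitted.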
In step S350, based on the text extraction range information received from the TTS server 300 or on the latest text extraction range information stored in the text extraction range information database 130, the text extraction range and the extraction mode required to extract a word, a sentence, a paragraph, or the full text are determined. Examples of text extraction modes according to the present invention are described later.
In step S360, text is extracted based on the text extraction range and text extraction mode determined in the previous step. The extracted text can be shown in reverse video or otherwise made visually distinguishable from the text that was not extracted. The user can thus see which part of the webpage text has been extracted, and can indirectly confirm what characteristics the webpage has. Furthermore, if the user considers the text extraction range associated with the webpage to be incorrect, the user can report this to the TTS server 300 as feedback.
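To make the four ranges concrete, the sketch below extracts the word, sentence, paragraph, or full text that contains a given character offset in plain text. It is an illustration under simplifying assumptions rather than the patent's extraction mode: the page is treated as plain text, paragraphs are assumed to be separated by a blank line, sentences are assumed to end at ".", "!" or "?", and the name extract_text is hypothetical.

```python
import re

RANGES = ("word", "sentence", "paragraph", "full_text")


def extract_text(text, offset, extraction_range):
    """Return the unit of the given range that contains character offset."""
    if extraction_range not in RANGES:
        raise ValueError(f"unknown extraction range: {extraction_range!r}")
    if extraction_range == "full_text":
        return text.strip()
    if extraction_range == "paragraph":
        # Paragraphs are assumed to be separated by a blank line.
        start = text.rfind("\n\n", 0, offset)
        start = 0 if start == -1 else start + 2
        end = text.find("\n\n", offset)
        end = len(text) if end == -1 else end
        return text[start:end].strip()
    # A word is a run of word characters; a sentence ends at ".", "!" or "?".
    pattern = r"\w+" if extraction_range == "word" else r"[^.!?\n]+[.!?]?"
    for match in re.finditer(pattern, text):
        if match.start() <= offset < match.end():
            return match.group().strip()
    return ""  # the pointer is not over any text of that kind
```

In a browser the offset would come from the text indication point; a real implementation would more likely work on the page's document structure than on flattened text.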
In step S370, the text extracted in step S360 is sent to the TTS server 300. The TTS conversion unit 370 of the TTS server 300 converts the received text into voice data by referring to the speech conversion database 700, which stores the information needed to convert text into speech, and sends the voice data back to the user computer 100. The speech conversion database 700 may store voice data for each encoded text, or for each word, sentence, or paragraph.
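One way to picture a database that stores voice data per word, sentence, or paragraph is a lookup that prefers an entry for the whole text and falls back to per-word entries. This is a toy sketch: the dictionary speech_db, the byte strings standing in for audio clips, the key normalization, and the name synthesize are assumptions for illustration, and real speech synthesis is considerably more involved.

```python
def synthesize(text, speech_db):
    """Return a list of stored audio clips for text.

    A clip stored for the whole text is preferred; otherwise one clip per
    word is looked up. Raises KeyError if a word has no stored clip.
    """
    if text in speech_db:
        return [speech_db[text]]
    clips = []
    for word in text.split():
        key = word.strip(".,!?").lower()  # normalize before the lookup
        if key not in speech_db:
            raise KeyError(f"no stored audio for {key!r}")
        clips.append(speech_db[key])
    return clips
```

The list of clips would then be concatenated or played in sequence by whatever audio layer is in use, which is outside the scope of this sketch.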
At step S380, subscriber computer 100 receives the voice data that transmits from TTS server 300.
At step S390, the tut data that receive provide portion 119 to provide by the voice data of operational part 110, and this voice data can be by 180 outputs of efferents such as loudspeaker.
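The order of steps S350 to S390 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the class `TTSPipeline` and its four callables are hypothetical names that do not appear in the specification, and each callable stands in for a component (extraction mode decision, text extraction, the round trip to the TTS server 300, and audio output) whose real behavior the specification describes only in prose.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TTSPipeline:
    """Wires steps S350 to S390 together; every callable is a hypothetical stand-in."""

    decide_mode: Callable[[str], str]        # S350: scope -> extraction mode
    extract: Callable[[str, str, str], str]  # S360: (page, scope, mode) -> text
    synthesize: Callable[[str], bytes]       # S370/S380: round trip to the TTS server
    play: Callable[[bytes], None]            # S390: output, e.g. through a loudspeaker

    def run(self, page: str, scope: str) -> Tuple[str, bytes]:
        mode = self.decide_mode(scope)          # S350
        text = self.extract(page, scope, mode)  # S360
        audio = self.synthesize(text)           # S370 (send) and S380 (receive)
        self.play(audio)                        # S390
        return text, audio
```

Passing the steps in as callables keeps the sketch neutral about where each step runs, which matches the statement that extraction and sound conversion may take place on either the user computer or the server.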
In the embodiment described in this specification, the text extraction range information database 130 exists in the user computer 100, separately from the TTS server 300, and text is basically extracted based on the text extraction range information stored there. However, this component may be omitted, and the reference database used to decide which range of text to extract may be unified into the up-to-date extraction range information database 500. It should also be understood that, unlike the description above, the sound conversion need not pass through the TTS server 300 and may instead be performed in the user computer 100 with reference to the sound mapping database 700. Likewise, depending on the variation of the above embodiment, the text extraction referred to in the present invention may be performed not only in the user computer 100 but also on the TTS server 300.
Use of text extraction range information:
It will be appreciated that, according to one embodiment of the invention, text of different ranges can be extracted and used according to the characteristics of the webpage. Examples of webpage characteristics that serve as criteria for distinguishing text extraction ranges are described below.
Every webpage that a user can access by running a browser on the user computer 100 and connecting to the Internet has its own URL, and each webpage has certain characteristics. Webpages can be classified by the attribute of their content into news article pages, lifestyle information pages, shopping information pages, encyclopedia pages, language dictionary pages, specialized information pages, blog pages, and so on. If the content of a webpage is a news article, its users generally do not focus on particular words or sentences, but rather want to understand the full text of the article or a given paragraph. When a user visits a webpage that concentrates specialized information, such as the applicant's well-known knowledge service "knowledge iN", the user is generally interested only in the question asked and the content of the answers. Users of encyclopedia pages or language dictionary pages are generally interested in the definition of a particular word and in example sentences that illustrate it. Therefore, the range of text extracted as the basis of a text-based service should also differ according to the attribute or type of the content in the webpage. That is, for a webpage containing a news article, it is generally desirable to extract the text in units of paragraphs or the full text; for a dictionary page, it is generally desirable to preferentially extract a word and the text of its explanation.
For this purpose, the text extraction range information database 130 according to one embodiment of the invention can store information on text extraction ranges that differ according to the characteristics of each webpage. As described above, the database can store the URL or other identifier of a webpage in association with the information on its text extraction range.
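The association between a webpage identifier and its text extraction range can be sketched as a prefix lookup. The following is a minimal illustration, not the patent's data model: the `Scope` enum, the `RANGE_DB` entries, the example hosts, and the longest-prefix rule are all assumptions chosen to show how "at least a portion of" a URL could select a range.

```python
from enum import Enum
from typing import Dict, Optional
from urllib.parse import urlsplit


class Scope(Enum):
    WORD = "word"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    FULL_TEXT = "full_text"


# Hypothetical entries: keys are URL prefixes (host + path), values are scopes.
RANGE_DB: Dict[str, Scope] = {
    "news.example.com/": Scope.PARAGRAPH,     # news articles: paragraph units
    "dic.example.com/": Scope.WORD,           # dictionary pages: word units
    "kin.example.com/qna/": Scope.FULL_TEXT,  # question and answer pages: full text
}


def lookup_scope(url: str, db: Dict[str, Scope] = RANGE_DB) -> Optional[Scope]:
    """Return the scope stored for the longest URL prefix that matches `url`."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path  # ignore scheme, query string, and fragment
    matches = [prefix for prefix in db if key.startswith(prefix)]
    if not matches:
        return None  # no entry: the caller would ask the server
    return db[max(matches, key=len)]
```

Matching on a prefix rather than the whole URL means one entry can cover every article of a news site, which is one plausible reading of storing the range against part of the identifier.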
As required, the information in the text extraction range information database 130 can be changed or deleted by a user's online or offline request. Preferably, however, only the company that principally provides the TTS service has access rights to the information in the database. As mentioned above, the information in the text extraction range information database 130 can be updated to the latest information through communication with the TTS server 300. For this purpose, the up-to-date extraction range information database 500, which is included in or communicates with the TTS server 300, can be used.
Obtaining up-to-date extraction range information:
According to one embodiment of the invention, the following describes the process in which the extraction range information confirmation unit 112 of the operational part 110 confirms whether text extraction range information exists and, depending on the result, up-to-date extraction range information is obtained from the TTS server 300.
As mentioned above, the extraction range information confirmation unit 112 confirms whether information on the text extraction range corresponding to the current webpage exists in the text extraction range information database 130 of the user computer 100.
If it is judged that the text extraction range information database 130 contains no information on the text extraction range corresponding to the URL of the current webpage, the extraction range information request portion 113 of the operational part 110 requests that information from the TTS server 300.
The up-to-date extraction range information obtaining section 330 of the TTS server 300 then refers to the up-to-date extraction range information database 500, obtains the text extraction range information requested by the extraction range information request portion 113, and sends it to the operational part 110 of the user computer 100. The operational part 110 receives the information, stores it in the text extraction range information database 130, and then extracts the text of the current webpage based on it.
In addition, according to one embodiment of the invention, if the extraction range information confirmation unit 112 judges that information on the text extraction range corresponding to the current webpage already exists in the text extraction range information database 130, the up-to-date extraction range information request portion 115 of the operational part 110 can ask the TTS server 300 to judge whether that information is up to date.
The up-to-date extraction range information judging part 310 of the TTS server 300 then refers to the up-to-date extraction range information database 500 and judges whether the text extraction range information currently held in the text extraction range information database 130 is up to date.
If, as a result, the information in the text extraction range information database 130 is identical to the information in the up-to-date extraction range information database 500, the TTS server 300 can transmit to the operational part 110 of the user computer 100 a prescribed signal confirming that the information in the text extraction range information database 130 is up to date.
If, on the other hand, the information in the text extraction range information database 130 is judged to differ from the information in the up-to-date extraction range information database 500, the TTS server 300 can send the information in the up-to-date extraction range information database 500 to the operational part 110 of the user computer 100. The operational part 110 can then replace the information stored in the text extraction range information database 130 with the received information.
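The exchange described above (request on a miss, freshness check on a hit, replace on a mismatch) can be sketched as follows. This is a simplified model under stated assumptions: `TTSServerStub`, `fetch`, `check`, and `resolve_range_info` are hypothetical names, plain dictionaries stand in for databases 130 and 500, and treating a URL unknown to the server as "already current" is a choice made for the sketch, not something the specification states.

```python
from typing import Dict, Optional, Tuple


class TTSServerStub:
    """Stand-in for the TTS server 300 and its up-to-date database 500."""

    def __init__(self, latest: Dict[str, str]) -> None:
        self.latest = latest

    def fetch(self, url: str) -> Optional[str]:
        """Answer a request for range information the client does not have."""
        return self.latest.get(url)

    def check(self, url: str, cached: str) -> Tuple[bool, Optional[str]]:
        """Return (is_latest, replacement); replacement is None when cached is current."""
        latest = self.latest.get(url)
        if latest is None or latest == cached:  # assumption: unknown URL counts as current
            return True, None
        return False, latest


def resolve_range_info(url: str, local_db: Dict[str, str],
                       server: TTSServerStub) -> Optional[str]:
    """Return the range information for `url`, updating the local database as needed."""
    cached = local_db.get(url)
    if cached is None:
        # Miss: request the information and store it locally.
        info = server.fetch(url)
        if info is not None:
            local_db[url] = info
        return info
    # Hit: ask the server whether the cached information is still the latest.
    is_latest, replacement = server.check(url, cached)
    if not is_latest and replacement is not None:
        local_db[url] = replacement
    return local_db[url]
```

For example, with a server holding `{"a": "paragraph", "b": "word"}` and a local database holding `{"b": "sentence"}`, resolving `"a"` stores and returns `"paragraph"`, and resolving `"b"` replaces the stale entry with `"word"`.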
Text extraction modes:
According to one embodiment of the invention, when the text contained in a webpage is extracted in units of a word, a sentence, a paragraph, or the full text, it can be extracted using the MSAA (Microsoft Active Accessibility) mode or the IHTML (Inner HTML) mode. According to one embodiment of the invention, the extraction mode can also be decided as required based on the characteristics of the webpage. Here, the MSAA mode extracts the text of a specified range in the webpage using prescribed functions provided together with the commonly used Internet Explorer™ browser; the IHTML mode extracts text in units of tags from a webpage written in HTML format (for example, by extracting the text between prescribed tags agreed on in advance). The text extraction mode according to the present invention can be decided by the extraction mode determination section 117 shown in Fig. 2a.
For example, suppose the user visits a webpage written with the following HTML source code.
<div class='knCnt' style='overflow:hidden;word-wrap:break-word;word-break:break-all;'>
<P>Mathematics is&nbsp;closely related to science, and is an important subject that many disciplines need</P>
<P>Why is there no Nobel Prize for it?</P>
<P>Please describe the Fields Medal in detail</P>
<P>I hear it is the Nobel Prize of the mathematics world ...</P>
</div>
When the mouse-over identification part 111 of the operational part 110 recognizes that a mouse-over event has occurred at the position of the word "science" in "closely related to science", the MSAA mode can extract the text between the nearest tags before and after that text (that is, <P> and </P> in this example): the sentence "Mathematics is closely related to science, and is an important subject that many disciplines need". With the IHTML mode, text can be extracted in units of an HTML tag such as <P>, but it is also possible to obtain the full HTML and extract the text with the <div> tag as the reference. If the text is extracted with the <div> tag as the reference, the full text of the above example is extracted.
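The difference between extracting at the <P> level and at the <div> level can be illustrated with Python's standard `html.parser`. This sketch is not MSAA and not the IHTML mechanism of the browser: it only mimics tag-bounded extraction on a shortened copy of the example page, and it locates the hovered position by searching for the hovered word, which is an assumption made to keep the example self-contained. `TagTextCollector`, `extract_by_tag`, and `PAGE` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TagTextCollector(HTMLParser):
    """Collect the whitespace-normalized text of every outermost element named `tag`."""

    def __init__(self, tag: str) -> None:
        super().__init__(convert_charrefs=True)  # turns &nbsp; into a real character
        self.tag = tag.lower()                   # the parser reports tag names in lowercase
        self.depth = 0
        self.blocks: List[str] = []
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # split() also treats the non-breaking space as whitespace
                self.blocks.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def extract_by_tag(html: str, tag: str, hovered_word: str) -> Optional[str]:
    """Return the text of the first `tag` element whose text contains `hovered_word`."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        if hovered_word in block:
            return block
    return None


# Shortened version of the example page, for illustration only.
PAGE = """<div class='knCnt' style='overflow:hidden;'>
<P>Mathematics is&nbsp;closely related to science</P>
<P>Why is there no Nobel Prize for it?</P>
</div>"""
```

Calling `extract_by_tag(PAGE, "p", "science")` yields only the first paragraph, while `extract_by_tag(PAGE, "div", "science")` yields both paragraphs joined, mirroring the sentence-level and full-text results described above.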
That is, when the text at the mouse-over position in a webpage is extracted based on the information in the text extraction range information database 130 or the up-to-date extraction range information database 500, it is more convenient for the extraction mode determination section 117 of the operational part 110 to select the MSAA mode if the text is to be extracted in units of sentences. If, according to the webpage characteristics, a paragraph or the full text is to be extracted, the IHTML mode, which extracts text based on prescribed HTML tags, is preferably selected.
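The selection rule stated above reduces to a small lookup. In this sketch the scope strings and the function name `choose_extraction_mode` are hypothetical, and the entry for word-level extraction is an assumption, since the text only states a preference for sentences (MSAA) and for paragraphs or the full text (IHTML).

```python
MODE_BY_SCOPE = {
    "word": "MSAA",        # assumption: the source does not say which mode suits words
    "sentence": "MSAA",    # stated: sentence units are more convenient with MSAA
    "paragraph": "IHTML",  # stated: paragraphs are extracted by HTML tag
    "full_text": "IHTML",  # stated: the full text is extracted by HTML tag
}


def choose_extraction_mode(scope: str) -> str:
    """Return the extraction mode for a scope, or raise ValueError if it is unknown."""
    try:
        return MODE_BY_SCOPE[scope]
    except KeyError:
        raise ValueError(f"unknown extraction scope: {scope!r}") from None
```

Keeping the rule in a table rather than in branching code makes it easy to change the preference per scope, which fits the statement that the mode may be decided as required from the webpage characteristics.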
The embodiments of the present invention described above can be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium can contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code produced by a compiler as well as high-level language code that can be executed on a computer using an interpreter or the like. To perform the operations of the present invention, the hardware devices may be configured to operate as one or more software modules, and vice versa.
Although the present invention has been described above with reference to specific items such as concrete components, limited embodiments, and the drawings, these are provided only to aid a fuller understanding of the invention, and the invention is not limited to the above embodiments. Those of ordinary skill in the art to which the invention belongs can make numerous changes and modifications based on the above description.
Therefore, the technical idea of the present invention is not limited to the embodiments described above; its scope covers not only the appended claims but also everything equal or equivalent to those claims.
Claims (22)
1. A method for extracting text based on webpage characteristics, characterized in that the method comprises:
a step of identifying a text indication point on a webpage;
a step of confirming information on a text extraction range stored in correspondence with at least a portion of an identifier of the webpage;
a step of determining the text extraction range based on the position of the text indication point and the confirmed information on the text extraction range; and
a step of extracting the text of the determined text extraction range,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
2. A method for extracting text based on webpage characteristics, characterized in that the method comprises:
a step of identifying a text indication point on a webpage;
a step of confirming whether information on a text extraction range corresponding to at least a portion of an identifier of the webpage is stored in a text extraction information database;
a step of receiving the information on the text extraction range if it is confirmed that the information is not stored in the text extraction information database;
a step of determining the text extraction range based on the position of the text indication point and the received information on the text extraction range; and
a step of extracting the text of the determined text extraction range,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
3. The method according to claim 1 or 2, wherein
the position of the text indication point is generated by a mouse-over event.
4. The method according to claim 3, wherein
the mouse-over event occurs when the mouse pointer stays in a prescribed region of the webpage for a certain time or longer.
5. The method according to claim 1 or 2, wherein
the identifier of the webpage is a URL.
6. The method according to claim 2, wherein
only the up-to-date information on the text extraction range is stored in the text extraction information database.
7. The method according to claim 1 or 2, wherein
the step of determining the text extraction range includes a step of determining whether to use the MSAA mode or the IHTML mode to extract the text of the webpage.
8. A method for converting text into sound, comprising:
a step of identifying a text indication point on a webpage;
a step of confirming information on a text extraction range stored in correspondence with at least a portion of an identifier of the webpage;
a step of determining the text extraction range based on the position of the text indication point and the confirmed information on the text extraction range;
a step of extracting the text of the determined text extraction range; and
a step of generating voice data associated with the extracted text,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
9. A method for converting text into sound, comprising:
a step of identifying a text indication point on a webpage;
a step of confirming whether information on a text extraction range corresponding to at least a portion of an identifier of the webpage is stored in a text extraction information database;
a step of receiving the information on the text extraction range if it is confirmed that the information is not stored in the text extraction information database;
a step of determining the text extraction range based on the position of the text indication point and the received information on the text extraction range;
a step of extracting the text of the determined text extraction range; and
a step of generating voice data associated with the extracted text,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
10. The method according to claim 8 or 9, wherein
the generated voice data is voice data corresponding to the extracted text.
11. The method according to claim 8 or 9, wherein
the generated voice data is voice data corresponding to a text obtained by translating the extracted text.
12. A system for extracting text based on webpage characteristics, characterized in that the system comprises:
a text indication point identification part that identifies a text indication point on a webpage;
a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a portion of an identifier of the webpage;
a text extraction range determination portion that determines the text extraction range based on the position of the text indication point and the confirmed information on the text extraction range; and
a text extraction unit that extracts the text of the determined text extraction range,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
13. A system for extracting text based on webpage characteristics, characterized in that the system comprises:
a text indication point identification part that identifies a text indication point on a webpage;
a text extraction range information receiving portion that confirms whether information on a text extraction range corresponding to at least a portion of an identifier of the webpage is stored in a text extraction information database and, when it is not stored, receives the information on the text extraction range;
a text extraction range determination portion that determines the text extraction range based on the position of the text indication point and the received information on the text extraction range; and
a text extraction unit that extracts the text of the determined text extraction range,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
14. The system according to claim 12 or 13, wherein
the position of the text indication point is generated by a mouse-over event.
15. The system according to claim 14, wherein
the mouse-over event occurs when the mouse pointer stays in a prescribed region of the webpage for a certain time or longer.
16. The system according to claim 12 or 13, wherein
the identifier of the webpage is a URL.
17. The system according to claim 13, wherein
only the up-to-date information on the text extraction range is stored in the text extraction information database.
18. The system according to claim 12 or 13, wherein
the text extraction range determination portion determines whether to use the MSAA mode or the IHTML mode to extract the text of the webpage.
19. A system for converting text into sound, characterized in that the system comprises:
a text indication point identification part that identifies a text indication point on a webpage;
a text extraction range information confirmation unit that confirms information on a text extraction range stored in correspondence with at least a portion of an identifier of the webpage;
a text extraction range determination portion that determines the text extraction range based on the position of the text indication point and the confirmed information on the text extraction range;
a text extraction unit that extracts the text of the determined text extraction range; and
a voice data generating unit that generates voice data associated with the extracted text,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
20. A system for converting text into sound, characterized in that the system comprises:
a text indication point identification part that identifies a text indication point on a webpage;
a text extraction range information receiving portion that confirms whether information on a text extraction range corresponding to at least a portion of an identifier of the webpage is stored in a text extraction information database and, if it is not stored, receives the information on the text extraction range;
a text extraction range determination portion that determines the text extraction range based on the position of the text indication point and the received information on the text extraction range;
a text extraction unit that extracts the text of the determined text extraction range; and
a voice data generating unit that generates voice data associated with the extracted text,
wherein the information on the text extraction range includes information for determining, according to the webpage characteristics, which of a word, a sentence, a paragraph, and the full text is to be extracted.
21. The system according to claim 19 or 20, wherein
the voice data generated by the voice data generating unit corresponds to the extracted text.
22. The system according to claim 19 or 20, wherein
the voice data generated by the voice data generating unit corresponds to a text obtained by translating the extracted text.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020070119406 | 2007-11-21 | ||
KR10-2007-0119406 | 2007-11-21 | ||
KR1020070119406A KR100958934B1 (en) | 2007-11-21 | 2007-11-21 | Method, system and computer-readable recording medium for extracting text based on characteristic of web page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101441648A CN101441648A (en) | 2009-05-27 |
CN101441648B true CN101441648B (en) | 2011-12-14 |
Family
ID=40726086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008101770713A Active CN101441648B (en) | 2007-11-21 | 2008-11-19 | Method and system based on webpage characteristic abstraction text |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP4907635B2 (en) |
KR (1) | KR100958934B1 (en) |
CN (1) | CN101441648B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101363155B1 (en) * | 2009-08-04 | 2014-02-14 | 배경아 | system and method for recogniting and searching the text included image area that pointed by a pointing device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1952929A (en) * | 2005-10-20 | 2007-04-25 | 关涛 | Extraction method and system of structured data of internet based on sample & faced to regime |
CN1991749A (en) * | 2005-12-31 | 2007-07-04 | 腾讯科技(深圳)有限公司 | Personal information management method based on personal information management software |
CN101042620A (en) * | 2006-03-20 | 2007-09-26 | 三星电子株式会社 | Pointing input device, method, and system using image pattern |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003505756A (en) * | 1999-05-28 | 2003-02-12 | インデックス システムズ インコーポレイテッド | Method and system for using selected text on a web page for searching a database of television programs |
KR20010099529A (en) * | 2000-04-27 | 2001-11-09 | 이장욱 | Method of Providing Information on the Web Page in the Internet TV Terminal |
JP2003248613A (en) * | 2001-11-20 | 2003-09-05 | Sharp Corp | Information distributing system and distributed information creating device used therein |
KR100451739B1 (en) * | 2002-01-21 | 2004-10-08 | 엘지전자 주식회사 | Internet TV and Method for Display Text of The Same |
2007
- 2007-11-21 KR KR1020070119406A (patent KR100958934B1): active, IP Right Grant
2008
- 2008-11-19 JP JP2008295183A (patent JP4907635B2): not active, Expired - Fee Related
- 2008-11-19 CN CN2008101770713A (patent CN101441648B): active, Active
Also Published As
Publication number | Publication date |
---|---|
KR20090052757A (en) | 2009-05-26 |
JP4907635B2 (en) | 2012-04-04 |
CN101441648A (en) | 2009-05-27 |
JP2009129456A (en) | 2009-06-11 |
KR100958934B1 (en) | 2010-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |