CN105912661A - Method and apparatus for removing html tag from search engine - Google Patents
- Publication number
- CN105912661A (application number CN201610222050.3A)
- Authority
- CN
- China
- Prior art keywords
- html label
- data source
- html
- label
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9562—Bookmark management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/972—Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a method and an apparatus for removing HTML tags in a search engine. The method comprises: before a website server processes a data source that contains HTML tags and is formed after a user edits content on a website, removing the HTML tags from the data source; performing semantic word segmentation on the data source from which the HTML tags have been removed; and storing the segmented content in a search database maintained by the website.
Description
Technical field
The present invention relates to the technical field of web search, and in particular to a method for removing hypertext markup language (HTML) tags in a search engine.
Background
When a user searches for content, some search systems return results that do not match the keyword that was entered. For example, a search for the keyword "blog" returns many results, yet the word "blog" cannot be found anywhere in their visible content.
This happens because part of the searchable content in the data source was edited by the user in a rich-text editor and was stored in the database as the data source. In other words, that part of the content is in HTML form, and it is matched by the search only because the class attribute of one of its tags happens to contain "blog".
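The false match described above can be reproduced in a few lines. The following is a minimal sketch, not the patent's implementation: the stored string and the two helper functions are hypothetical, and "indexing" is reduced to collecting runs of letters.

```python
import re

# Hypothetical rich-text content stored as-is. The word "blog" appears only
# inside the class attribute, never in the visible text.
stored = '<p class="blog context">Welcome to my travel notes</p>'

def naive_terms(text):
    """Collect index terms the way a tag-unaware indexer would: every run of letters."""
    return {t.lower() for t in re.findall(r"[A-Za-z]+", text)}

def visible_terms(text):
    """Collect index terms after dropping everything between '<' and '>'."""
    return naive_terms(re.sub(r"<[^>]*>", " ", text))

print("blog" in naive_terms(stored))    # True: the attribute value leaks into the index
print("blog" in visible_terms(stored))  # False: only the visible text is indexed
```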
The prior art uses the Solr search engine to solve this problem. Solr's approach is to configure the removal of HTML tags from a given field when the data is fetched from the database. However, the official Solr release is updated frequently, with a new version every one or two months. The versions are rather unstable, and in some versions the transformer fails. In addition, this solution applies only to that one search engine. Its applicability is too narrow, and the same problem still arises with many other search engines.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for removing HTML tags in a search engine, which can remove the HTML tags in a data source and are generally applicable.
One aspect of the embodiments of the present invention provides a method for removing HTML tags in a search engine, including:
receiving a data source that contains HTML tags and is formed after a user edits content on a website, and removing the HTML tags from the data source;
performing semantic word segmentation on the data source from which the HTML tags have been removed; and
storing the segmented content in a search database of the website server.
Optionally, removing the HTML tags from the data source includes:
searching the data source for HTML tags according to the prescribed format of HTML tags, and removing the HTML tags.
Optionally, searching the data source for HTML tags according to the prescribed format of HTML tags and removing the HTML tags specifically includes:
reading the data source containing HTML tags into a reusable string text buffer, and performing regular-expression matching in that buffer according to the prescribed format of HTML tags to remove the HTML tags from the data source.
Optionally, the method further includes:
sending the user a prompt message asking the user to choose whether to remove the HTML tags from the data source.
Another aspect of the embodiments of the present invention provides an apparatus for removing HTML tags in a search engine, including:
a removal module, configured to receive a data source that contains HTML tags and is formed after a user edits content on a website, and to remove the HTML tags from the data source;
a word segmentation module, configured to perform semantic word segmentation on the data source from which the HTML tags have been removed; and
a storage module, configured to store the segmented content in a search database of the website server.
Optionally, the removal module is specifically configured to search the data source for HTML tags according to the prescribed format of HTML tags and to remove the HTML tags.
Optionally, the removal module is specifically configured to read the data source containing HTML tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of HTML tags to remove the HTML tags from the data source.
Optionally, the apparatus further includes:
a prompting module, configured to send the user a prompt asking whether to remove the HTML tags from the data source.
In the method and apparatus for removing HTML tags in a search engine provided by the embodiments of the present invention, the tags are removed from a data source containing HTML tags before the data source is stored in the search database. This solves the technical problem in the prior art that search results do not contain the keyword because the data source contains HTML tags, and thereby eliminates the influence of HTML tags on retrieval results while remaining generally applicable.
Brief description of the drawings
Fig. 1 is a flowchart of a method for removing HTML tags in a search engine provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method for removing HTML tags in a search engine provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an apparatus for removing HTML tags in a search engine provided by an embodiment of the present invention.
Detailed description
As for the problem in the prior art, HTML tags can be removed with a regular expression. The key question is at which step to do it. Broadly, the removal can be done either when the index is built or when the search is performed, and doing it when the index is built is of course the most efficient. The search engine in use at the time was Solr, which does have a regular-expression filtering function, but that step runs after word segmentation. Once the text has been segmented, the HTML tags have also been scattered by the semantic segmentation, so a regular expression can no longer be applied. Because of this, the built-in solutions of existing search engines generally cannot solve the problem. Chinese is comparatively complex, whereas foreign software is designed around its own segmenters, which essentially split on spaces, and that approach does not work well for Chinese.
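The ordering problem can be illustrated with a small sketch. It assumes a whitespace split as a stand-in for a real segmenter and a simple `<...>` pattern as the tag format. Neither is taken from Solr or IK. It only shows why a tag filter applied token by token after segmentation misses tags that were split across tokens.

```python
import re

TAG = re.compile(r"<[^>]*>")  # assumed tag format: '<', anything but '>', then '>'

source = '<p class="blog context">search engines</p>'

# Order A: segment first (whitespace split stands in for a real segmenter),
# then try to filter tags out of each token separately.
tokens_a = [TAG.sub("", tok) for tok in source.split()]
tokens_a = [t for t in tokens_a if t]

# Order B: strip tags from the whole string first, then segment.
tokens_b = TAG.sub(" ", source).split()

print(tokens_a)  # ['<p', 'class="blog', 'context">search', 'engines']
print(tokens_b)  # ['search', 'engines']
```

In order A the opening tag has been cut into three fragments, none of which matches the pattern on its own, so "blog" survives into the token list.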
The HTML tags could also be removed in a dedicated step after the data is taken out of the database and before word segmentation, but this is inefficient, again because semantic word segmentation produces too many terms. It was therefore decided to modify the source code of the commonly used IK segmenter directly and add an HTML tag removal function to it. A flag bit can also be set so that the user can choose whether to remove the HTML tags.
Based on the above analysis, an embodiment of the present invention provides a method for removing HTML tags in a search engine. As shown in Fig. 1, the method is suitable for deployment in a word segmenter and includes:
101. Receive a data source that contains HTML tags and is formed after a user edits content on a website, and remove the HTML tags from the data source.
When entering information on a website, users most commonly use an editor, that is, a WYSIWYG text editor that can be embedded in the browser. Data source content entered this way, such as documents, therefore contains HTML tags.
102. Perform semantic word segmentation on the data source from which the HTML tags have been removed.
Removing an HTML tag simply means deleting it, which can be implemented as follows: search the data source for HTML tags according to the prescribed format of HTML tags, and remove them.
More specifically: read the data source containing HTML tags into a reusable string text buffer, and in that buffer perform regular-expression matching according to the prescribed format of HTML tags to remove the HTML tags from the data source.
Because the prescribed format of an HTML tag is delimited by < and >, HTML tags are easy to find and delete according to this format.
103. Store the segmented content in the search database of the website server.
In the method for removing HTML tags in a search engine provided by this embodiment, the tags are removed from the data source containing HTML tags before it is processed and stored in the search database. This solves the technical problem in the prior art that search results do not contain the keyword because the data source contains HTML tags, and thereby eliminates the influence of HTML tags on retrieval results while remaining generally applicable.
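The regular-expression removal in this embodiment can be sketched as follows. The text only says that a tag is delimited by < and >, so the pattern below is one assumed reading of that "prescribed format", and the function name `remove_html_tags` is hypothetical.

```python
import re

# Assumed reading of the prescribed tag format:
# '<', then any characters other than '>', then '>'.
TAG_PATTERN = re.compile(r"<[^>]*>")

def remove_html_tags(data_source: str) -> str:
    """Delete every <...> span and collapse the whitespace left behind."""
    text = TAG_PATTERN.sub(" ", data_source)  # a space keeps adjacent words apart
    return " ".join(text.split())

print(remove_html_tags('<p class="blog context">hello <b>world</b></p>'))  # hello world
```

A pattern this simple has known limits: it does not decode entities such as `&amp;`, it keeps the text inside `<script>` or `<style>` elements, and it would also delete visible text that happens to sit between a literal `<` and a later `>`. Whether those cases matter depends on what the rich-text editor emits.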
An embodiment of the present invention further provides a method for removing HTML tags in a search engine. As shown in Fig. 2, the method includes:
201. A data source that contains HTML tags and is formed after a user edits content on a website is transmitted over the network to the website server.
202. Before the word segmenter performs semantic word segmentation on the data source, send the user a prompt asking whether to remove the HTML tags from the data source.
For example, in the IK segmenter, the HTML tags are removed before semantic segmentation is performed. The segmenter accepts a flag bit passed in through the adapter, so that the user can decide independently whether to remove the HTML tags.
203. If the user chooses to remove the HTML tags, the segmenter reads the data source containing HTML tags into a reusable string text buffer, and in that buffer performs regular-expression matching according to the prescribed format of HTML tags to remove the HTML tags from the data source.
When the IK segmenter is first invoked, it needs to load the text input stream (that is, the data source). At this point the text input stream is read into the reusable string text buffer, and word segmentation is performed after regular-expression matching has removed the HTML tags.
204. The segmenter performs semantic word segmentation on the data source from which the HTML tags have been removed.
205. Store the segmented content in the search database of the website server.
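Steps 202 and 203 can be sketched as a small loader that owns one buffer and a user-controlled flag. This is an illustration under stated assumptions, not the modified IK source: the class name `TagStrippingLoader`, the `strip_tags` flag, and the chunk size are all hypothetical, and Python's `io.StringIO` stands in for the reusable string text buffer.

```python
import io
import re

TAG_PATTERN = re.compile(r"<[^>]*>")  # assumed tag format

class TagStrippingLoader:
    """Hypothetical helper: loads a text stream into one buffer reused across calls."""

    def __init__(self, strip_tags: bool = True, chunk_size: int = 4096):
        self.strip_tags = strip_tags    # flag bit chosen by the user
        self.chunk_size = chunk_size
        self._buffer = io.StringIO()    # allocated once, reused for every stream

    def load(self, stream) -> str:
        # Reset the existing buffer instead of allocating a new one.
        self._buffer.seek(0)
        self._buffer.truncate(0)
        while True:
            chunk = stream.read(self.chunk_size)
            if not chunk:
                break
            self._buffer.write(chunk)
        text = self._buffer.getvalue()
        if self.strip_tags:
            # Match on the whole buffered text so that a tag split across
            # two chunks is still removed.
            text = TAG_PATTERN.sub(" ", text)
        return text

loader = TagStrippingLoader(strip_tags=True, chunk_size=8)
print(loader.load(io.StringIO('<p class="blog">hi</p>')).split())  # ['hi']
```

Matching only after the whole stream is buffered is the design point: with a chunk size of 8, the opening tag above arrives in three pieces, and a per-chunk match would miss it.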
For example, the data source that the user enters on the website is:
<p class="blog context">have deep love for China and have deep love for party</p>
After this method is applied, the result after word segmentation is:
have deep love for China and have deep love for party
(without HTML tags)
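The whole flow of steps 203 to 205 on this example can be sketched end to end. Everything named here is an assumption for illustration: `segment` is a word-character split standing in for semantic word segmentation (a real deployment would call a segmenter such as IK), and `SearchStore` is an in-memory inverted index standing in for the website server's search database.

```python
import re
from collections import defaultdict

TAG_PATTERN = re.compile(r"<[^>]*>")  # assumed tag format

def strip_tags(data_source):
    """Remove <...> spans from the raw data source."""
    return TAG_PATTERN.sub(" ", data_source)

def segment(text):
    """Stand-in for semantic word segmentation: lowercase runs of word characters."""
    return [w.lower() for w in re.findall(r"\w+", text)]

class SearchStore:
    """Hypothetical in-memory stand-in for the search database."""

    def __init__(self):
        self.index = defaultdict(set)   # term -> ids of documents containing it

    def add(self, doc_id, terms):
        for term in terms:
            self.index[term].add(doc_id)

    def search(self, term):
        return set(self.index.get(term.lower(), set()))

def index_document(store, doc_id, data_source):
    """Strip tags, then segment, then store, in that order."""
    terms = segment(strip_tags(data_source))
    store.add(doc_id, terms)
    return terms

store = SearchStore()
terms = index_document(
    store, 1,
    '<p class="blog context">have deep love for China and have deep love for party</p>',
)
print(store.search("china"))  # {1}
print(store.search("blog"))   # set()
```

Because the tags are stripped before anything reaches the index, a query for "blog" no longer matches this document.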
To facilitate implementation of the above method, this embodiment also provides an apparatus for removing HTML tags in a search engine. As shown in Fig. 3, the apparatus includes:
a removal module 31, configured to receive a data source that contains HTML tags and is formed after a user edits content on a website, and to remove the HTML tags from the data source;
a word segmentation module 32, configured to perform semantic word segmentation on the data source from which the HTML tags have been removed; and
a storage module 33, configured to store the segmented content in the search database of the website server.
Here, "before processing" specifically means before the word segmenter performs semantic word segmentation.
Optionally, the removal module 31 is specifically configured to search the data source for HTML tags according to the prescribed format of HTML tags and to remove the HTML tags.
Optionally, the removal module 31 is specifically configured to read the data source containing HTML tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of HTML tags to remove the HTML tags from the data source.
Optionally, the apparatus further includes a prompting module, configured to send the user a prompt asking whether to remove the HTML tags from the data source.
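The module structure of Fig. 3 can be mirrored in code. The sketch below is only one possible arrangement, with hypothetical class names. The prompting module is reduced to a callable that returns the user's choice, the word segmentation module splits on whitespace as a placeholder, and the storage module keeps a dictionary instead of a real search database.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

class RemovalModule:
    """Removes <...> spans (assumed tag format) from the data source."""
    _TAG = re.compile(r"<[^>]*>")

    def run(self, data_source: str) -> str:
        return self._TAG.sub(" ", data_source)

class SegmentationModule:
    """Placeholder for semantic word segmentation: splits on whitespace."""

    def run(self, text: str) -> List[str]:
        return text.split()

@dataclass
class StorageModule:
    """Stand-in for the search database: word -> ids of documents containing it."""
    database: Dict[str, Set[int]] = field(default_factory=dict)

    def run(self, doc_id: int, words: List[str]) -> None:
        for word in words:
            self.database.setdefault(word, set()).add(doc_id)

@dataclass
class Apparatus:
    prompt: Callable[[], bool]  # prompting module: returns True to remove tags
    removal: RemovalModule = field(default_factory=RemovalModule)
    segmentation: SegmentationModule = field(default_factory=SegmentationModule)
    storage: StorageModule = field(default_factory=StorageModule)

    def process(self, doc_id: int, data_source: str) -> List[str]:
        # Tags are removed before segmentation, and only if the user agreed.
        text = self.removal.run(data_source) if self.prompt() else data_source
        words = self.segmentation.run(text)
        self.storage.run(doc_id, words)
        return words

apparatus = Apparatus(prompt=lambda: True)
print(apparatus.process(1, "<p>hello world</p>"))  # ['hello', 'world']
```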
The apparatus for removing HTML tags in a search engine provided by this embodiment removes the tags from the data source containing HTML tags before it is processed and stored in the search database. This solves the technical problem in the prior art that search results do not contain the keyword because the data source contains HTML tags, and thereby eliminates the influence of HTML tags on retrieval results while remaining generally applicable.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiment described above is merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or in hardware combined with software functional units.
An integrated unit implemented as a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the above division into functional modules is used as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working process of the apparatus described above, refer to the corresponding process in the foregoing method embodiments; it is not repeated here.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method for removing HTML tags in a search engine, characterized by including:
receiving a data source that contains HTML tags and is formed after a user edits content on a website, and removing the HTML tags from the data source;
performing semantic word segmentation on the data source from which the HTML tags have been removed; and
storing the segmented content in a search database of the website server.
2. The method according to claim 1, characterized in that removing the HTML tags from the data source includes:
searching the data source for HTML tags according to the prescribed format of HTML tags, and removing the HTML tags.
3. The method according to claim 2, characterized in that searching the data source for HTML tags according to the prescribed format of HTML tags and removing the HTML tags specifically includes:
reading the data source containing HTML tags into a reusable string text buffer, and performing regular-expression matching in that buffer according to the prescribed format of HTML tags to remove the HTML tags from the data source.
4. The method according to any one of claims 1-3, characterized in that the method further includes:
sending the user a prompt message asking the user to choose whether to remove the HTML tags from the data source.
5. An apparatus for removing HTML tags in a search engine, characterized by including:
a removal module, configured to receive a data source that contains HTML tags and is formed after a user edits content on a website, and to remove the HTML tags from the data source;
a word segmentation module, configured to perform semantic word segmentation on the data source from which the HTML tags have been removed; and
a storage module, configured to store the segmented content in a search database of the website server.
6. The apparatus according to claim 5, characterized in that the removal module is specifically configured to search the data source for HTML tags according to the prescribed format of HTML tags and to remove the HTML tags.
7. The apparatus according to claim 6, characterized in that the removal module is specifically configured to read the data source containing HTML tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of HTML tags to remove the HTML tags from the data source.
8. The apparatus according to any one of claims 5-7, characterized in that the apparatus further includes:
a prompting module, configured to send the user a prompt asking whether to remove the HTML tags from the data source.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610222050.3A CN105912661A (en) | 2016-04-11 | 2016-04-11 | Method and apparatus for removing html tag from search engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610222050.3A CN105912661A (en) | 2016-04-11 | 2016-04-11 | Method and apparatus for removing html tag from search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105912661A (en) | 2016-08-31 |
Family
ID=56744910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610222050.3A Pending CN105912661A (en) | 2016-04-11 | 2016-04-11 | Method and apparatus for removing html tag from search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105912661A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101308498A (en) * | 2008-07-03 | 2008-11-19 | 上海交通大学 | Text collection visualized system |
CN102779169A (en) * | 2012-06-27 | 2012-11-14 | 江苏新瑞峰信息科技有限公司 | Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label |
CN104268283A (en) * | 2014-10-21 | 2015-01-07 | 浪潮集团有限公司 | Method for automatically analyzing Internet web page |
CN104484387A (en) * | 2014-12-10 | 2015-04-01 | 北京奇虎科技有限公司 | Method for carrying out searching in browser and browser device |
CN105183801A (en) * | 2015-08-25 | 2015-12-23 | 北京信息科技大学 | Web page body text extraction method and apparatus |
- 2016-04-11 CN CN201610222050.3A patent/CN105912661A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160831 |