CN105912661A - Method and apparatus for removing html tag from search engine - Google Patents

Info

Publication number
CN105912661A
CN105912661A CN201610222050.3A CN201610222050A CN105912661A CN 105912661 A CN105912661 A CN 105912661A CN 201610222050 A CN201610222050 A CN 201610222050A CN 105912661 A CN105912661 A CN 105912661A
Authority
CN
China
Prior art keywords
html label
data source
html
label
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610222050.3A
Other languages
Chinese (zh)
Inventor
谢晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd and LeTV Information Technology Beijing Co Ltd
Priority to CN201610222050.3A
Publication of CN105912661A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a method and apparatus for removing html tags in a search engine. The method comprises: removing the html tags from a data source, which is formed after a user edits content on a website and contains html tags, before the website server processes the data source; performing semantic word segmentation on the data source from which the html tags have been removed; and storing the segmented content in a search database maintained by the website.

Description

Method and device for removing html tags in a search engine
Technical field
The present invention relates to the technical field of web search, and in particular to a method for removing Hypertext Markup Language (html) tags in a search engine.
Background art
When a user searches for content, some search systems return results that do not match the entered keyword. For example, a search for the keyword "blog" returns many results, yet the word "blog" does not appear anywhere in the visible content of those results.
This happens because part of the searchable content in the data source is content that users edit with a rich text editor, and this content is stored in the database as the data source. In other words, this part of the content carries html markup. If the class attribute of some tag happens to contain "blog", the record is returned by the search.
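The mismatch described above can be reproduced with a small sketch. The record, the `blog-body` class name, and the helper names `naive_terms` and `visible_text` are all hypothetical, and splitting on non-alphanumeric characters is only a stand-in for whatever indexing a real search system performs.

```python
import re

# Hypothetical record, as a rich text editor might store it in the database.
RAW_RECORD = '<div class="blog-body"><p>Notes on a weekend hike</p></div>'


def naive_terms(text):
    """Index terms produced with no awareness of markup: split on non-alphanumerics."""
    return {t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t}


def visible_text(html):
    """Rough approximation of what a reader sees: drop everything between < and >."""
    return re.sub(r"<[^>]*>", " ", html)


# "blog" becomes an index term although it occurs only inside the class attribute.
print("blog" in naive_terms(RAW_RECORD))
# The text a reader actually sees does not contain it.
print("blog" in naive_terms(visible_text(RAW_RECORD)))
```

A search for "blog" against the first set of terms would return this record, which is the kind of result the background section describes as not matching the keyword.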
The prior art uses the solr search engine to address this problem. The solr approach is to configure the removal of html tags from a given field when data is fetched from the database. However, the official solr release is updated frequently, with a new version roughly every one or two months. The versions are not very stable, and in some versions the transformer fails. In addition, this solution applies only to this one search engine, so its applicability is too narrow; the problem still occurs with many other search engines.
Summary of the invention
Embodiments of the present invention provide a method and device for removing html tags in a search engine, which can remove the html tags in a data source and are general-purpose.
One aspect of the embodiments of the present invention provides a method for removing html tags in a search engine, including:
receiving a data source containing html tags, formed after a user edits content on a website, and removing the html tags from the data source;
performing semantic word segmentation on the data source from which the html tags have been removed;
storing the segmented content in the search database of the website server.
Optionally, removing the html tags from the data source includes:
searching the data source for html tags according to the prescribed format of html tags, and removing the html tags.
Optionally, searching the data source for html tags according to the prescribed format of html tags and removing the html tags specifically includes:
reading the data source containing html tags into a reusable string text buffer, and performing regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
Optionally, the method further includes:
sending the user a prompt message asking the user to choose whether to remove the html tags from the data source.
Another aspect of the embodiments of the present invention provides a device for removing html tags in a search engine, including:
a removal module, configured to receive a data source containing html tags, formed after a user edits content on a website, and to remove the html tags from the data source;
a word segmentation module, configured to perform semantic word segmentation on the data source from which the html tags have been removed;
a storage module, configured to store the segmented content in the search database of the website server.
Optionally, the removal module is specifically configured to search the data source for html tags according to the prescribed format of html tags, and to remove the html tags.
Optionally, the removal module is specifically configured to read the data source containing html tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
Optionally, the device further includes:
a prompting module, configured to send the user a prompt message asking whether to remove the html tags from the data source.
The method and device for removing html tags in a search engine provided by the embodiments of the present invention remove the tags from a data source containing html tags before it is stored in the search database. This solves the technical problem in the prior art that html tags in the data source cause search results that do not contain the keyword, eliminates the influence of html tags on retrieval results, and achieves the technical effect of strong generality.
Brief description of the drawings
Fig. 1 is a flowchart of a method for removing html tags in a search engine provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method for removing html tags in a search engine provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for removing html tags in a search engine provided by an embodiment of the present invention.
Detailed description of the embodiments
To address the problem in the prior art, html tags can be removed with a regular expression. The key question is at which step to do so. Broadly, removal can be done either at indexing time or at search time, and doing it at indexing time is of course the most efficient. The search engine in use at the time was solr. solr does have a regular-expression filtering function, but that step runs after word segmentation, and once the text has been segmented the html tags have been scattered by the semantic segmentation, so a regular expression can no longer be applied. Because of this, the solutions built into existing search engines cannot solve the problem. Chinese is comparatively complex; software developed abroad generally assumes that a segmenter splits text on spaces, an approach that does not suit Chinese.
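The ordering problem can be illustrated with a short sketch. It assumes a stand-in segmenter that keeps runs of word characters and discards punctuation; this is not how solr or any particular segmenter works internally, and the function names are hypothetical. The point is only that once the angle brackets are gone, a tag pattern has nothing left to match.

```python
import re

# Anything between < and > is treated as a tag in this sketch.
TAG = re.compile(r"<[^<>]+>")


def stand_in_segmenter(text):
    """Stand-in for a semantic segmenter: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", text)


def filter_after_segmentation(text):
    """Segment first, then try to drop tokens that look like tags."""
    tokens = stand_in_segmenter(text)
    return [t for t in tokens if not TAG.search(t)]


def strip_before_segmentation(text):
    """Remove tags from the raw text first, then segment what is left."""
    return stand_in_segmenter(TAG.sub(" ", text))


raw = '<p class="blog context">hello world</p>'
print(filter_after_segmentation(raw))
print(strip_before_segmentation(raw))
```

With this input, filtering after segmentation still leaves "p", "class", "blog" and "context" among the tokens, while stripping first yields only "hello" and "world".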
Removing the html tags in a dedicated step after the data is fetched from the database and before segmentation would also work, but it is inefficient, which is likewise attributed to the large number of tokens produced by semantic segmentation. The approach taken is therefore to modify the source code of the commonly used IK segmenter directly, adding an html tag removal function to it. A flag can also be set so that the user can choose whether to remove html tags.
Based on the above analysis, an embodiment of the present invention provides a method for removing html tags in a search engine. As shown in Fig. 1, the method is suitable for deployment in a segmenter and includes:
101. Receive a data source containing html tags, formed after a user edits content on a website, and remove the html tags from the data source.
Users commonly enter information on a website through an editor, a WYSIWYG text editor that can be embedded in the browser. The data source content that the user enters, such as a document, will therefore contain html tags.
102. Perform semantic word segmentation on the data source from which the html tags have been removed.
Removing an html tag means deleting that html tag, which can be implemented as follows:
searching the data source for html tags according to the prescribed format of html tags, and removing the html tags.
More specifically: reading the data source containing html tags into a reusable string text buffer, and performing regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
Because the prescribed format of an html tag is enclosed in <>, html tags are easy to find according to this format and delete.
103. Store the segmented content in the search database of the website server.
The method for removing html tags in a search engine provided by this embodiment removes the tags from a data source containing html tags before it is processed and stored in the search database. This solves the technical problem in the prior art that html tags in the data source cause search results that do not contain the keyword, eliminates the influence of html tags on retrieval results, and achieves the technical effect of strong generality.
An embodiment of the present invention further provides a method for removing html tags in a search engine. As shown in Fig. 2, the method includes:
201. The data source containing html tags, formed after the user edits content on the website, is transmitted over the network to the website server.
202. Before the segmenter performs semantic word segmentation on the data source, a prompt message is sent to the user asking whether to remove the html tags from the data source.
For example, in the IK segmenter, html tag removal is performed before semantic segmentation. The segmenter accepts a flag passed in through the adapter, so the user can decide independently whether to remove html tags.
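The flag can be sketched as a configuration value that the segmenter consults before segmentation. This is not the IK segmenter's actual interface: `SegmenterConfig`, `strip_html`, and `segment` are hypothetical names, and the word-character split stands in for semantic segmentation.

```python
import re
from dataclasses import dataclass

_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


@dataclass
class SegmenterConfig:
    # Hypothetical flag: True means html tags are removed before segmentation.
    strip_html: bool = True


def segment(text, config):
    """Optionally strip tags, then split into tokens (stand-in for semantic segmentation)."""
    if config.strip_html:
        text = _TAG.sub(" ", text)
    return re.findall(r"\w+", text)


sample = '<b class="blog">hi</b>'
print(segment(sample, SegmenterConfig(strip_html=True)))
print(segment(sample, SegmenterConfig(strip_html=False)))
```

Keeping the choice in one flag means the same segmenter can serve fields that hold rich text and fields where angle brackets are meaningful content.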
203. If the user chooses to remove the html tags, the segmenter reads the data source containing html tags into a reusable string text buffer and performs regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
When the IK segmenter is first invoked, it needs to load the text input stream (that is, the data source). At this point the text input stream is read into a reusable string text buffer, regular-expression matching removes the html tags, and word segmentation is then carried out.
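One way to picture the reusable buffer is sketched below, under the assumption that the input arrives as a text stream read in chunks. The class `HtmlStrippingLoader` and its `load` method are hypothetical and do not correspond to any IK segmenter class; `io.StringIO` plays the role of the string text buffer.

```python
import io
import re

_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


class HtmlStrippingLoader:
    """Reads a text stream into one buffer that is reset and reused on every call."""

    def __init__(self):
        self._buffer = io.StringIO()

    def load(self, stream, chunk_size=4096):
        # Reset the buffer instead of allocating a new one for each document.
        self._buffer.seek(0)
        self._buffer.truncate(0)
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            self._buffer.write(chunk)
        # Match on the complete text so a tag split across two chunks is still found.
        return _TAG.sub(" ", self._buffer.getvalue())


loader = HtmlStrippingLoader()
print(loader.load(io.StringIO('<p class="blog">first</p>'), chunk_size=5).split())
print(loader.load(io.StringIO("<i>second</i>")).split())
```

The regular expression is applied only after the whole stream has been buffered; applying it chunk by chunk would miss any tag that straddles a chunk boundary.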
204. The segmenter performs semantic word segmentation on the data source from which the html tags have been removed.
205. Store the segmented content in the search database of the website server.
For example, the data source that the user enters on the website is:
<p class="blog context">have deep love for China and have deep love for party</p>
After this method is applied, the result of segmentation is:
Have deep love for China and have deep love for party
(without html tags)
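The example can be traced through all three steps (remove tags, segment, store) with the sketch below. Everything here is a stand-in: `SearchDatabase` is an in-memory inverted index rather than a real search database, `segment` is a lowercase whitespace split rather than semantic segmentation, and all names are hypothetical.

```python
import re
from collections import defaultdict

TAG = re.compile(r"</?[A-Za-z][^<>]*>")


class SearchDatabase:
    """In-memory stand-in for the website's search database (an inverted index)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def store(self, doc_id, terms):
        for term in terms:
            self._postings[term].add(doc_id)

    def search(self, keyword):
        return set(self._postings.get(keyword.lower(), set()))


def remove_tags(source):
    """Step 1: delete html tags from the raw data source."""
    return TAG.sub(" ", source)


def segment(text):
    """Step 2: placeholder for semantic word segmentation."""
    return text.lower().split()


def index_document(db, doc_id, source):
    """Step 3: store the segmented, tag-free content."""
    db.store(doc_id, segment(remove_tags(source)))


db = SearchDatabase()
index_document(
    db, 1,
    '<p class="blog context">have deep love for China and have deep love for party</p>',
)
print(db.search("blog"))
print(db.search("china"))
```

Because the tags are removed before anything is indexed, a search for "blog" finds nothing for this record, while a word from the visible text such as "china" still finds it.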
To facilitate implementation of the above method, this embodiment also provides a device for removing html tags in a search engine. As shown in Fig. 3, the device includes:
a removal module 31, configured to receive a data source containing html tags, formed after a user edits content on a website, and to remove the html tags from the data source;
a word segmentation module 32, configured to perform semantic word segmentation on the data source from which the html tags have been removed;
a storage module 33, configured to store the segmented content in the search database of the website server.
Here, "before processing" specifically means before the segmenter performs semantic word segmentation.
Optionally, the removal module 31 is specifically configured to search the data source for html tags according to the prescribed format of html tags, and to remove the html tags.
Optionally, the removal module 31 is specifically configured to read the data source containing html tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
Optionally, the device further includes a prompting module, configured to send the user a prompt message asking whether to remove the html tags from the data source.
The device for removing html tags in a search engine provided by this embodiment removes the tags from a data source containing html tags before it is processed and stored in the search database. This solves the technical problem in the prior art that html tags in the data source cause search results that do not contain the keyword, eliminates the influence of html tags on retrieval results, and achieves the technical effect of strong generality.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the functional modules above is used as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the device described above, refer to the corresponding process in the foregoing method embodiments; it is not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for removing html tags in a search engine, characterized by including:
receiving a data source containing html tags, formed after a user edits content on a website, and removing the html tags from the data source;
performing semantic word segmentation on the data source from which the html tags have been removed;
storing the segmented content in the search database of the website server.
2. The method according to claim 1, characterized in that removing the html tags from the data source includes:
searching the data source for html tags according to the prescribed format of html tags, and removing the html tags.
3. The method according to claim 2, characterized in that searching the data source for html tags according to the prescribed format of html tags and removing the html tags specifically includes:
reading the data source containing html tags into a reusable string text buffer, and performing regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
4. The method according to any one of claims 1-3, characterized in that the method further includes:
sending the user a prompt message asking the user to choose whether to remove the html tags from the data source.
5. A device for removing html tags in a search engine, characterized by including:
a removal module, configured to receive a data source containing html tags, formed after a user edits content on a website, and to remove the html tags from the data source;
a word segmentation module, configured to perform semantic word segmentation on the data source from which the html tags have been removed;
a storage module, configured to store the segmented content in the search database of the website server.
6. The device according to claim 5, characterized in that the removal module is specifically configured to search the data source for html tags according to the prescribed format of html tags, and to remove the html tags.
7. The device according to claim 6, characterized in that the removal module is specifically configured to read the data source containing html tags into a reusable string text buffer, and to perform regular-expression matching in that buffer according to the prescribed format of html tags to remove the html tags from the data source.
8. The device according to any one of claims 5-7, characterized in that the device further includes:
a prompting module, configured to send the user a prompt message asking whether to remove the html tags from the data source.
CN201610222050.3A 2016-04-11 2016-04-11 Method and apparatus for removing html tag from search engine Pending CN105912661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610222050.3A CN105912661A (en) 2016-04-11 2016-04-11 Method and apparatus for removing html tag from search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610222050.3A CN105912661A (en) 2016-04-11 2016-04-11 Method and apparatus for removing html tag from search engine

Publications (1)

Publication Number Publication Date
CN105912661A true CN105912661A (en) 2016-08-31

Family

ID=56744910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610222050.3A Pending CN105912661A (en) 2016-04-11 2016-04-11 Method and apparatus for removing html tag from search engine

Country Status (1)

Country Link
CN (1) CN105912661A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308498A (en) * 2008-07-03 2008-11-19 上海交通大学 Text collection visualized system
CN102779169A (en) * 2012-06-27 2012-11-14 江苏新瑞峰信息科技有限公司 Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label
CN104268283A (en) * 2014-10-21 2015-01-07 浪潮集团有限公司 Method for automatically analyzing Internet web page
CN104484387A (en) * 2014-12-10 2015-04-01 北京奇虎科技有限公司 Method for carrying out searching in browser and browser device
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160831