CN108733687A - Information retrieval method and system based on text recognition - Google Patents

Info

Publication number
CN108733687A
Authority
CN
China
Prior art keywords
language
linear structure
word
content
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710251901.1A
Other languages
Chinese (zh)
Inventor
陈伯妤
姜蓓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710251901.1A
Publication of CN108733687A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention disclose an information retrieval method and system based on text recognition. The method includes: photographing text content and sending the captured photo to the cloud; the cloud performs optical character recognition (OCR) on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content. With embodiments of the present invention, users can search quickly and conveniently and play related multimedia information.

Description

Information retrieval method and system based on text recognition
Technical Field
The present invention relates to the field of search, and more particularly to an information retrieval method and system based on text recognition.
Background
Information retrieval (Information Retrieval) refers to the process and techniques of organizing information in a certain way and finding relevant information according to the needs of information users. Information retrieval in the narrow sense is the latter half of that process, namely finding the required information in an information collection, which is what is commonly called information search (Information Search or Information Seek).
Commonly used information retrieval methods include the conventional method, the tracing method, and the segmented method. The conventional method searches for documents using retrieval tools such as bibliographies, abstracts, and indexes. The key to using this method is familiarity with the nature, characteristics, and search procedure of each retrieval tool, so that searches can be made from different angles. The conventional method can be further divided into chronological search and reverse-chronological search. Chronological search works from the past to the present; it is costly and inefficient. Reverse-chronological search works from the recent past backward; it emphasizes recent material and current information, is more proactive, and gives better results. The tracing method repeatedly follows the references attached to existing documents; when no retrieval tool is available or the available tools are incomplete, this method can obtain highly targeted documents.
Since the beginning of the 21st century, with the rapid development of the Internet and the acceleration of world economic integration, the volume of network information has expanded sharply and international exchange has become increasingly frequent. Retrieving information over the network to help people obtain information quickly has become an inevitable trend.
Summary of the Invention
Embodiments of the present invention propose an information retrieval method and system based on text recognition, so that information can be conveniently fed back to the user.
The technical solution of the embodiments of the present invention is implemented as follows:
An information retrieval method based on text recognition, the method including: photographing text content and sending the captured photo to the cloud; the cloud performing optical character recognition on the captured photo to convert it into text content, retrieving related content based on the text content, and pushing a retrieval result containing the related content.
In one embodiment, photographing text content and sending the captured photo to the cloud comprises: photographing the text content with a wearable device, and sending the captured photo to the cloud from the wearable device;
retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the wearable device;
the method further includes: the wearable device playing the audio file or video file.
In one embodiment, photographing text content and sending the captured photo to the cloud comprises: photographing the text content with a mobile terminal, and sending the captured photo to the cloud from the mobile terminal;
retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal;
the method further includes: the mobile terminal playing the audio file or video file.
In one embodiment, the method further includes, in advance: the cloud splitting document-level text into character strings at symbols, extracting language linear structures and language blocks from the resulting character strings, organizing the extracted language linear structures and language blocks separately, creating a language linear structure sub-index and a language block sub-index, and merging the two sub-indexes to form an overall index. Retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content.
In one embodiment, splitting document-level text into character strings at symbols comprises: using a document-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the document-level text to UTF-8; and splitting the UTF-8 document-level text into character strings at symbols. Pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content comprises: pushing the matching information in descending order of the matching degree of the language linear structure and language blocks, where the more characters the language linear structure extracted from the text content shares with a language linear structure in the overall index, the higher the matching degree.
In one embodiment, the method further includes: presetting a language linear structure repetition weight and a language block repetition weight; calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the text content and the language linear structure in the overall index; and calculating, based on the language block repetition weight, a second overlap index between the language blocks extracted from the text content and the language blocks in the overall index. The higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
An information retrieval system based on text recognition, the system including: a photographing device, configured to photograph text content and send the captured photo to the cloud; and a retrieval system located in the cloud, configured to perform optical character recognition on the captured photo to convert it into text content, retrieve related content based on the text content, and push a retrieval result containing the related content to the photographing device.
In one embodiment, the photographing device is a wearable device; the retrieval system is configured to retrieve a related audio file or video file based on the text content and send the audio file or video file to the wearable device; the wearable device is further configured to play the audio file or video file.
In one embodiment, the photographing device is a mobile terminal; the retrieval system is configured to retrieve a related audio file or video file based on the text content and send the audio file or video file to the mobile terminal; the mobile terminal is further configured to play the audio file or video file.
In one embodiment, the retrieval system is further configured to, in advance, split document-level text into character strings at symbols, extract language linear structures and language blocks from the resulting character strings, organize the extracted language linear structures and language blocks separately, create a language linear structure sub-index and a language block sub-index, and merge the two sub-indexes to form an overall index. Retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content. Splitting document-level text into character strings at symbols comprises: using a document-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the document-level text to UTF-8; and splitting the UTF-8 document-level text into character strings at symbols. Pushing the matching information according to the overall index comprises: pushing it in descending order of the matching degree of the language linear structure and language blocks, where the more characters the language linear structure extracted from the text content shares with a language linear structure in the overall index, the higher the matching degree.
In embodiments of the present invention, text content is photographed and the captured photo is sent to the cloud; the cloud performs optical character recognition on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content. With embodiments of the present invention, users can search quickly and conveniently and play related multimedia information. Embodiments of the present invention thus perform retrieval based on the recognized content, which makes quick queries convenient. Furthermore, a wearable device can play the audio file or video file related to the photographed text content, so that the user obtains information intuitively. Users can also play related multimedia information through an access entrance, which is convenient to use.
Brief Description of the Drawings
Fig. 1 is a flowchart of the information retrieval method based on text recognition according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of information retrieval using wearable glasses according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of playing a retrieved multimedia file using wearable glasses according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of information retrieval based on a mobile terminal according to an embodiment of the present invention.
Fig. 5 is a flowchart of a natural language processing method based on semantic recognition according to an embodiment of the present invention.
Fig. 6 is a structural diagram of a natural language processing device based on semantic recognition according to an embodiment of the present invention.
Fig. 7 is a structural diagram of a natural language processing system based on semantic recognition according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In embodiments of the present invention, text content such as a book is photographed with a device such as a mobile phone and uploaded to the backend, which automatically recognizes it and retrieves matching books or other multimedia documents. There may be multiple retrieval results, which are displayed in descending order of relevance. The user can select a book from the result list and open it in the mobile phone browser to read, or play the corresponding audio or video on a wearable device.
Fig. 1 is a flowchart of the information retrieval method based on text recognition according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101: Photograph text content and send the captured photo to the cloud;
Step 102: The cloud performs optical character recognition (OCR) on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content.
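Steps 101 and 102 can be sketched as a small cloud-side handler. This is a minimal illustration only: the names `RetrievalResult`, `handle_photo`, `fake_ocr`, the catalog layout, and the example URL are all hypothetical, and the substring lookup is a placeholder for the retrieval described later in the specification. A real OCR engine would replace `fake_ocr`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RetrievalResult:
    title: str
    media_url: str  # hypothetical location of an audio or video file


def handle_photo(photo: bytes,
                 ocr: Callable[[bytes], str],
                 catalog: Dict[str, RetrievalResult]) -> List[RetrievalResult]:
    """Cloud side of step 102: OCR the photo, then look up related content."""
    text = ocr(photo)  # captured photo -> text content
    # Placeholder retrieval: return every entry whose key occurs in the text.
    return [result for key, result in catalog.items() if key in text]


def fake_ocr(photo: bytes) -> str:
    """Stand-in OCR (assumption) so the sketch runs without an OCR engine."""
    return photo.decode("utf-8")


catalog = {
    "peach garden": RetrievalResult("Episode 1", "https://example.invalid/ep1.mp4"),
}
# Step 101 (device side) would upload the photo; here the bytes are passed directly.
results = handle_photo("oath in the peach garden".encode("utf-8"), fake_ocr, catalog)
```

Passing the OCR function in as a parameter keeps the retrieval logic testable without committing to any particular OCR library.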
In one embodiment, photographing text content and sending the captured photo to the cloud comprises: photographing the text content with a wearable device, and sending the captured photo to the cloud from the wearable device. Retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the wearable device. The method further includes: the wearable device playing the audio file or video file.
The wearable device may be implemented as wearable glasses, a wearable watch, a wearable bracelet, a wearable belt, and so on.
In one embodiment, photographing text content and sending the captured photo to the cloud comprises: photographing the text content with a mobile terminal, and sending the captured photo to the cloud from the mobile terminal. Retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal. The method further includes: the mobile terminal playing the audio file or video file.
In one embodiment, the method further includes, in advance: the cloud splitting document-level text into character strings at symbols, extracting language linear structures and language blocks from the resulting character strings, organizing the extracted language linear structures and language blocks separately, creating a language linear structure sub-index and a language block sub-index, and merging the two sub-indexes to form an overall index. Retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content.
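The index-building steps in this embodiment (split at symbols, extract structures and blocks, build two sub-indexes, merge them) might look like the following sketch. The symbol set, the `toy_extract` helper that relies on a known list of language blocks, the `("structure" | "block", value)` key layout, and the English example sentences are all assumptions made for illustration; the specification does not fix any of them.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

SPLIT_SYMBOLS = r"[。！？；.!?;]"  # assumed set of splitting symbols


def split_by_symbols(text: str) -> List[str]:
    """Split document-level text into character strings at the symbols."""
    return [s.strip() for s in re.split(SPLIT_SYMBOLS, text) if s.strip()]


def toy_extract(sentence: str, known_blocks: List[str]) -> Tuple[str, List[str]]:
    """Replace each known language block with 'x'; the rest is the linear structure."""
    structure, found = sentence, []
    for block in sorted(known_blocks, key=len, reverse=True):
        if block in structure:
            structure = structure.replace(block, "x")
            found.append(block)
    return structure, found


def build_index(docs: Dict[str, str],
                known_blocks: List[str]) -> Dict[Tuple[str, str], Set[str]]:
    structure_index: Dict[str, Set[str]] = defaultdict(set)  # sub-index 1
    block_index: Dict[str, Set[str]] = defaultdict(set)      # sub-index 2
    for doc_id, text in docs.items():
        for sentence in split_by_symbols(text):
            structure, blocks = toy_extract(sentence, known_blocks)
            structure_index[structure].add(doc_id)
            for block in blocks:
                block_index[block].add(doc_id)
    # Merge the two sub-indexes into one overall index keyed by (kind, value).
    overall: Dict[Tuple[str, str], Set[str]] = {}
    for structure, ids in structure_index.items():
        overall[("structure", structure)] = ids
    for block, ids in block_index.items():
        overall[("block", block)] = ids
    return overall


docs = {"d1": "Avatar craze sweeps China. Stock craze sweeps the world."}
index = build_index(docs, ["Avatar", "China", "Stock", "the world"])
```

Both example sentences reduce to the same linear structure, so they share one entry in the structure sub-index while each language block gets its own entry.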
In one embodiment, splitting document-level text into character strings at symbols comprises: using a document-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the document-level text to UTF-8; and splitting the UTF-8 document-level text into character strings at symbols. Pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content comprises: pushing the matching information in descending order of the matching degree of the language linear structure and language blocks, where the more characters the language linear structure extracted from the text content shares with a language linear structure in the overall index, the higher the matching degree.
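Two pieces of this embodiment lend themselves to a short sketch: normalizing text to UTF-8, and ranking index entries by the number of repeated characters. The sketch assumes the source encoding is already known and counts shared characters with multiplicity; the specification does not say how repeated characters are counted, and the substring statistics table and segmentation path tree are not modeled here. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def repeated_chars(a: str, b: str) -> int:
    """Number of characters shared by a and b, counted with multiplicity."""
    return sum((Counter(a) & Counter(b)).values())


def rank(query_structure: str, indexed: Dict[str, str]) -> List[Tuple[str, int]]:
    """Order index entries by repeated-character count, highest first."""
    scored = [(doc_id, repeated_chars(query_structure, structure))
              for doc_id, structure in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank("x craze sweeps x",
               {"a": "x calls on x", "b": "x craze sweeps x"})
```

An identical structure shares every character with the query, so it is pushed first.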
In one embodiment, the method further includes: presetting a language linear structure repetition weight and a language block repetition weight; calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the text content and the language linear structure in the overall index; and calculating, based on the language block repetition weight, a second overlap index between the language blocks extracted from the text content and the language blocks in the overall index. The higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
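One possible reading of the weighted matching degree is sketched below. The specification states only that each overlap index is computed from its weight and that a higher sum means a higher matching degree; the concrete formulas here (weight times shared character count, weight times number of shared blocks) and the default weights 0.7 and 0.3 are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List


def repeated_chars(a: str, b: str) -> int:
    """Number of characters shared by a and b, counted with multiplicity."""
    return sum((Counter(a) & Counter(b)).values())


def matching_degree(query_structure: str, query_blocks: List[str],
                    indexed_structure: str, indexed_blocks: List[str],
                    structure_weight: float = 0.7,
                    block_weight: float = 0.3) -> float:
    # First overlap index: weighted overlap of the linear structures (assumed formula).
    first = structure_weight * repeated_chars(query_structure, indexed_structure)
    # Second overlap index: weighted count of shared language blocks (assumed formula).
    second = block_weight * len(set(query_blocks) & set(indexed_blocks))
    return first + second


same = matching_degree("x craze sweeps x", ["Avatar", "China"],
                       "x craze sweeps x", ["Avatar", "China"])
different = matching_degree("x craze sweeps x", ["Avatar", "China"],
                            "x calls on x", ["FIFA", "Ireland"])
```

Keeping the two weights as parameters mirrors the "preset" weights in the text and makes it easy to tune how much the structure counts relative to the blocks.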
Fig. 2 is a schematic diagram of information retrieval using wearable glasses according to an embodiment of the present invention. Fig. 3 is a schematic diagram of playing a retrieved multimedia file using wearable glasses according to an embodiment of the present invention. The wearable glasses shown in Fig. 2 and Fig. 3 have camera and networking functions as well as a media playback function. Using the camera function, the wearable glasses photograph text content; using the networking function, they send the captured photo to the cloud; using the media playback function, they play the related audio file or video file retrieved based on the text content.
In one embodiment, a correspondence between text content and chapter segments of multimedia information is established in the cloud. After performing optical character recognition on the captured photo to convert it into text content, the cloud retrieves the corresponding chapter segment based on the text content and pushes a retrieval result containing that chapter segment. This lets the user quickly locate and play the content of real interest.
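The correspondence between text content and chapter segments can be pictured as a lookup table. In this sketch, the `ChapterSegment` fields, the file name, and the containment test used for matching are hypothetical; the specification does not describe how the correspondence is stored or matched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChapterSegment:
    chapter_text: str   # text of the chapter that the segment covers
    media_file: str     # hypothetical audio or video file name
    start_seconds: int  # hypothetical offset of the segment in that file


def find_segment(recognized_text: str,
                 segments: List[ChapterSegment]) -> Optional[ChapterSegment]:
    """Return the first segment whose chapter text contains the recognized text."""
    for segment in segments:
        if recognized_text in segment.chapter_text:
            return segment
    return None


segments = [
    ChapterSegment("the three heroes swear brotherhood in the peach garden",
                   "episode_01.mp4", 0),
    ChapterSegment("the battle at the pass", "episode_05.mp4", 120),
]
hit = find_segment("swear brotherhood", segments)
```

Returning the segment rather than the whole file is what allows playback to start at the passage the user photographed.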
For example, a user wearing wearable glasses has a copy of 《Romance of the Three Kingdoms》 in front of them. The wearable glasses photograph the passage "The Emperor collapsed in fright; his attendants rushed him into the palace, and all the officials fled. In a moment the snake vanished. Suddenly there was great thunder and heavy rain, with hail, which did not stop until midnight, and countless houses were destroyed." and send the captured photo to the cloud. The cloud performs optical character recognition on the captured photo to obtain the text content (that is, the passage quoted above). The cloud then searches based on the text content, determines that it comes from the first chapter of 《Romance of the Three Kingdoms》, "Three heroes swear brotherhood at a feast in the Peach Garden; the heroes win their first merit by slaying Yellow Turbans", and pushes to the wearable glasses a retrieval result containing the related multimedia content of the first episode of the television series 《Romance of the Three Kingdoms》. After receiving the multimedia content, the wearable glasses can play the video of the first episode.
Fig. 4 is a schematic diagram of information retrieval based on a mobile terminal according to an embodiment of the present invention. For mobile terminals, a WeChat official account or a standalone app can serve as the service entrance; this makes full use of WeChat's large user base and reduces initial promotion costs. An independently maintained query service is built on the backend, providing book content retrieval and basic book information maintenance.
In Fig. 4, the book query and retrieval system provides a PC-side business management system that lets system administrators log in to maintain and enter book information. Basic book information includes the title, author, publication date, ISBN, book barcode value, chapter table of contents, brief introduction, book cover, and so on. The book query and retrieval system mainly captures book information from the book management system through customized scheduled tasks, including basic book information and book content information. The client follows the WeChat official account (provisionally named "Palm Reading"), enters the "scan barcode to find a book" and "take a photo to find a book" functions of the official account, and queries related books to read. Specifically: system management implements basic permission and user management functions. The local book information library stores basic book information for fast lookup; the basic information includes the title, author, publication date, ISBN, book barcode value, chapter table of contents, brief introduction, and book cover. The local book index library uses the file system to store book content and index information, so that the content can subsequently be full-text indexed. The book query service implements comprehensive queries of book information; the query scope is the basic information in the local book information library. The book retrieval engine implements full-text retrieval of book content information. The PC-side service management entrance provides a service management portal accessible from a PC browser, including user management, permission management, book management, and other business functions. The book information synchronization interface connects to an external book management system; it can periodically capture the latest book information, store basic book information and recognized book content information, and generate full-text index information for book content. The mobile service interface provides a general mobile service interface that handles uplink and downlink data for mobile clients, for example performing OCR on uploaded pictures. At present only the WeChat official account is connected; other mobile entrances, such as a native app or mobile web pages, can be added later. The WeChat official account provides the handheld mobile access entrance.
Moreover, every day at a scheduled time the book query system sends a book information update request to the original book system; the original book system returns the incremental book information added that day, including basic book information and book content information; the book query and retrieval system receives the incremental update and starts internal data processing. It receives the returned book information, such as book content information recognized by OCR; starts the information distribution process; captures and obtains the basic book information; performs automatic word segmentation on the OCR-recognized book content; creates book index information from the segmentation result; and stores the basic book information, book content, and book index information.
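The daily incremental update can be sketched as a store that applies each day's new books. Everything named here is hypothetical: the `BookStore` class, the dictionary keys, and the ISBN value are invented for illustration, and whitespace splitting stands in for the automatic word segmentation, which the specification does not detail.

```python
from collections import defaultdict
from typing import Dict, List, Set


class BookStore:
    """In-memory stand-in for the local book information and index libraries."""

    def __init__(self) -> None:
        self.basic_info: Dict[str, dict] = {}
        self.content: Dict[str, str] = {}
        self.index: Dict[str, Set[str]] = defaultdict(set)  # token -> ISBNs

    def apply_increment(self, new_books: List[dict]) -> None:
        """Process one day's incremental book information."""
        for book in new_books:
            isbn = book["isbn"]
            # Store the basic book information.
            self.basic_info[isbn] = {"title": book["title"],
                                     "author": book["author"]}
            # Store the (already OCR-recognized) book content.
            self.content[isbn] = book["content"]
            # Placeholder word segmentation: lowercase and split on whitespace.
            for token in book["content"].lower().split():
                self.index[token].add(isbn)


store = BookStore()
store.apply_increment([{
    "isbn": "978-0-00-000000-0",  # made-up identifier
    "title": "Romance of the Three Kingdoms",
    "author": "Luo Guanzhong",
    "content": "Oath in the Peach Garden",
}])
```

A scheduler (not shown) would call `apply_increment` once per day with whatever the upstream system returns.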
In addition, in embodiments of the present invention, core technology based on computer semantic recognition helps the computer identify more intelligently the precise meaning behind information. By analyzing information deeply and at multiple levels, the computer not only understands its encoding but also recognizes the intent the information is meant to express, so that it communicates with humans more intelligently and more naturally.
Embodiments of the present invention mainly use the technique of language linear structure plus keywords (i.e. language blocks) to accurately extract the real intent of information from the linear structure and keywords of the language. A sentence to be analyzed consists of a linear structure and keywords (i.e. language blocks). The key to semantic recognition is identifying the linear structure of the sentence. The meaning of language is hidden in the linear structure of the sentence, which is equivalent to the constant of the language. Semantics, and even meaning and thought, are all hidden in the linear structure of the sentence; by analyzing it, intent can be recognized. Keywords are equivalent to the variables of the language. When the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved, so accurate retrieval or translation results can be obtained. Moreover, structural analysis can accurately identify semantics in both bilingual and monolingual settings. By performing sentence-by-sentence linear structure plus keyword analysis on a vast body of documents, ample sentence linear structures and keywords (i.e. language blocks) can be obtained.
For example: 1. Rural tourism serves as an important component of China's tourism industry and an important support for promoting tourism development. (Example 1); 2. China's economy serves as an important component of the world economy and an important support for promoting global financial stability. (Example 2)
Analysis of the two examples above shows the following:
"Rural tourism", "China's tourism industry", and "tourism development" are the variables of Example 1, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "x serves as an important component of x and an important support for promoting x" (where x denotes a blank) is the linear structure of Example 1, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. Similarly, "China's economy", "the world economy", and "global financial stability" are the variables of Example 2, and "x serves as an important component of x and an important support for promoting x" (where x denotes a blank) is the linear structure of Example 2. The two examples have the same linear structure and differ only in their variables. "x serves as an important component of x and an important support for promoting x" can therefore be defined as a linear structure, and "rural tourism", "China's tourism industry", "tourism development", "China's economy", "the world economy", and "global financial stability" can be defined as keywords (i.e. language blocks). Some common proper nouns and/or gerunds can be designated as variables, but variables are not limited to proper nouns and/or gerunds. In some cases a variable can also be a common phrase or even a long sentence. In addition, when constants and linear structures are determined, the way of dividing a sentence may not be unique. For the division with the fewest variables, the corresponding linear structure is called the minimal linear structure. In general, the fewer the variables, the richer the information expressed by the corresponding linear structure, and the more accurate the corresponding search results.
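The idea that the division with the fewest variables yields the minimal linear structure can be sketched as follows. The candidate divisions are supplied by hand here, because the specification does not say how they are generated; the function names and the shortened English example sentence are likewise illustrative assumptions.

```python
from typing import List, Tuple


def linear_structure(sentence: str, variables: List[str]) -> str:
    """Replace each variable (language block) with the placeholder 'x'."""
    result = sentence
    # Replace longer variables first so a short one cannot split a long one.
    for variable in sorted(variables, key=len, reverse=True):
        result = result.replace(variable, "x")
    return result


def minimal_division(sentence: str,
                     candidate_divisions: List[List[str]]) -> Tuple[str, List[str]]:
    """Pick the candidate division with the fewest variables."""
    best = min(candidate_divisions, key=len)
    return linear_structure(sentence, best), best


sentence = "Rural tourism is an important component of China's tourism industry"
candidates = [
    ["Rural tourism", "important component", "China's tourism industry"],
    ["Rural tourism", "China's tourism industry"],
]
structure, variables = minimal_division(sentence, candidates)
```

With the two-variable division, more of the sentence stays in the constant part, which is the sense in which the minimal linear structure carries more information.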
As another example:
1. The Avatar craze sweeps China. (Example 3); 2. The stock-trading craze sweeps the world. (Example 4)
Analysis of the two examples above shows that "Avatar" and "China" are the variables of Example 3, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "The x craze sweeps x" (where x denotes a blank) is the linear structure of Example 3, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. Similarly, "stock-trading" and "the world" are the variables of Example 4, and "the x craze sweeps x" (where x denotes a blank) is the linear structure of Example 4. The two examples have the same linear structure and differ only in their variables. "The x craze sweeps x" can therefore be defined as a linear structure, and "Avatar", "China", "stock-trading", and "the world" can be defined as keywords (i.e. language blocks).
To illustrate further:
1. They appeal to the European Commission to treat the MET (Market Economy Treatment) application of Chinese enterprises objectively and fairly. (Example 5); 2. The International Football Union appeals to Ireland to treat the result of the World Cup qualifying match with the French team objectively and fairly. (Example 6); 3. The international community appeals to the Six-Party Talks to treat the Korea problem objectively and fairly. (Example 7); 4. China appeals to the Japanese government to treat the World War II historical problem objectively and fairly. (Example 8). Analyzing these four examples shows the following.
"They", "the European Commission" and "the MET (Market Economy Treatment) application of Chinese enterprises" are the variables of Example 5, because the meaning is largely preserved when these parts (i.e. the variables) are replaced. "x appeals to x to treat x objectively and fairly" (where x denotes a blank) is the linear structure of Example 5, that is, the constant of the language, because the meaning of the language is hidden in the linear structure. Similarly, "the International Football Union", "Ireland" and "the result of the World Cup qualifying match with the French team" are the variables of Example 6, and "x appeals to x to treat x objectively and fairly" is its linear structure. Similarly, "the international community", "the Six-Party Talks" and "the Korea problem" are the variables of Example 7, and "x appeals to x to treat x objectively and fairly" is its linear structure. Similarly, "China", "the Japanese government" and "the World War II historical problem" are the variables of Example 8, and "x appeals to x to treat x objectively and fairly" is its linear structure. In each case the meaning is largely preserved when the variables are replaced, and the meaning of the language is hidden in the linear structure.
The four examples therefore have the same linear structure and differ only in their variables. "x appeals to x to treat x objectively and fairly" (where x denotes a blank) can be defined as a linear structure, and "they", "the European Commission", "the MET (Market Economy Treatment) application of Chinese enterprises", "the International Football Union", "Ireland", "the result of the World Cup qualifying match with the French team", "the international community", "the Six-Party Talks", "the Korea problem", "China", "the Japanese government" and "the World War II historical problem" can be defined as keywords (i.e. language blocks). Based on the above analysis, by segmenting a large number of documents (including web documents, blogs, textbooks, various electronic documents, etc.) in this way, a sufficiently large linear structure library and keyword (i.e. language block) library can be obtained.
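As a rough sketch of how the two libraries might be accumulated from segmented sentences, the following Python fragment counts linear structures and collects language blocks. The names `to_structure`, `structure_library` and `block_library` are hypothetical, the sample sentences are English renderings of Examples 7 and 8, and the block annotations are given by hand; the source does not prescribe these data structures.

```python
from collections import Counter


def to_structure(sentence, blocks, slot="x"):
    """Replace the given language blocks with a slot marker (assumed helper)."""
    for block in sorted(blocks, key=len, reverse=True):
        sentence = sentence.replace(block, slot)
    return sentence


# Hand-annotated (sentence, language blocks) pairs; illustrative data only.
annotated = [
    ("the international community appeals to the Six-Party Talks "
     "to treat the Korea problem objectively and fairly",
     ["the international community", "the Six-Party Talks", "the Korea problem"]),
    ("China appeals to the Japanese government "
     "to treat the World War II historical problem objectively and fairly",
     ["China", "the Japanese government", "the World War II historical problem"]),
]

structure_library = Counter()  # linear structure -> number of occurrences
block_library = set()          # all language blocks seen so far

for sentence, blocks in annotated:
    structure_library[to_structure(sentence, blocks)] += 1
    block_library.update(blocks)

print(structure_library)
print(sorted(block_library))
```

Both sentences fall into one entry of the structure library, while each contributes three distinct entries to the block library.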
The natural language processing method based on semantic recognition according to the present invention is described in detail below. Fig. 5 is a flow chart of the natural language processing method based on semantic recognition according to an embodiment of the present invention. The flow shown in Fig. 5 can be applied on the cloud side of Fig. 1. As shown in Fig. 5, the method includes:
Step 501: Segment chapter-level text into character strings using symbols, and extract language linear structures and language blocks from the resulting character strings. Here, the chapter-level text (for example, an article or an editorial) is first segmented into several character strings using symbols, and language linear structures and language blocks are then extracted from the character strings in turn (for the specific extraction steps, refer to the example analysis above). "Chapter-level" does not imply any specific limit on the number of words. In essence, as long as there are some words and the sentence formed by these words is meaningful, the words can be considered to constitute "chapter-level" text. More specifically, the chapter-level text can be segmented into character strings according to punctuation marks such as the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, interpunct, title marks, proper noun mark, annotation mark, taboo-avoidance mark, missing-character mark, slash, marker sign, substitution mark, linking mark and/or arrow mark. For example, the text between any two punctuation marks can be extracted as a character string (at the start of an article, only one punctuation mark is needed). When determining keywords (language blocks), a chapter-based local substring statistics table (hash table) can be used as a temporary auxiliary dictionary. That is, a substring that appears in the temporary auxiliary dictionary can be determined to be a language block. Of course, a substring that does not appear in the local substring statistics table may also be determined to be a language block. A segmentation path tree based on multi-path planning can also be used as the segmentation model: character encodings such as English (ASCII), Simplified Chinese (GBK/GB18030) and Traditional Chinese (Taiwan BIG5, Hong Kong BIG5-HKSCS) are first uniformly converted to the UTF-8 encoding format before segmentation, and language blocks are extracted on the basis of multiple correct segmentation results. Once the language blocks have been extracted, the remaining part is the linear structure.
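The punctuation-based segmentation and the local substring statistics table of Step 501 can be sketched as follows. This is a minimal sketch under assumptions: `PUNCTUATION` is a small illustrative subset of the marks listed above, the functions `split_into_strings` and `local_substring_table` and their thresholds are hypothetical, and the multi-path segmentation tree is not modelled.

```python
import re
from collections import Counter

# Illustrative subset of punctuation marks; the full list in the text is longer.
PUNCTUATION = "。？！，、；：（）《》…“”.?!,;:()"


def split_into_strings(text):
    """Cut chapter-level text into character strings at punctuation marks."""
    pattern = "[" + re.escape(PUNCTUATION) + "]+"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]


def local_substring_table(strings, min_len=2, max_len=4, min_count=2):
    """Count substrings of the chapter and keep those that repeat.

    The result plays the role of a temporary auxiliary dictionary:
    repeated substrings are candidate language blocks.
    """
    counts = Counter()
    for s in strings:
        for n in range(min_len, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return {sub: c for sub, c in counts.items() if c >= min_count}


# Sample text: "I like Yantai apples. Yantai apples are sweet!"
strings = split_into_strings("我喜欢烟台苹果。烟台苹果很甜！")
table = local_substring_table(strings)
print(strings)
print(table)
```

In this sample the repeated substring "烟台苹果" (Yantai apples) survives the frequency threshold and would be a candidate language block, while substrings seen only once are dropped.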
Step 502: Invert the extracted language linear structures and language blocks respectively. Here, the inversion specifically includes: for each qualifying language block, compressing the document number, paragraph number, sentence number, word order number, HTML information and so on of the position where the language block occurs into one structure, and placing it in the living document where the language block is located. A language block can be an arbitrary character string, mainly of the following categories: dictionary entries, proper names, vocabulary internal to proper names, various phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, and so on. Likewise, for each qualifying language linear structure, the document number, paragraph number, sentence number, word order number, HTML information and so on of the position where the language linear structure occurs can be compressed into one structure and placed in the living document where the language block is located.
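One way to picture "compressing the document number, paragraph number, sentence number, word order number and HTML information into one structure" is a fixed-width binary record. The sketch below uses Python's standard `struct` module; the field widths, the single-byte HTML flag and the names `HIT`, `pack_hit` and `inverted` are assumptions made for illustration, not the layout used in the source.

```python
import struct
from collections import defaultdict

# Assumed layout: document number (uint32), paragraph number (uint16),
# sentence number (uint16), word order number (uint16), HTML flag (uint8).
HIT = struct.Struct("<IHHHB")


def pack_hit(doc_id, paragraph, sentence, position, html_flag=0):
    """Compress one occurrence of a term into a single fixed-width record."""
    return HIT.pack(doc_id, paragraph, sentence, position, html_flag)


# term (language block or linear structure) -> list of packed occurrences
inverted = defaultdict(list)
inverted["Yantai"].append(pack_hit(doc_id=7, paragraph=2, sentence=1, position=5))

print(HIT.size)
print(HIT.unpack(inverted["Yantai"][0]))
```

The same record format can be reused for linear structures, since the text describes the same fields for both kinds of term.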
Step 503: Create a language linear structure sub-index and a language block sub-index, and merge the language linear structure sub-index and the language block sub-index to form a whole index. Here, all language block index terms in memory are written to a language block vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a dictionary file. These three files constitute a complete, independent index run, namely the language block sub-index. Likewise, all linear structure index terms in memory are written to a linear structure vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a linear structure dictionary file. These three files constitute a complete, independent index run, namely the linear structure sub-index. Finally, the language linear structure sub-index and the language block sub-index are merged to form the whole index.
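The three-part layout of each sub-index (vocabulary, inv_lists, dictionary) and the final merge can be mimicked in memory as below. This is a simplified sketch: real index runs would be files on disk, the hits are reduced to document numbers, and the way the two sub-indexes are combined (kept side by side under two keys) is an assumption, since the text does not specify the merge format. All function names are hypothetical.

```python
from collections import defaultdict


def build_subindex(postings):
    """Build one index run from (term, doc_id) pairs.

    vocabulary : term id -> term          (stands in for the vocabulary file)
    inv_lists  : term id -> merged hits   (stands in for the inv_lists file)
    dictionary : term -> term id          (association between the two)
    """
    grouped = defaultdict(set)
    for term, doc_id in postings:
        grouped[term].add(doc_id)

    vocabulary, inv_lists, dictionary = [], [], {}
    for term, docs in grouped.items():
        dictionary[term] = len(vocabulary)
        vocabulary.append(term)
        inv_lists.append(sorted(docs))
    return {"vocabulary": vocabulary, "inv_lists": inv_lists, "dictionary": dictionary}


def merge(structure_index, block_index):
    """Combine the two sub-indexes into one whole index (assumed layout)."""
    return {"structure": structure_index, "block": block_index}


def lookup(whole_index, kind, term):
    """Return the document numbers for a term, or an empty list if unknown."""
    sub = whole_index[kind]
    term_id = sub["dictionary"].get(term)
    return [] if term_id is None else sub["inv_lists"][term_id]


block_index = build_subindex([("Yantai", 1), ("I", 1), ("Yantai", 3)])
structure_index = build_subindex([("x likes eating x", 1)])
whole_index = merge(structure_index, block_index)
print(lookup(whole_index, "block", "Yantai"))
```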
Step 504: Extract a language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index, feed back to the user the information that matches the language linear structure and language blocks extracted from the user's retrieval input. Here, the linear structure and language blocks are first extracted from the user's retrieval input character string. For example, if the user inputs "I really like eating big apples produced in Yantai.", the language blocks "I" and "big apples produced in Yantai" and the linear structure "x really like eating x" (where x is a blank) are extracted. The whole index is then searched for information matching the linear structure "x really like eating x" and the language blocks "I" and "big apples produced in Yantai", and the results are presented to the user in descending order of matching degree. In one embodiment, the more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree is considered to be. In one embodiment, a language linear structure repeat weight and a language block repeat weight can also be preset; a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index is calculated based on the language linear structure repeat weight, and a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index is calculated based on the language block repeat weight; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree. Feeding back to the user the information that matches the language linear structure and language blocks extracted from the user's retrieval input may specifically include: retrieving the language linear structure and the language blocks of the input character string in the whole index respectively, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string; and feeding back to the user the information involved in the corresponding language linear structure and the corresponding language blocks in the whole index.
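The weighted matching of Step 504 can be illustrated with a small scoring function. The sketch assumes that overlap between structures is measured by counting shared words (ignoring the slot marker) and that overlap between blocks is the number of identical blocks; the weights 2.0 and 1.0 and the names `match_score` and `shared_words` are hypothetical, since the source fixes neither the weights nor the exact overlap formula.

```python
from collections import Counter


def shared_words(a, b, slot="x"):
    """Count words two linear structures have in common, ignoring slots."""
    words_a = Counter(w for w in a.split() if w != slot)
    words_b = Counter(w for w in b.split() if w != slot)
    return sum((words_a & words_b).values())


def match_score(query_structure, query_blocks, cand_structure, cand_blocks,
                structure_weight=2.0, block_weight=1.0):
    """Sum of the first (structure) and second (block) overlap indexes."""
    first = structure_weight * shared_words(query_structure, cand_structure)
    second = block_weight * len(set(query_blocks) & set(cand_blocks))
    return first + second


query = ("x likes eating x", ["I", "big apples from Yantai"])
candidates = [
    ("x likes x", ["I", "tea"]),
    ("x likes eating x", ["She", "big apples from Yantai"]),
]

# Present candidates in descending order of matching degree.
ranked = sorted(candidates, key=lambda c: match_score(*query, *c), reverse=True)
for structure, blocks in ranked:
    print(match_score(*query, structure, blocks), structure, blocks)
```

With these assumed weights, a candidate that shares the full linear structure outranks one that only shares a language block, which matches the intent that structure overlap and block overlap are weighted separately.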
The flow of the present invention can be applied in a variety of practical applications, such as information retrieval and multilingual translation. When applied to multilingual translation, assume that the user's retrieval input character string is expressed in a first language. In this case, the language linear structure and language blocks expressed in the first language are extracted from the user's retrieval input character string; then the language linear structure and language blocks expressed in a second language that correspond to the language linear structure and language blocks expressed in the first language are determined; and, according to the whole index, information that matches the language linear structure and language blocks expressed in the second language, and that is likewise expressed in the second language, is fed back to the user. The first language can be Chinese and the second language English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc. Alternatively, the first language is English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc., and the second language is Chinese. For example, the user wishes to translate the Chinese sentence meaning "I want to go to Shanghai" into English. The retrieval input character string entered by the user is then the Chinese sentence "I want to go to Shanghai". First, the language linear structure expressed in Chinese (i.e. "x wants to go to x", where x is a blank) and the language blocks expressed in Chinese ("I", "Shanghai") are extracted from the user's retrieval input character string. Then the language linear structure expressed in English that corresponds to the Chinese language linear structure (i.e. "x want to go to x") is determined, and the language blocks expressed in English that correspond to the Chinese language blocks (i.e. "I", "Shanghai") are determined. Finally, the language blocks and the linear structure are combined into the translated sentence "I want to go to Shanghai", which is presented to the user. Furthermore, according to the whole index, information that matches the linear structure ("x want to go to x") and the language blocks ("I", "Shanghai") and is expressed in the second language can also be fed back to the user, making it easy for the user to retrieve English information related to "I want to go to Shanghai".
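The translation example can be sketched as two lookups followed by recombination. The tables `STRUCTURE_MAP` and `BLOCK_MAP` below are hypothetical, hand-built stand-ins for the correspondence between first-language and second-language structures and blocks; the sketch also assumes the slots appear in the same order in both languages, which the source does not guarantee. `{}` is used as the slot marker instead of x so that Python's `str.format` can fill the slots.

```python
# Hypothetical correspondence tables (Chinese -> English).
STRUCTURE_MAP = {"{}要去{}": "{} want to go to {}"}
BLOCK_MAP = {"我": "I", "上海": "Shanghai"}


def translate(sentence, blocks):
    """Translate by mapping the linear structure and each language block."""
    # Order the blocks as they appear in the sentence.
    ordered = sorted(blocks, key=sentence.index)

    # Replace each block with a slot to obtain the source linear structure.
    structure = sentence
    for block in ordered:
        structure = structure.replace(block, "{}", 1)

    # Look up the target structure and fill its slots with translated blocks.
    target_structure = STRUCTURE_MAP[structure]
    return target_structure.format(*(BLOCK_MAP[b] for b in ordered))


result = translate("我要去上海", ["上海", "我"])
print(result)
```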
The above process exemplarily applies a high-performance single-pass in-memory inversion algorithm that does not generate any temporary disk files. Therefore, before the memory content is output, the system has no file I/O overhead apart from MAP data. At the same time, index terms need not be numbered, and no sort operation is performed on index terms (whether as numbers or as in-memory string pointers). In addition, the method performs the inversion using all available free physical memory. These properties ensure that the inversion method has excellent time and space efficiency and can support a series of efficient methods for dynamic index merging and index updating. An inverted index with these characteristics is also fully suitable for distributed processing. Another key feature of the above process is that its lookup data structure has a caching function, which can support an almost arbitrarily large index vocabulary (i.e. the vocabulary file). The vocabulary file itself is kept on disk, and the number of index terms it can hold is unrestricted (on a 64-bit file system) and can reach several hundred million. With the caching function, the method running on an x64 server with 4 to 6 GB of memory can achieve index vocabulary query performance close to that of a cluster query system comprising several servers of the same or higher configuration. Moreover, index terms can be arbitrary character strings, mainly of the following categories (term categories): dictionary entries, proper names, vocabulary internal to proper names, various phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, and so on.
Based on the above analysis, an embodiment of the present invention also proposes a natural language processing device based on semantic recognition.
Fig. 6 is a structure diagram of the natural language processing device based on semantic recognition according to an embodiment of the present invention. The device can be applied to the cloud of Fig. 1.
As shown in Fig. 6, the device includes an extraction unit 601, an inversion unit 602, an indexing unit 603 and a matching information feedback unit 604, wherein:
The extraction unit 601 is used to segment chapter-level text into character strings using symbols, and to extract language linear structures and language blocks from the resulting character strings. Specifically, the extraction unit 601 first segments the chapter-level text (for example, an article or an editorial) into several character strings using symbols, and then extracts language linear structures and language blocks from the character strings in turn (for the specific extraction steps, refer to the example analysis above). More specifically, the chapter-level text can be segmented into character strings according to punctuation marks such as the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, interpunct, title marks, proper noun mark, annotation mark, taboo-avoidance mark, missing-character mark, slash, marker sign, substitution mark, linking mark and arrow mark. For example, the text between any two punctuation marks can be extracted as a character string (at the start of an article, only one punctuation mark is needed). When determining keywords (language blocks), a chapter-based local substring statistics table (hash table) can be used as a temporary auxiliary dictionary. That is, a substring that appears in the temporary auxiliary dictionary can be determined to be a language block. Of course, a substring that does not appear in the local substring statistics table may also be determined to be a language block. A segmentation path tree based on multi-path planning can also be used as the segmentation model: character encodings such as English (ASCII), Simplified Chinese (GBK/GB18030) and Traditional Chinese (Taiwan BIG5, Hong Kong BIG5-HKSCS) are first uniformly converted to the UTF-8 encoding format before segmentation, and language blocks are extracted on the basis of multiple correct segmentation results. Once the language blocks have been extracted, the remaining part is the linear structure.
The inversion unit 602 is used to invert the extracted language linear structures and language blocks respectively. Specifically, for each qualifying language block, the inversion unit 602 compresses the document number, paragraph number, sentence number, word order number, HTML information and so on of the position where the language block occurs into one structure, and places it in the living document where the language block is located. A language block can be an arbitrary character string, mainly of the following categories: dictionary entries, proper names, vocabulary internal to proper names, various phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, and so on. Likewise, for each qualifying language linear structure, the inversion unit 602 can compress the document number, paragraph number, sentence number, word order number, HTML information and so on of the position where the language linear structure occurs into one structure, and place it in the living document where the language block is located.
The indexing unit 603 is used to create a language linear structure sub-index and a language block sub-index, and to merge the language linear structure sub-index and the language block sub-index to form a whole index. Specifically, the indexing unit 603 writes all language block index terms in memory to a vocabulary file, merges the inverted hits and writes them to an inv_lists file, and writes the association information between the two to a dictionary file. These three files constitute a complete, independent index run, namely the language block sub-index. Likewise, all linear structure index terms in memory are written to a vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a dictionary file. These three files constitute a complete, independent index run, namely the linear structure sub-index.
Finally, the indexing unit 603 merges the language linear structure sub-index and the language block sub-index to form the whole index.
The matching information feedback unit 604 is used to extract a language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index, to feed back to the user the information that matches the language linear structure and language blocks extracted from the user's retrieval input. In one embodiment, the matching information feedback unit 604 is used to feed back to the user, in descending order of the matching degree of the language linear structure and language blocks, the information that matches the language linear structure and language blocks extracted from the user's retrieval input. The more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree. In one embodiment, the matching information feedback unit 604 is further used to preset a language linear structure repeat weight and a language block repeat weight; to calculate, based on the language linear structure repeat weight, a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index; and to calculate, based on the language block repeat weight, a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index; wherein the higher the sum of the first overlap index and the second overlap index, the higher the matching degree. In one embodiment, the matching information feedback unit 604 is used to retrieve the language linear structure and the language blocks of the input character string in the whole index respectively, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string, and to feed back to the user the information involved in the corresponding language linear structure and the corresponding language blocks in the whole index. In one embodiment, the user's retrieval input character string is expressed in a first language; in this case, the matching information feedback unit 604 is used to extract, from the user's retrieval input character string, the language linear structure and language blocks expressed in the first language; to determine the language linear structure and language blocks expressed in a second language that correspond to the language linear structure and language blocks expressed in the first language; and, according to the whole index, to feed back to the user information that matches the language linear structure and language blocks expressed in the second language and that is likewise expressed in the second language.
Based on the above detailed description, an embodiment of the present invention also proposes a natural language processing system based on semantic recognition. Fig. 7 is a structure diagram of the natural language processing system based on semantic recognition according to an embodiment of the present invention. As shown in Fig. 7, the system includes an information collection apparatus 301, a data storage device 302, a natural language processing device 303, an index storage device 304 and a retrieval service device 305, wherein:
The information collection apparatus 301 is used to scan and probe the internet and crawl information from it. The data storage device 302 is used to store the internet information crawled by the information collection apparatus, and preferably provides fast location and lookup of the internet information. The natural language processing device 303 is used to segment the chapter-level text stored in the data storage device 302 into character strings using symbols, and to extract language linear structures and language blocks from the resulting character strings; to invert the extracted language linear structures and language blocks respectively; and to create a language linear structure sub-index and a language block sub-index and merge them to form a whole index. The index storage device 304 is used to store the whole index generated by the natural language processing device 303. The retrieval service device 305 is used to extract a language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index stored in the index storage device, to feed back to the user the information that matches the language linear structure and language blocks extracted from the user's retrieval input.
The information collection apparatus 301 may further receive uploaded information (such as news resources) provided by newspapers and periodicals, radio and television, other media members and so on. Moreover, the retrieval service device 305 may let ordinary users query news free of charge, and may open high-end services to professional users after they register and pay. Preferably, the natural language processing device 303 is used to segment the chapter-level text into character strings according to the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, interpunct, title marks, proper noun mark, annotation mark, taboo-avoidance mark, missing-character mark, slash, marker sign, substitution mark, linking mark and arrow mark. Preferably, the natural language processing device 303 is used to take a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, to uniformly convert the character encodings of the chapter-level text to the UTF-8 encoding format, and to segment the chapter-level text converted to UTF-8 into character strings using symbols. Moreover, the retrieval service device 305 can be used to feed back to the user, in descending order of matching degree, the information that matches the language linear structure and language blocks extracted from the user's retrieval input. In one embodiment, the retrieval service device 305 is used to feed back to the user, in descending order of the matching degree of the language linear structure and language blocks, the information that matches the language linear structure and language blocks extracted from the user's retrieval input. Preferably, the more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree.
In one embodiment, the retrieval service device 305 is further used to preset a language linear structure repeat weight and a language block repeat weight; to calculate, based on the language linear structure repeat weight, a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index; and to calculate, based on the language block repeat weight, a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index; wherein the higher the sum of the first overlap index and the second overlap index, the higher the matching degree. In one embodiment, the retrieval service device 305 is used to retrieve the language linear structure and the language blocks of the input character string in the whole index respectively, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string, and to feed back to the user the information involved in the corresponding language linear structure and the corresponding language blocks in the whole index. In one embodiment, the retrieval service device 305 is used to extract, from the user's retrieval input character string, the language linear structure and language blocks expressed in a first language; to determine the language linear structure and language blocks expressed in a second language that correspond to the language linear structure and language blocks expressed in the first language; and, according to the whole index, to feed back to the user information that matches the language linear structure and language blocks expressed in the second language and that is likewise expressed in the second language. Optionally, the first language is English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc., and the second language is Chinese. The first language can also be Chinese, with the second language being English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc.
In summary, in embodiments of the present invention, text content is photographed and the photograph is sent to the cloud; the cloud performs optical character recognition on the photograph to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content. With embodiments of the present invention, users can search quickly and conveniently and play related multimedia information. It can be seen that embodiments of the present invention perform retrieval based on recognized content, which makes quick lookup convenient for the user.
Moreover, the user can also use an access entry to play related multimedia information, which is convenient for the user.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement and so on made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. a kind of information retrieval method based on Text region, which is characterized in that this method includes:
Word content is shot, and sends shooting photo to high in the clouds;
High in the clouds executes optical character identification the shooting photo is converted to word content, based on described to the shooting photo Word content retrieves related content, and pushes the retrieval result for including the related content.
2. the information retrieval method according to claim 1 based on Text region, which is characterized in that in the shooting word Hold, and sends shooting photo and be to high in the clouds:Word content is shot using wearable device, and is sent using the wearable device Photo is shot to high in the clouds;
It is described that related content is retrieved based on word content, and push the retrieval result comprising the related content and include:High in the clouds base Relevant audio file or video file are retrieved in the word content, and the audio file or video file are sent to institute State wearable device;
This method further includes:
The wearable device plays the audio file or video file.
3. The information retrieval method based on text recognition according to claim 1, characterized in that photographing the text content and sending the photo to the cloud comprises: photographing the text content with a mobile terminal, and sending the photo to the cloud with the mobile terminal;
retrieving related content based on the text content and pushing the retrieval result containing the related content comprises: retrieving, by the cloud, a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal;
the method further comprises:
playing, by the mobile terminal, the audio file or video file.
4. The information retrieval method based on text recognition according to claim 1, characterized in that the method further comprises, in advance:
segmenting, by the cloud, chapter-level text into character strings at symbols, extracting language linear structures and language blocks from the segmented character strings, organizing the extracted language linear structures and language blocks separately, creating a language linear structure sub-index and a language block sub-index, and merging the language linear structure sub-index and the language block sub-index to form a whole index;
retrieving related content based on the text content and pushing the retrieval result containing the related content comprises:
extracting the language linear structures and language blocks of the text content, and pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content.
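The indexing steps of claim 4 can be sketched as below. The claims do not define how language linear structures and language blocks are extracted, so this sketch makes a loudly labeled assumption: each punctuation-delimited string is treated as one language linear structure, and each whitespace-separated token as one language block. All function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Symbols used as cut points (ASCII and full-width punctuation).
PUNCT = re.compile(r"[,.;:!?，。；：！？、]+")


def cut_by_symbols(text: str) -> List[str]:
    """Segment text into character strings at punctuation symbols."""
    return [piece.strip() for piece in PUNCT.split(text) if piece.strip()]


def extract_units(strings: List[str]) -> Tuple[List[str], List[str]]:
    # ASSUMPTION: a cut string stands in for a "language linear structure"
    # and a whitespace-separated token stands in for a "language block".
    structures = sorted(set(strings))
    blocks = sorted({token for s in strings for token in s.split()})
    return structures, blocks


def build_whole_index(docs: Dict[str, str]) -> Dict[str, Dict[str, Set[str]]]:
    """Create the two sub-indexes and merge them into one whole index."""
    structure_index: Dict[str, Set[str]] = defaultdict(set)
    block_index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        structures, blocks = extract_units(cut_by_symbols(text))
        for s in structures:
            structure_index[s].add(doc_id)
        for b in blocks:
            block_index[b].add(doc_id)
    return {"structure": dict(structure_index), "block": dict(block_index)}


def lookup(index: Dict[str, Dict[str, Set[str]]], query_text: str) -> Set[str]:
    """Return ids of documents sharing a structure or a block with the query."""
    structures, blocks = extract_units(cut_by_symbols(query_text))
    hits: Set[str] = set()
    for s in structures:
        hits |= index["structure"].get(s, set())
    for b in blocks:
        hits |= index["block"].get(b, set())
    return hits


docs = {
    "d1": "the moon is bright, frost on the ground.",
    "d2": "markets fell today; traders worried.",
}
index = build_whole_index(docs)
```

Here `lookup(index, "frost on the ground")` returns the set containing only `"d1"`. Keeping the two sub-indexes under separate keys of one dictionary is just one simple way to "merge" them while still being able to score structures and blocks differently.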
5. The information retrieval method based on text recognition according to claim 4, characterized in that segmenting the chapter-level text into character strings at symbols comprises:
using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text to the UTF-8 encoding format; and segmenting the chapter-level text converted to the UTF-8 encoding format into character strings at symbols;
pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content comprises: pushing the information that matches the language linear structures and language blocks extracted from the text content in descending order of the matching degree of the language linear structures and language blocks, wherein the more characters a language linear structure extracted from the text content has in common with a language linear structure in the whole index, the higher the matching degree.
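Two parts of claim 5 lend themselves to a short sketch: normalizing the encoding to UTF-8 and ranking by the number of repeated characters. The claim does not say how repeated characters are counted, so the sketch assumes a multiset intersection of characters; the segmentation path tree and auxiliary dictionary are not modeled here. Function names are hypothetical.

```python
from collections import Counter
from typing import List


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding (for example "gbk") as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def repeated_chars(a: str, b: str) -> int:
    """Count characters shared by a and b, respecting multiplicity.

    ASSUMPTION: "number of repeated characters" is taken to be the size of
    the multiset intersection of the two strings, ignoring whitespace.
    """
    count_a = Counter("".join(a.split()))
    count_b = Counter("".join(b.split()))
    return sum((count_a & count_b).values())


def rank_by_matching_degree(query: str, indexed: List[str]) -> List[str]:
    """Order indexed structures from highest to lowest overlap with the query."""
    return sorted(indexed, key=lambda s: repeated_chars(query, s), reverse=True)


utf8_bytes = to_utf8("床前明月光".encode("gbk"), "gbk")
ranked = rank_by_matching_degree(
    "床前明月光",
    ["疑是地上霜", "明月几时有", "床前明月光"],
)
```

With this counting rule, the identical line shares five characters with the query, "明月几时有" shares two, and "疑是地上霜" shares none, so they are pushed in that order.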
6. The information retrieval method based on text recognition according to claim 5, characterized in that
the method further comprises: presetting a language linear structure repetition weight and a language block repetition weight;
calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the text content and the language linear structure in the whole index, and calculating, based on the language block repetition weight, a second overlap index between the language block extracted from the text content and the language block in the whole index; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
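The weighted scoring of claim 6 might look like the following sketch. The claim only states that each overlap index is calculated from its preset weight and that a higher sum means a higher matching degree; the concrete formulas here (weight times best character overlap for structures, weight times number of shared blocks) and the default weight values are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (language linear structures, language blocks) extracted from one text
Units = Tuple[List[str], List[str]]


def char_overlap(a: str, b: str) -> int:
    """Number of characters a and b have in common, respecting multiplicity."""
    return sum((Counter(a) & Counter(b)).values())


def first_overlap_index(query_structures: List[str],
                        indexed_structures: List[str],
                        weight: float) -> float:
    # ASSUMPTION: for each query structure take its best character overlap
    # with any indexed structure, sum those, and scale by the preset weight.
    best = (max((char_overlap(q, s) for s in indexed_structures), default=0)
            for q in query_structures)
    return weight * sum(best)


def second_overlap_index(query_blocks: List[str],
                         indexed_blocks: List[str],
                         weight: float) -> float:
    # ASSUMPTION: count blocks that appear in both, scaled by the preset weight.
    return weight * len(set(query_blocks) & set(indexed_blocks))


def matching_degree(query: Units, candidate: Units,
                    structure_weight: float = 2.0,   # hypothetical preset
                    block_weight: float = 1.0) -> float:  # hypothetical preset
    """Sum of the two overlap indexes; a higher sum means a better match."""
    return (first_overlap_index(query[0], candidate[0], structure_weight)
            + second_overlap_index(query[1], candidate[1], block_weight))


query = (["abc"], ["ab", "c"])
doc_a = (["abd", "xyz"], ["ab", "d"])
doc_b = (["xyz"], ["c"])
scores = {"doc_a": matching_degree(query, doc_a),
          "doc_b": matching_degree(query, doc_b)}
```

Under these assumed formulas `doc_a` scores higher than `doc_b`, so it would be pushed first.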
7. An information retrieval system based on text recognition, characterized in that the system comprises:
a photographing device, configured to photograph text content and send the photo to the cloud;
a retrieval system, located in the cloud, configured to perform optical character recognition on the photo to convert the photo into text content, retrieve related content based on the text content, and push a retrieval result containing the related content to the photographing device.
8. The information retrieval system based on text recognition according to claim 7, characterized in that
the photographing device is a wearable device;
the retrieval system is configured to retrieve a related audio file or video file based on the text content, and send the audio file or video file to the wearable device;
the wearable device is further configured to play the audio file or video file.
9. The information retrieval system based on text recognition according to claim 7, characterized in that
the photographing device is a mobile terminal;
the retrieval system is configured to retrieve a related audio file or video file based on the text content, and send the audio file or video file to the mobile terminal;
the mobile terminal is further configured to play the audio file or video file.
10. The information retrieval system based on text recognition according to claim 7, characterized in that
the retrieval system is further configured to, in advance, segment chapter-level text into character strings at symbols, extract language linear structures and language blocks from the segmented character strings, organize the extracted language linear structures and language blocks separately, create a language linear structure sub-index and a language block sub-index, and merge the language linear structure sub-index and the language block sub-index to form a whole index; retrieving related content based on the text content and pushing the retrieval result containing the related content comprises: extracting the language linear structures and language blocks of the text content, and pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content;
wherein segmenting the chapter-level text into character strings at symbols comprises:
using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text to the UTF-8 encoding format; and segmenting the chapter-level text converted to the UTF-8 encoding format into character strings at symbols;
pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content comprises: pushing the information that matches the language linear structures and language blocks extracted from the text content in descending order of the matching degree of the language linear structures and language blocks, wherein the more characters a language linear structure extracted from the text content has in common with a language linear structure in the whole index, the higher the matching degree.
CN201710251901.1A 2017-04-18 2017-04-18 A kind of information retrieval method and system based on Text region Pending CN108733687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710251901.1A CN108733687A (en) 2017-04-18 2017-04-18 A kind of information retrieval method and system based on Text region

Publications (1)

Publication Number Publication Date
CN108733687A true CN108733687A (en) 2018-11-02

Family

ID=63924739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710251901.1A Pending CN108733687A (en) 2017-04-18 2017-04-18 A kind of information retrieval method and system based on Text region

Country Status (1)

Country Link
CN (1) CN108733687A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101668071A (en) * 2009-08-14 2010-03-10 惠州Tcl移动通信有限公司 Mobile communication terminal with scanning function and implement method thereof
CN102789464A (en) * 2011-05-20 2012-11-21 陈伯妤 Natural language processing method, device and system based on semanteme recognition
CN104199834A (en) * 2014-08-04 2014-12-10 徐�明 Method and system for interactively obtaining and outputting remote resources on surface of information carrier
CN104217197A (en) * 2014-08-27 2014-12-17 华南理工大学 Touch reading method and device based on visual gestures

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489595A (en) * 2020-04-16 2020-08-04 广东小天才科技有限公司 Method and device for feeding back test information in live broadcast teaching process
CN111489596A (en) * 2020-04-16 2020-08-04 广东小天才科技有限公司 Method and device for information feedback in live broadcast teaching process
CN111489595B (en) * 2020-04-16 2022-06-24 广东小天才科技有限公司 Method and device for feeding back test information in live broadcast teaching process

Similar Documents

Publication Publication Date Title
JP5746286B2 (en) High-performance data metatagging and data indexing method and system using a coprocessor
CN101452470B (en) Summary-style network search engine system and search method and uses
Huston et al. Evaluating verbose query processing techniques
US8577882B2 (en) Method and system for searching multilingual documents
CN102789464B (en) Natural language processing methods, devices and systems based on semantics identity
CN109902288A (en) Intelligent clause analysis method, device, computer equipment and storage medium
US20080201314A1 (en) Method and apparatus for using multiple channels of disseminated data content in responding to information requests
CA2774278A1 (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
Jabbar et al. A survey on Urdu and Urdu like language stemmers and stemming techniques
AU2013290306A1 (en) Weight-based stemming for improving search quality
CN101000611A (en) Method for providing and inquiry information for public by interconnection network
CN110807326A (en) Short text keyword extraction method combining GPU-DMM and text features
CN102117285B (en) Search method based on semantic indexing
Subhashini et al. Shallow NLP techniques for noun phrase extraction
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
CN108733687A (en) A kind of information retrieval method and system based on Text region
Modi et al. Multimodal web content mining to filter non-learning sites using NLP
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
Li et al. CLC-RS: a Chinese legal case retrieval system with masked language ranking
Chen et al. Chinese named entity abbreviation generation using first-order logic
KR20020006223A (en) Automatic Indexing Robot System And A Method
KR20130142192A (en) Assistance for video content searches over a communication network
CN105488035A (en) Conversational natural language processing method and device
Sheng et al. Cross-Language Text Search Algorithm Based on Context-Compatible Algorithms
Zuo et al. Cross-Genre Retrieval for Information Integrity: A COVID-19 Case Study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181102
