CN108733687A - Information retrieval method and system based on text recognition - Google Patents
- Publication number
- CN108733687A (application number CN201710251901.1A)
- Authority
- CN
- China
- Prior art keywords
- language
- linear structure
- word
- content
- block
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/55—Push-based network services
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention disclose an information retrieval method and system based on text recognition. The method includes: photographing text content and sending the captured photo to the cloud; the cloud performs optical character recognition on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content. With embodiments of the present invention, users can search quickly and conveniently and play related multimedia information.
Description
Technical field
The present invention relates to the field of search, and more particularly to an information retrieval method and system based on text recognition.
Background art
Information retrieval (Information Retrieval) refers to the process and techniques of organizing information in a certain way and finding relevant information according to the needs of information users. Information retrieval in the narrow sense is only the latter half of this process, namely finding the required information within an information collection, which is what is commonly called information search (Information Search or Information Seek).
Commonly used information retrieval methods generally include the common method, the retrospective method, and the discrete method. The common method searches documents using retrieval tools such as bibliographies, abstracts, and indexes. The key to using this method is to be familiar with the nature, features, and search procedure of each retrieval tool and to search from different angles. The common method can be further divided into forward search and reverse search. Forward search retrieves in chronological order from the past to the present; it is costly and inefficient. Reverse search retrieves in reverse chronological order from the recent to the distant past; it emphasizes recent material and current information, is more proactive, and gives better results. The retrospective method repeatedly traces the references attached to existing documents; when there is no retrieval tool, or the retrieval tools are incomplete, this method can obtain highly targeted material.
Since the beginning of the 21st century, with the rapid development of the Internet and the acceleration of world economic integration, network information has expanded sharply and international exchange has become increasingly frequent. Retrieving information over the network to help people obtain information quickly has become an inevitable trend.
Summary of the invention
Embodiments of the present invention propose an information retrieval method and system based on text recognition, so as to feed information back to users conveniently.
The technical solution of the embodiments of the present invention is realized as follows:
An information retrieval method based on text recognition, the method including: photographing text content and sending the captured photo to the cloud; the cloud performing optical character recognition on the captured photo to convert the captured photo into text content, retrieving related content based on the text content, and pushing a retrieval result containing the related content.
In one embodiment, photographing text content and sending the captured photo to the cloud is: photographing the text content with a wearable device, and sending the captured photo to the cloud with the wearable device;
retrieving related content based on the text content and pushing a retrieval result containing the related content includes: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the wearable device;
the method further includes: the wearable device playing the audio file or video file.
In one embodiment, photographing text content and sending the captured photo to the cloud is: photographing the text content with a mobile terminal, and sending the captured photo to the cloud with the mobile terminal;
retrieving related content based on the text content and pushing a retrieval result containing the related content includes: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal;
the method further includes: the mobile terminal playing the audio file or video file.
In one embodiment, the method further includes, in advance: the cloud cutting chapter-level text into character strings using symbols, extracting language linear structures and language blocks from the resulting character strings, arranging the extracted language linear structures and language blocks separately, creating a language linear structure sub-index and a language block sub-index, and merging the language linear structure sub-index and the language block sub-index to form an overall index; retrieving related content based on the text content and pushing a retrieval result containing the related content includes: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content.
In one embodiment, cutting the chapter-level text into character strings using symbols includes: using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text to the UTF-8 encoding format; and cutting the chapter-level text, once converted to UTF-8, into character strings using symbols; pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content includes: pushing that information in descending order of the matching degree of the language linear structure and language blocks, wherein the more characters the language linear structure extracted from the text content has in common with a language linear structure in the overall index, the higher the matching degree.
In one embodiment, the method further includes: presetting a language linear structure repetition weight and a language block repetition weight; calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the text content and the language linear structures in the overall index, and calculating, based on the language block repetition weight, a second overlap index between the language blocks extracted from the text content and the language blocks in the overall index; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
An information retrieval system based on text recognition, the system including: a photographing device for photographing text content and sending the captured photo to the cloud; and a retrieval system, located in the cloud, for performing optical character recognition on the captured photo to convert the captured photo into text content, retrieving related content based on the text content, and pushing to the photographing device a retrieval result containing the related content.
In one embodiment, the photographing device is a wearable device; the retrieval system is configured to retrieve a related audio file or video file based on the text content and send the audio file or video file to the wearable device; the wearable device is further configured to play the audio file or video file.
In one embodiment, the photographing device is a mobile terminal; the retrieval system is configured to retrieve a related audio file or video file based on the text content and send the audio file or video file to the mobile terminal; the mobile terminal is further configured to play the audio file or video file.
In one embodiment, the retrieval system is further configured to, in advance, cut chapter-level text into character strings using symbols, extract language linear structures and language blocks from the resulting character strings, arrange the extracted language linear structures and language blocks separately, create a language linear structure sub-index and a language block sub-index, and merge the language linear structure sub-index and the language block sub-index to form an overall index; retrieving related content based on the text content and pushing a retrieval result containing the related content includes: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content; wherein cutting the chapter-level text into character strings using symbols includes: using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text to the UTF-8 encoding format; and cutting the chapter-level text, once converted to UTF-8, into character strings using symbols; pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content includes: pushing that information in descending order of the matching degree of the language linear structure and language blocks, wherein the more characters the language linear structure extracted from the text content has in common with a language linear structure in the overall index, the higher the matching degree.
In embodiments of the present invention, text content is photographed and the captured photo is sent to the cloud; the cloud performs optical character recognition on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content. Because retrieval is performed on the recognized content, users can search quickly and conveniently and play related multimedia information. Furthermore, a wearable device can play the audio file or video file related to the photographed text content, so that the user obtains information intuitively. Moreover, the user can also play the related multimedia information through an access entry, which is convenient for the user.
Description of the drawings
Fig. 1 is a flow chart of the information retrieval method based on text recognition according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of information retrieval using wearable glasses according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of playing the retrieved multimedia file using wearable glasses according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of information retrieval based on a mobile terminal according to an embodiment of the present invention.
Fig. 5 is a flow chart of the natural language processing method based on semantic recognition according to an embodiment of the present invention.
Fig. 6 is a structural diagram of the natural language processing device based on semantic recognition according to an embodiment of the present invention.
Fig. 7 is a structural diagram of the natural language processing system based on semantic recognition according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In embodiments of the present invention, text content such as a book is photographed with a device such as a mobile phone and uploaded to the back end, which automatically recognizes it and retrieves the matching book or other multimedia documents. There may be multiple retrieval results, which are displayed in descending order of relevance. The user can tap a book in the retrieval result list and open its content in the mobile browser to read; the user can also play the corresponding audio or video on a wearable device.
Fig. 1 is a flow chart of the information retrieval method based on text recognition according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101: photograph text content and send the captured photo to the cloud;
Step 102: the cloud performs optical character recognition (OCR) on the captured photo to convert it into text content, retrieves related content based on the text content, and pushes a retrieval result containing the related content.
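The cloud-side part of step 102 can be sketched as a small pipeline. This is only an illustrative sketch: the names `handle_photo`, `RetrievalResult`, `fake_ocr`, and `fake_search` are hypothetical, and the OCR and search components are stand-ins so that the example runs without any real recognition engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalResult:
    title: str
    media_path: str  # hypothetical location of the audio or video file


def handle_photo(
    photo: bytes,
    ocr: Callable[[bytes], str],
    search: Callable[[str], List[RetrievalResult]],
    push: Callable[[List[RetrievalResult]], None],
) -> List[RetrievalResult]:
    """Cloud-side handling of one uploaded photo (step 102)."""
    text = ocr(photo)        # convert the captured photo to text content
    results = search(text)   # retrieve related content for that text
    push(results)            # push the retrieval result back to the device
    return results


# Stand-in components (assumptions, not part of the patent text).
def fake_ocr(photo: bytes) -> str:
    return photo.decode("utf-8")


def fake_search(text: str) -> List[RetrievalResult]:
    catalog = {"peach garden": RetrievalResult("Episode 1", "media/ep1.mp4")}
    return [item for key, item in catalog.items() if key in text.lower()]


pushed: List[RetrievalResult] = []
results = handle_photo(b"Oath in the Peach Garden", fake_ocr, fake_search, pushed.extend)
```

Passing the OCR, search, and push steps in as functions keeps the flow itself independent of whichever recognition engine or delivery channel is actually used.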
In one embodiment, photographing text content and sending the captured photo to the cloud is: photographing the text content with a wearable device, and sending the captured photo to the cloud with the wearable device; retrieving related content based on the text content and pushing a retrieval result containing the related content includes: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the wearable device; the method further includes: the wearable device playing the audio file or video file.
The wearable device may be implemented as wearable glasses, a wearable watch, a wearable bracelet, a wearable belt, and so on.
In one embodiment, photographing text content and sending the captured photo to the cloud is: photographing the text content with a mobile terminal, and sending the captured photo to the cloud with the mobile terminal; retrieving related content based on the text content and pushing a retrieval result containing the related content includes: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal; the method further includes: the mobile terminal playing the audio file or video file.
In one embodiment, the method further includes, in advance: the cloud cutting chapter-level text into character strings using symbols, extracting language linear structures and language blocks from the resulting character strings, arranging the extracted language linear structures and language blocks separately, creating a language linear structure sub-index and a language block sub-index, and merging the language linear structure sub-index and the language block sub-index to form an overall index; retrieving related content based on the text content and pushing a retrieval result containing the related content includes: extracting the language linear structure and language blocks of the text content, and pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content.
In one embodiment, cutting the chapter-level text into character strings using symbols includes: using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text to the UTF-8 encoding format; and cutting the chapter-level text, once converted to UTF-8, into character strings using symbols; pushing, according to the overall index, information that matches the language linear structure and language blocks extracted from the text content includes: pushing that information in descending order of the matching degree of the language linear structure and language blocks, wherein the more characters the language linear structure extracted from the text content has in common with a language linear structure in the overall index, the higher the matching degree.
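Two of the steps above lend themselves to a short sketch: normalizing text to UTF-8, and ordering candidates by the number of repeated characters. The sketch assumes that the source encoding is known and that "repeated characters" means characters shared by the two structures, counted with multiplicity; the text fixes neither point, and the function names are hypothetical.

```python
from collections import Counter


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def repeated_chars(a: str, b: str) -> int:
    """Characters shared by a and b, counted with multiplicity (an assumption)."""
    return sum((Counter(a) & Counter(b)).values())


def rank_structures(query_structure, indexed_structures):
    """Order indexed structures from highest to lowest matching degree."""
    return sorted(
        indexed_structures,
        key=lambda s: repeated_chars(query_structure, s),
        reverse=True,
    )


ranked = rank_structures(
    "x craze sweeps x",
    ["x grows", "x craze sweeps x", "x sweeps x"],
)
```

In this example the identical structure ranks first, followed by the structure that shares more characters with the query.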
In one embodiment, the method further includes: presetting a language linear structure repetition weight and a language block repetition weight; calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the text content and the language linear structures in the overall index, and calculating, based on the language block repetition weight, a second overlap index between the language blocks extracted from the text content and the language blocks in the overall index; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
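The weighted combination can be sketched as follows. The overlap measure used here (number of shared items multiplied by the preset weight) and the default weight values are assumptions chosen for illustration; the text only states that each overlap index is calculated from its weight and that a higher sum means a higher matching degree.

```python
def overlap_index(weight, query_items, indexed_items):
    """Weight times the number of items shared by query and index entry."""
    return weight * len(set(query_items) & set(indexed_items))


def matching_degree(query, entry, structure_weight=2.0, block_weight=1.0):
    """Sum of the first (structure) and second (block) overlap indexes."""
    first = overlap_index(structure_weight, query["structures"], entry["structures"])
    second = overlap_index(block_weight, query["blocks"], entry["blocks"])
    return first + second


query = {"structures": ["x craze sweeps x"], "blocks": ["avatar", "china"]}
entry_a = {"structures": ["x craze sweeps x"], "blocks": ["china"]}
entry_b = {"structures": ["x grows"], "blocks": ["avatar", "china"]}

scores = {
    "a": matching_degree(query, entry_a),
    "b": matching_degree(query, entry_b),
}
```

With a larger structure weight, an entry that shares the linear structure outranks one that shares only blocks, which is consistent with the emphasis the description places on the linear structure.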
Fig. 2 is a schematic diagram of information retrieval using wearable glasses according to an embodiment of the present invention. Fig. 3 is a schematic diagram of playing the retrieved multimedia file using wearable glasses according to an embodiment of the present invention. The wearable glasses shown in Fig. 2 and Fig. 3 have photographing and networking functions as well as a media playback function. Using the photographing function, the wearable glasses photograph the text content; using the networking function, the wearable glasses send the captured photo to the cloud; using the media playback function, the wearable glasses play the related audio file or video file retrieved based on the text content.
In one embodiment, a correspondence between text content and the chapter segments of multimedia information is established in the cloud. After performing optical character recognition on the captured photo to convert it into text content, the cloud retrieves the corresponding chapter segment based on the text content and pushes a retrieval result containing that chapter segment. This allows the user to quickly locate and play the content of real interest.
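A minimal sketch of such a correspondence is a lookup table from text fragments to chapter segments. The table contents, the `ChapterSegment` fields, and the substring matching rule are all hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple, Optional


class ChapterSegment(NamedTuple):
    media_id: str       # hypothetical identifier of the audio or video work
    chapter: int        # chapter segment within that work
    start_seconds: int  # playback position where the segment begins


# Hypothetical correspondence table maintained in the cloud.
SEGMENT_MAP = {
    "peach garden": ChapterSegment("sample-series", 1, 0),
    "borrowing arrows": ChapterSegment("sample-series", 2, 0),
}


def find_segment(recognized_text: str) -> Optional[ChapterSegment]:
    """Return the chapter segment whose key phrase occurs in the OCR text."""
    text = recognized_text.lower()
    for phrase, segment in SEGMENT_MAP.items():
        if phrase in text:
            return segment
    return None


segment = find_segment("The oath in the Peach Garden")
```

Returning a segment with a start position, rather than a whole file, is what lets playback begin at the passage the user photographed.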
For example, a user wearing wearable glasses has a copy of the book Romance of the Three Kingdoms in front of them. The wearable glasses photograph the text "The Emperor collapsed in fright; his attendants rushed him into the palace, and the officials all fled. In a moment the serpent vanished. Suddenly there was great thunder and heavy rain, together with hail, which did not stop until midnight and destroyed countless houses." and send the captured photo to the cloud. The cloud performs optical character recognition on the captured photo to obtain the text content (i.e. the passage quoted above). The cloud then searches based on the text content, determines that it comes from Chapter 1 of Romance of the Three Kingdoms ("Three heroes swear brotherhood at the feast in the Peach Garden; the heroes win their first merit by slaying the Yellow Turbans"), and pushes to the wearable glasses a retrieval result containing the related multimedia content of the first episode of the TV series Romance of the Three Kingdoms. After receiving the multimedia content, the wearable glasses can play the video content of the first episode.
Fig. 4 is a schematic diagram of information retrieval based on a mobile terminal according to an embodiment of the present invention. For a mobile terminal, a WeChat official account or a standalone app can serve as the service entry; using a WeChat official account takes full advantage of WeChat's huge user base and reduces initial promotion costs. Independent maintenance and query services are built in the back end, providing a book content retrieval service and basic book information maintenance functions.
In Fig. 4, the book query and retrieval system provides a PC-side service management system that lets system administrators log in to maintain and enter book information. Basic book information includes the title, author, publication date, ISBN, book barcode value, chapter table of contents, brief introduction, book cover, and so on. The book query and retrieval system captures book information, including basic book information and book content information, from the book management system mainly through customized scheduled tasks. Users follow the WeChat official account (tentatively named "Palm Reading"), enter its "scan to find a book" and "photograph to find a book" functions, and look up related books to read. Specifically:
System management: implements basic permission and user management functions;
Local book information library: stores basic book information for fast querying. Basic information includes the title, author, publication date, ISBN, book barcode value, chapter table of contents, brief introduction, and book cover;
Local book index library: a file system that stores book content and index information, so that the content can subsequently be indexed in full text;
Book query service: implements comprehensive queries of book information; the query scope is the basic information in the local book information library;
Book retrieval engine: implements full-text retrieval of book content information;
PC-side service management entry: provides a service management portal accessible from a PC browser, including service functions such as user management, permission management, and book management;
Book information synchronization interface: connects to the external book management system, periodically captures the latest book information, saves basic book information and recognized book content information, and generates full-text index information for the book content;
Mobile service interface: provides a general interface for mobile services and handles mobile uplink and downlink data, for example performing OCR on uploaded pictures. Currently only the WeChat official account is connected; other mobile entries, such as a native app or mobile web pages, can be added later;
WeChat official account: provides the handheld mobile access entry.
Moreover, at a scheduled time each day, the book query system sends a book information update request to the source book system; the source book system returns the book information newly added that day as an increment, including basic book information and book content information; the book query and retrieval system receives the incremental update and starts internal data processing: it receives the returned book information, such as book content information recognized by OCR; starts the information distribution process and captures the basic book information; performs automatic word segmentation on the OCR-recognized book content information; creates book index information from the segmentation result; and saves the basic book information, book content, and book index information.
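The daily incremental update can be sketched as one function that applies an increment to local storage. The record fields, the in-memory `store` layout, and whitespace splitting as a stand-in for automatic word segmentation are all assumptions made so that the example is self-contained.

```python
def apply_daily_increment(store, increment):
    """Save basic info, content, and index entries for each new book."""
    for book in increment:
        isbn = book["isbn"]
        # Save the basic book information.
        store["info"][isbn] = {"title": book["title"], "author": book["author"]}
        # Save the (already OCR-recognized) book content.
        store["content"][isbn] = book["content"]
        # Stand-in for automatic word segmentation: split on whitespace.
        for word in set(book["content"].lower().split()):
            store["index"].setdefault(word, set()).add(isbn)
    return store


store = {"info": {}, "content": {}, "index": {}}
increment = [
    {
        "isbn": "000-0",
        "title": "Sample Book",
        "author": "Anon",
        "content": "Thunder and heavy rain",
    }
]
apply_daily_increment(store, increment)
```

Because only the day's increment is processed, existing entries in the three stores are left untouched.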
In addition, in embodiments of the present invention, core technology based on computer semantic recognition helps the computer identify more intelligently the exact meaning behind information. By analyzing information deeply and at multiple levels, the computer not only reads the literal code but also identifies the intention the information is meant to express, which makes the computer more intelligent and lets it communicate with people in a more human way.
Embodiments of the present invention mainly use the technique of analyzing the language linear structure plus keywords (i.e. language blocks), accurately extracting the real intention of information from the linear structure and keywords of the language. A sentence to be analyzed consists of a linear structure and keywords (i.e. language blocks). The key to semantic recognition is identifying the linear structure of the sentence. The meaning of language is hidden in the linear structure of the sentence, which is equivalent to the constant of the language. Semantics, and even meaning and thought, are all hidden in the linear structure of the sentence; by analyzing that structure, the intention can be identified. Keywords are equivalent to the variables of the language. When the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved, so accurate retrieval or translation results can be obtained. Moreover, structural analysis can accurately identify semantics in both bilingual and monolingual settings. By performing linear structure plus keyword analysis sentence by sentence on a vast body of documents, a sufficient set of sentence linear structures and keywords (i.e. language blocks) can be obtained.
For example: 1. Rural tourism is an important component of China's tourism industry and an important support for promoting tourism development. (Example 1); 2. China's economy is an important component of the world economy and an important support for promoting global financial stability. (Example 2)
Analyzing the two examples above shows the following:
"Rural tourism", "China's tourism industry", and "tourism development" are the variables of Example 1, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "x is an important component of x and an important support for promoting x" (where x denotes a blank) is the linear structure of Example 1, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. Similarly, "China's economy", "the world economy", and "global financial stability" are the variables of Example 2, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "x is an important component of x and an important support for promoting x" (where x denotes a blank) is the linear structure of Example 2, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. The two examples therefore have the same linear structure and differ only in their variables. "x is an important component of x and an important support for promoting x" (where x denotes a blank) can be defined as a linear structure, and "rural tourism", "China's tourism industry", "tourism development", "China's economy", "the world economy", and "global financial stability" can be defined as keywords (i.e. language blocks). Some common proper nouns and/or noun phrases can be designated as variables, but variables are not limited to proper nouns and/or noun phrases. In some cases a variable can also be a common phrase or even a long sentence. In addition, the way a sentence is divided into variables and linear structure may not be unique. For the division with the fewest variables, the corresponding linear structure is called the minimal linear structure. Generally, the fewer the variables, the richer the information expressed by the corresponding linear structure can be considered to be, and the more accurate the corresponding search.
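The split into a constant linear structure and variable keywords can be illustrated by turning a structure into a pattern and reading the variables back out of a sentence. The sketch uses an English rendering of Examples 1 and 2; the function names and the regular-expression approach are assumptions for illustration, not the matching procedure of the embodiment.

```python
import re


def structure_to_regex(structure):
    """Turn a linear structure with 'x' blanks into an anchored pattern."""
    fixed_parts = re.split(r"\bx\b", structure)
    return re.compile("^" + "(.+?)".join(re.escape(p) for p in fixed_parts) + "$")


def extract_variables(structure, sentence):
    """Return the keywords filling the blanks, or None if the structure differs."""
    match = structure_to_regex(structure).match(sentence.rstrip("."))
    return match.groups() if match else None


STRUCTURE = "x is an important component of x and an important support for promoting x"

example_1 = extract_variables(
    STRUCTURE,
    "Rural tourism is an important component of China's tourism industry "
    "and an important support for promoting tourism development.",
)
example_2 = extract_variables(
    STRUCTURE,
    "China's economy is an important component of the world economy "
    "and an important support for promoting global financial stability.",
)
```

Both sentences match the same structure and differ only in the extracted variables, which mirrors the constant and variable distinction drawn above.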
As another example:
1. The Avatar craze sweeps China. (Example 3); 2. The stock speculation craze sweeps the world. (Example 4)
Analyzing the two examples above shows that "Avatar" and "China" are the variables of Example 3, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "The x craze sweeps x" (where x denotes a blank) is the linear structure of Example 3, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. Similarly, "stock speculation" and "the world" are the variables of Example 4, because when the corresponding parts (i.e. the variables) are replaced, the semantics are largely preserved. "The x craze sweeps x" (where x denotes a blank) is the linear structure of Example 4, that is, the constant of the language, because the meaning of the language is hidden in this linear structure. The two examples therefore have the same linear structure and differ only in their variables. "The x craze sweeps x" (where x denotes a blank) can be defined as a linear structure, and "Avatar", "China", "stock speculation", and "the world" can be defined as keywords (i.e. language blocks).
A further illustration:
1. They call on the European Commission to treat the market economy treatment (MET) application of Chinese enterprises objectively and fairly. (Example 5); 2. The International Football Federation calls on Ireland to treat the result of the World Cup qualifying match against the French team objectively and fairly. (Example 6); 3. The international community calls on the Six-Party Talks to treat the Korea issue objectively and fairly. (Example 7); 4. China calls on the Japanese government to treat the historical issues of World War II objectively and fairly. (Example 8)
Analysis of these four examples shows the following.
"They", "the European Commission" and "the market economy treatment (MET) application of Chinese enterprises" are the variables of Example 5, because the meaning of the sentence is essentially retained when these parts (i.e. the variables) are replaced. "x calls on x to treat x objectively and fairly" (where x denotes a blank) is the linear structure of Example 5, that is, the constant of the language, because the meaning of the language is carried by the linear structure.
Similarly, "the International Football Federation", "Ireland" and "the result of the World Cup qualifying match against the French team" are the variables of Example 6, and "x calls on x to treat x objectively and fairly" is its linear structure.
Similarly, "the international community", "the Six-Party Talks" and "the Korea issue" are the variables of Example 7, and "x calls on x to treat x objectively and fairly" is its linear structure.
Similarly, "China", "the Japanese government" and "the historical issues of World War II" are the variables of Example 8, and "x calls on x to treat x objectively and fairly" is its linear structure.
The four examples therefore share the same linear structure and differ only in their variables. "x calls on x to treat x objectively and fairly" (where x denotes a blank) can be defined as a linear structure, and "they", "the European Commission", "the market economy treatment (MET) application of Chinese enterprises", "the International Football Federation", "Ireland", "the result of the World Cup qualifying match against the French team", "the international community", "the Six-Party Talks", "the Korea issue", "China", "the Japanese government" and "the historical issues of World War II" can be defined as keywords (i.e. language blocks).
Based on the above analysis, by segmenting a large number of documents (including web documents, blogs, textbooks, various electronic documents, etc.) in the manner described above, a sufficiently large linear structure library and keyword (i.e. language block) library can be obtained.
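The separation of a sentence into language blocks and a linear structure can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes the language blocks are already known (the function name `extract_linear_structure` and the English renderings of Examples 3 and 4 are assumptions made for the sketch), and it simply replaces each known block with the blank x.

```python
def extract_linear_structure(sentence, language_blocks, blank="x"):
    """Replace every known language block in `sentence` with a blank.

    Returns (linear_structure, blocks_found_in_order).
    Longer blocks are tried first so that a long block is not shadowed
    by a shorter block that happens to be its prefix.
    """
    blocks = sorted((b for b in language_blocks if b), key=len, reverse=True)
    structure, found = [], []
    i = 0
    while i < len(sentence):
        for block in blocks:
            if sentence.startswith(block, i):
                structure.append(blank)
                found.append(block)
                i += len(block)
                break
        else:
            # No block starts here: the character belongs to the linear structure.
            structure.append(sentence[i])
            i += 1
    return "".join(structure), found


# Hypothetical English renderings of Examples 3 and 4.
known_blocks = {"Avatar", "China", "Stock speculation", "the world"}
ls3, blocks3 = extract_linear_structure("Avatar craze sweeps across China", known_blocks)
ls4, blocks4 = extract_linear_structure(
    "Stock speculation craze sweeps across the world", known_blocks
)
```

Both sentences reduce to the same linear structure, while the blocks removed from each sentence differ, which is the observation the examples above rely on.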
The natural language processing method based on semantic recognition according to the present invention is described in detail below. Fig. 5 is a flowchart of the natural language processing method based on semantic recognition according to an embodiment of the present invention. The flow shown in Fig. 5 can be applied on the cloud side of Fig. 1.
As shown in Fig. 5, the method includes:
Step 501: Segment chapter-level text into character strings using symbols, and extract language linear structures and language blocks from the segmented character strings. Here, the chapter-level text (for example, an article or an editorial) is first segmented into several character strings using symbols, and language linear structures and language blocks are then extracted in turn from the segmented character strings (for the specific extraction steps, refer to the example analysis above). "Chapter-level" here does not imply any specific limit on the number of words. In essence, as long as there are some words and the sentences formed from these words are meaningful, the words can be regarded as constituting chapter-level text.
More specifically, the chapter-level text can be segmented into character strings according to punctuation marks such as the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, separation dot, title marks, proper-noun mark, annotation mark, concealment mark, missing-character mark, slash, identification mark, substitution mark, bead-chain mark and/or arrow mark. For example, the text between any two punctuation marks can be extracted as a character string (at the beginning of an article, only one punctuation mark is needed).
When determining keywords (language blocks), a chapter-based local substring statistics table (hash table) can be used as a temporary auxiliary dictionary. That is, a string that appears in the temporary auxiliary dictionary can be determined to be a language block. However, a string that does not appear in the local substring statistics table can also be determined to be a language block. A segmentation route tree based on multi-path planning can also be used as the segmentation model: character encodings such as English (ASCII), Simplified Chinese (GBK/GB18030) and Traditional Chinese (Taiwan BIG5, Hong Kong BIG5-HKSCS) are first uniformly converted to the UTF-8 encoding format, segmentation is then performed, and language blocks are extracted on the basis of multiple correct segmentation results. After the language blocks have been extracted, the remaining part is the linear structure.
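The punctuation-based segmentation and the local substring statistics table of Step 501 can be sketched roughly as follows. This is a simplified illustration under stated assumptions: only a small subset of the listed punctuation marks is handled, substrings are counted as word n-grams of an English sample text, and the names `split_by_punctuation`, `local_substring_table` and the thresholds are hypothetical choices, not part of the source.

```python
import re
from collections import Counter

# Illustrative subset of the punctuation marks listed in Step 501 (assumption).
PUNCTUATION = "。？！，、；：.?!,;:"
_SPLIT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]+")


def split_by_punctuation(text):
    """Return the non-empty character strings found between punctuation marks."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


def local_substring_table(strings, min_words=2, max_words=3, min_count=2):
    """Count word n-grams across the strings of one document.

    Substrings that recur at least `min_count` times form the temporary
    auxiliary dictionary of candidate language blocks.
    """
    counts = Counter()
    for s in strings:
        words = s.split()
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {sub: c for sub, c in counts.items() if c >= min_count}


text = "rural tourism grows, rural tourism helps; world economy grows."
strings = split_by_punctuation(text)
table = local_substring_table(strings)
```

In this sample only "rural tourism" recurs, so it is the only candidate language block placed in the temporary auxiliary dictionary.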
Step 502: Invert the extracted language linear structures and language blocks respectively. Here, the inversion specifically includes: for each qualified language block, compressing the document number, paragraph number, sentence number, word order number, HTML information and the like of the place where the language block occurs into one structure, and placing the structure in the living document where the language block is located. A language block can be an arbitrary character string and mainly falls into the following categories: dictionary entries, proper names, the internal vocabulary of proper names, all kinds of phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, etc. Likewise, for each qualified language linear structure, the document number, paragraph number, sentence number, word order number, HTML information and the like of the place where the language linear structure occurs can be compressed into one structure and placed in the living document where the language block is located.
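The compression of position information into one structure can be sketched as below. The source does not specify the binary layout, so the field widths, the `html_flags` byte and the names `pack_hit`, `unpack_hit` and `add_occurrence` are assumptions chosen only to make the idea concrete.

```python
import struct
from collections import defaultdict

# Assumed layout (little-endian, no padding):
# document number (uint32), paragraph number (uint16), sentence number (uint16),
# word order number (uint16), HTML information flags (uint8).
HIT = struct.Struct("<IHHHB")


def pack_hit(doc, para, sent, pos, html_flags=0):
    """Compress one occurrence into a fixed-size binary structure."""
    return HIT.pack(doc, para, sent, pos, html_flags)


def unpack_hit(raw):
    """Recover (doc, para, sent, pos, html_flags) from a packed occurrence."""
    return HIT.unpack(raw)


# In-memory inverted lists: term -> list of packed occurrences.
inverted = defaultdict(list)


def add_occurrence(term, doc, para, sent, pos, html_flags=0):
    inverted[term].append(pack_hit(doc, para, sent, pos, html_flags))


add_occurrence("China", doc=12, para=3, sent=1, pos=5)
add_occurrence("x craze sweeps across x", doc=12, para=3, sent=1, pos=0)
first_hit = unpack_hit(inverted["China"][0])
```

With this assumed layout each occurrence takes 11 bytes, whether the term is a language block or a linear structure.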
Step 503: Create a language linear structure sub-index and a language block sub-index, and merge the language linear structure sub-index and the language block sub-index to form a whole index. Here, all language block index terms in memory are written to a language block vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a dictionary file. These three files constitute a complete, independent index run, namely the language block sub-index. Likewise, all linear structure index terms in memory are written to a linear structure vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a linear structure dictionary file. These three files constitute a complete, independent index run, namely the linear structure sub-index. Finally, the language linear structure sub-index and the language block sub-index are merged to form the whole index.
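The three-part layout of a sub-index (vocabulary, inv_lists, dictionary) and the final merge can be sketched in memory as follows. This is an assumption-laden illustration: the source writes these parts to files and does not say how the merge distinguishes the two kinds of terms, so the in-memory lists, the (offset, length) dictionary entries and the ("structure" / "block") key prefix are all hypothetical.

```python
def build_subindex(postings):
    """Build one index run from {term: [hit, ...]}.

    vocabulary: sorted list of index terms
    inv_lists:  all hits, concatenated term by term
    dictionary: term -> (offset, length) into inv_lists
    """
    vocabulary = sorted(postings)
    inv_lists, dictionary = [], {}
    for term in vocabulary:
        hits = sorted(postings[term])
        dictionary[term] = (len(inv_lists), len(hits))
        inv_lists.extend(hits)
    return {"vocabulary": vocabulary, "inv_lists": inv_lists, "dictionary": dictionary}


def lookup(subindex, term):
    """Return the hits recorded for `term`, or an empty list."""
    offset, length = subindex["dictionary"].get(term, (0, 0))
    return subindex["inv_lists"][offset:offset + length]


def merge_subindexes(structure_index, block_index):
    """Merge both sub-indexes into one whole index, keyed by (kind, term)."""
    merged = {}
    for kind, sub in (("structure", structure_index), ("block", block_index)):
        for term in sub["vocabulary"]:
            merged[(kind, term)] = lookup(sub, term)
    return build_subindex(merged)


# Hits are (document number, word order number) pairs in this sketch.
block_index = build_subindex({"Avatar": [(1, 0)], "China": [(1, 4), (7, 2)]})
structure_index = build_subindex({"x craze sweeps across x": [(4, 0), (1, 0)]})
whole_index = merge_subindexes(structure_index, block_index)
```

Keeping the kind in the key lets a single whole index answer both linear structure queries and language block queries.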
Step 504: Extract the language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index, feed back to the user information that matches the language linear structure and language blocks extracted from the user's retrieval input. Here, the linear structure and language blocks are first extracted from the user's retrieval input character string. For example, if the user inputs "I love eating the big apples produced in Yantai.", the language blocks "I" and "the big apples produced in Yantai" and the linear structure "x love eating x" (where x is a blank) are extracted; information matching the linear structure "x love eating x" and the language blocks "I" and "the big apples produced in Yantai" is then retrieved from the whole index and presented to the user in descending order of matching degree.
In one embodiment, the more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree is considered to be. In one embodiment, a language linear structure repetition weight and a language block repetition weight can also be set in advance; a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index is calculated based on the language linear structure repetition weight, and a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index is calculated based on the language block repetition weight; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
Feeding back to the user information that matches the language linear structure and language blocks extracted from the user's retrieval input can specifically include: retrieving the language linear structure and the language blocks of the input character string separately in the whole index, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string; and feeding back to the user the information associated with that corresponding language linear structure and those corresponding language blocks in the whole index.
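The weighted matching degree of Step 504 can be sketched as follows. The source does not give formulas for the two overlap indexes, so this sketch makes explicit assumptions: the first overlap index is the repetition weight times the number of characters the two linear structures share, the second is the repetition weight times the number of identical language blocks, and the weights 0.6 and 0.4 are arbitrary illustrative values.

```python
from collections import Counter


def repeated_chars(a, b):
    """Number of characters shared by a and b, counted with multiplicity."""
    return sum((Counter(a) & Counter(b)).values())


def match_degree(query, candidate, structure_weight=0.6, block_weight=0.4):
    """Sum of the first (structure) and second (block) overlap indexes."""
    q_structure, q_blocks = query
    c_structure, c_blocks = candidate
    first_overlap = structure_weight * repeated_chars(q_structure, c_structure)
    second_overlap = block_weight * len(set(q_blocks) & set(c_blocks))
    return first_overlap + second_overlap


def rank(query, candidates):
    """Order candidates from the highest matching degree to the lowest."""
    return sorted(candidates, key=lambda c: match_degree(query, c), reverse=True)


query = ("x love eating x", ["I", "the big apples produced in Yantai"])
candidates = [
    ("x craze sweeps across x", ["Avatar", "China"]),
    ("x love eating x", ["I", "pears"]),
]
ranked = rank(query, candidates)
```

Under these assumptions, the candidate with the identical linear structure and one shared language block is ranked ahead of the unrelated one.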
The flow of the present invention can be applied in a variety of practical applications, such as information retrieval and multilingual translation. When applied to multilingual translation, it is assumed that the user's retrieval input character string is stated in a first language. In this case, the language linear structure and language blocks stated in the first language are extracted from the user's retrieval input character string; the language linear structure and language blocks stated in a second language that correspond to the language linear structure and language blocks stated in the first language are then determined; and, according to the whole index, information that matches the language linear structure and language blocks stated in the second language, and that is likewise stated in the second language, is fed back to the user. The first language can be Chinese and the second language English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc. Alternatively, the first language is English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc., and the second language is Chinese.
For example, the user wishes to translate the Chinese sentence "I want to go to Shanghai" into English. In this case, the retrieval input character string entered by the user is "I want to go to Shanghai", stated in Chinese. First, the language linear structure stated in Chinese (i.e. "x want to go to x", where x is a blank) and the language blocks stated in Chinese (I, Shanghai) are extracted from the user's retrieval input character string. Then the language linear structure stated in English that corresponds to the language linear structure stated in Chinese (i.e. "x want to go to") is determined, and the language blocks stated in English that correspond to the language blocks stated in Chinese (i.e. I, Shanghai) are determined. Finally, the language blocks and the linear structure are combined into the translated sentence "I want to go to Shanghai", which is presented to the user. Furthermore, according to the whole index, information that matches the linear structure (x want to go to) and the language blocks (I, Shanghai) and that is stated in the second language can also be fed back to the user, so that the user can retrieve English information related to "I want to go to Shanghai".
In the above process, a high-performance single-pass in-memory inversion algorithm is applied by way of example, and no temporary disk files are generated. Therefore, before the in-memory content is output, the system incurs no file I/O overhead other than for MAP data. At the same time, index terms do not need to be numbered, and no sorting operation is performed on the index terms (numbers or in-memory string pointers). In addition, the method performs inversion using all available free physical memory. These properties ensure that the inversion method has excellent space and time efficiency and can support a series of efficient methods for dynamic index merging and index updating. An inverted index with these properties is also fully suitable for distributed processing.
Another key feature of the above process is that its search data structure has a caching function, which supports an almost arbitrarily large index vocabulary (i.e. vocabulary file). The vocabulary file itself is placed on disk, and the number of index terms it can hold is unlimited (in a 64-bit file system) and can reach several hundred million. With the caching function, an x64 server with 4 to 6 GB of memory can achieve index vocabulary query performance close to that of a cluster query system comprising multiple servers of the same or higher configuration. Moreover, index terms can be arbitrary character strings and mainly fall into the following categories (term categories): dictionary entries, proper names, the internal vocabulary of proper names, all kinds of phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, etc.
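The caching behaviour described above can be sketched as a bounded cache placed in front of a slow vocabulary store. The source does not describe the cache's policy or structure, so this is an assumption: a least-recently-used cache from the Python standard library, with a plain dictionary standing in for the on-disk vocabulary file and the class name `CachedVocabulary` chosen for illustration.

```python
from functools import lru_cache


class CachedVocabulary:
    """Vocabulary lookup with an LRU cache in front of a slow backing store."""

    def __init__(self, backing_store, cache_size=1024):
        # `backing_store` stands in for the on-disk vocabulary file (assumption).
        self._store = backing_store
        self.disk_reads = 0
        # Wrap the slow read so repeated lookups are served from memory.
        self._cached_read = lru_cache(maxsize=cache_size)(self._read)

    def _read(self, term):
        self.disk_reads += 1  # counts simulated disk accesses
        return self._store.get(term)

    def lookup(self, term):
        return self._cached_read(term)


vocab = CachedVocabulary({"China": 17, "Avatar": 3}, cache_size=2)
first = vocab.lookup("China")
second = vocab.lookup("China")  # served from the cache, no second disk read
```

Only the terms that are actually queried occupy memory, which is one way a very large on-disk vocabulary could be served from a machine with limited RAM.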
Based on the above analysis, an embodiment of the present invention also proposes a natural language processing device based on semantic recognition.
Fig. 6 is a structural diagram of the natural language processing device based on semantic recognition according to an embodiment of the present invention. The device can be applied to the cloud of Fig. 1.
As shown in Fig. 6, the device includes an extraction unit 601, an inversion unit 602, an indexing unit 603 and a matching information feedback unit 604, wherein:
The extraction unit 601 is used to segment chapter-level text into character strings using symbols, and to extract language linear structures and language blocks from the segmented character strings. Specifically, the extraction unit 601 first segments the chapter-level text (for example, an article or an editorial) into several character strings using symbols, and then extracts language linear structures and language blocks in turn from the segmented character strings (for the specific extraction steps, refer to the example analysis above). More specifically, the chapter-level text can be segmented into character strings according to punctuation marks such as the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, separation dot, title marks, proper-noun mark, annotation mark, concealment mark, missing-character mark, slash, identification mark, substitution mark, bead-chain mark and arrow mark. For example, the text between any two punctuation marks can be extracted as a character string (at the beginning of an article, only one punctuation mark is needed). When determining keywords (language blocks), a chapter-based local substring statistics table (hash table) can be used as a temporary auxiliary dictionary. That is, a string that appears in the temporary auxiliary dictionary can be determined to be a language block. However, a string that does not appear in the local substring statistics table can also be determined to be a language block. A segmentation route tree based on multi-path planning can also be used as the segmentation model: character encodings such as English (ASCII), Simplified Chinese (GBK/GB18030) and Traditional Chinese (Taiwan BIG5, Hong Kong BIG5-HKSCS) are first uniformly converted to the UTF-8 encoding format, segmentation is then performed, and language blocks are extracted on the basis of multiple correct segmentation results. After the language blocks have been extracted, the remaining part is the linear structure.
The inversion unit 602 is used to invert the extracted language linear structures and language blocks respectively. Specifically, for each qualified language block, the inversion unit 602 compresses the document number, paragraph number, sentence number, word order number, HTML information and the like of the place where the language block occurs into one structure, and places the structure in the living document where the language block is located. A language block can be an arbitrary character string and mainly falls into the following categories: dictionary entries, proper names, the internal vocabulary of proper names, all kinds of phrases/collocations, n-grams, consecutive stopwords, word+number combinations, arbitrary ASCII strings, postal codes, telephone numbers, etc. Likewise, for each qualified language linear structure, the inversion unit 602 can compress the document number, paragraph number, sentence number, word order number, HTML information and the like of the place where the language linear structure occurs into one structure, and place the structure in the living document where the language block is located.
The indexing unit 603 is used to create a language linear structure sub-index and a language block sub-index, and to merge the language linear structure sub-index and the language block sub-index to form a whole index. Specifically, the indexing unit 603 writes all language block index terms in memory to a vocabulary file, merges the inverted hits and writes them to an inv_lists file, and writes the association information between the two to a dictionary file. These three files constitute a complete, independent index run, namely the language block sub-index. Likewise, all linear structure index terms in memory are written to a vocabulary file, the inverted hits are merged and written to an inv_lists file, and the association information between the two is written to a dictionary file. These three files constitute a complete, independent index run, namely the linear structure sub-index.
Finally, the indexing unit 603 merges the language linear structure sub-index and the language block sub-index to form the whole index.
The matching information feedback unit 604 is used to extract the language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index, to feed back to the user information that matches the language linear structure and language blocks extracted from the user's retrieval input. In one embodiment, the matching information feedback unit 604 is used to feed back to the user, in descending order of the matching degree of the language linear structure and language blocks, information that matches the language linear structure and language blocks extracted from the user's retrieval input. Moreover, the more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree.
In one embodiment, the matching information feedback unit 604 is further used to set a language linear structure repetition weight and a language block repetition weight in advance; to calculate, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index; and to calculate, based on the language block repetition weight, a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index; wherein the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
In one embodiment, the matching information feedback unit 604 is used to retrieve the language linear structure and the language blocks of the input character string separately in the whole index, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string, and to feed back to the user the information associated with that corresponding language linear structure and those corresponding language blocks in the whole index.
In one embodiment, the user's retrieval input character string is stated in a first language. In this case, the matching information feedback unit 604 is used to extract, from the user's retrieval input character string, the language linear structure and language blocks stated in the first language; to determine the language linear structure and language blocks stated in a second language that correspond to the language linear structure and language blocks stated in the first language; and to feed back to the user, according to the whole index, information that matches the language linear structure and language blocks stated in the second language and that is likewise stated in the second language.
Based on the above detailed description, an embodiment of the present invention also proposes a natural language processing system based on semantic recognition. Fig. 7 is a structural diagram of the natural language processing system based on semantic recognition according to an embodiment of the present invention. As shown in Fig. 7, the system includes an information collection device 301, a data storage device 302, a natural language processing device 303, an index storage device 304 and a retrieval service device 305, wherein:
the information collection device 301 is used to scan and probe the internet and crawl information from the internet;
the data storage device 302 is used to store the internet information crawled by the information collection device, and preferably provides fast location and lookup of the internet information;
the natural language processing device 303 is used to segment the chapter-level text stored in the data storage device 302 into character strings using symbols, and to extract language linear structures and language blocks from the segmented character strings; to invert the extracted language linear structures and language blocks respectively; and to create a language linear structure sub-index and a language block sub-index and merge the two to form a whole index;
the index storage device 304 is used to store the whole index generated by the natural language processing device 303;
the retrieval service device 305 is used to extract the language linear structure and language blocks from the user's retrieval input character string, and, according to the whole index stored in the index storage device, to feed back to the user information that matches the language linear structure and language blocks extracted from the user's retrieval input.
The information collection device 301 may further receive uploaded information (such as news resources) provided by newspapers and periodicals, radio and television, and other media members. Moreover, the retrieval service device 305 can let ordinary users query news free of charge, and can open high-end services to professional users after they register and pay.
Preferably, the natural language processing device 303 is used to segment the chapter-level text into character strings according to the full stop, question mark, exclamation mark, comma, enumeration comma, semicolon, colon, quotation marks, brackets, dash, ellipsis, emphasis mark, hyphen, separation dot, title marks, proper-noun mark, annotation mark, concealment mark, missing-character mark, slash, identification mark, substitution mark, bead-chain mark and arrow mark. Preferably, the natural language processing device 303 is used to use a chapter-based local substring statistics table as a temporary auxiliary dictionary, to use a segmentation route tree based on multi-path planning as the segmentation model, to uniformly convert the character encoding of the chapter-level text to the UTF-8 encoding format, and to segment the chapter-level text converted to UTF-8 into character strings using symbols.
Moreover, the retrieval service device 305 can be used to feed back to the user, in descending order of matching degree, information that matches the language linear structure and language blocks extracted from the user's retrieval input. In one embodiment, the retrieval service device 305 is used to feed back this information in descending order of the matching degree of the language linear structure and language blocks. Preferably, the more words the language linear structure extracted from the user's retrieval input has in common with a language linear structure in the whole index, the higher the matching degree.
In one embodiment, the retrieval service device 305 is further used to set a language linear structure repetition weight and a language block repetition weight in advance; to calculate, based on the language linear structure repetition weight, a first overlap index between the language linear structure extracted from the user's retrieval input and a language linear structure in the whole index; and to calculate, based on the language block repetition weight, a second overlap index between the language blocks extracted from the user's retrieval input and the language blocks in the whole index; wherein the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
In one embodiment, the retrieval service device 305 is used to retrieve the language linear structure and the language blocks of the input character string separately in the whole index, so as to determine the language linear structure in the whole index that corresponds to the language linear structure of the input character string and the language blocks in the whole index that correspond to the language blocks of the input character string, and to feed back to the user the information associated with that corresponding language linear structure and those corresponding language blocks in the whole index.
In one embodiment, the retrieval service device 305 is used to extract, from the user's retrieval input character string, the language linear structure and language blocks stated in a first language; to determine the language linear structure and language blocks stated in a second language that correspond to the language linear structure and language blocks stated in the first language; and to feed back to the user, according to the whole index, information that matches the language linear structure and language blocks stated in the second language and that is likewise stated in the second language. Optionally, the first language is English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc., and the second language is Chinese. The first language can also be Chinese, with the second language being English, Japanese, Korean, Arabic, Spanish, Portuguese, French, Russian, etc.
In conclusion in embodiments of the present invention, shooting word content, and shooting photo is sent to high in the clouds;High in the clouds pair
The shooting photo executes optical character identification so that the shooting photo is converted to word content, is examined based on the word content
Rope related content, and push the retrieval result for including the related content.After embodiment of the present invention, use can be facilitated
Family quick-searching, and play relevant multimedia messages.As it can be seen that embodiment of the present invention, which can be based on identification content, executes inspection
Rope is convenient for user's quick search.
Moreover, user can also utilize access entry to play relevant multimedia messages, it is user-friendly.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (10)
1. An information retrieval method based on text recognition, characterized in that the method comprises:
photographing text content, and sending the captured photo to the cloud;
the cloud performing optical character recognition on the captured photo to convert the captured photo into text content, retrieving related content based on the text content, and pushing a retrieval result containing the related content.
2. The information retrieval method based on text recognition according to claim 1, characterized in that photographing text content and sending the captured photo to the cloud comprises: photographing the text content using a wearable device, and sending the captured photo to the cloud using the wearable device;
retrieving related content based on the text content and pushing a retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the wearable device;
the method further comprises:
the wearable device playing the audio file or video file.
3. The information retrieval method based on text recognition according to claim 1, characterized in that capturing the text content and sending the captured photo to the cloud is: capturing the text content with a mobile terminal, and sending the captured photo to the cloud with the mobile terminal;
retrieving related content based on the text content and pushing the retrieval result containing the related content comprises: the cloud retrieving a related audio file or video file based on the text content, and sending the audio file or video file to the mobile terminal;
the method further comprises:
the mobile terminal playing the audio file or video file.
4. The information retrieval method based on text recognition according to claim 1, characterized in that the method further comprises, in advance:
the cloud splitting chapter-level text into character strings using symbols as delimiters, extracting language linear structures and language blocks from the resulting character strings, arranging the extracted language linear structures and language blocks respectively, creating a language linear structure sub-index and a language block sub-index, and merging the language linear structure sub-index and the language block sub-index to form a whole index;
retrieving related content based on the text content and pushing the retrieval result containing the related content comprises:
extracting the language linear structures and language blocks of the text content, and pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content.
5. The information retrieval method based on text recognition according to claim 4, characterized in that splitting the chapter-level text into character strings using symbols as delimiters comprises:
using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text into the UTF-8 encoding format; and splitting the chapter-level text, after conversion into the UTF-8 encoding format, into character strings using symbols as delimiters;
pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content comprises: pushing the information that matches the language linear structures and language blocks extracted from the text content in descending order of the matching degree of the language linear structures and language blocks, wherein the more characters a language linear structure extracted from the text content has in common with a language linear structure in the whole index, the higher the matching degree.
6. The information retrieval method based on text recognition according to claim 5, characterized in that
the method further comprises: presetting a language linear structure repetition weight and a language block repetition weight;
calculating, based on the language linear structure repetition weight, a first overlap index between the language linear structures extracted from the text content and the language linear structures in the whole index, and calculating, based on the language block repetition weight, a second overlap index between the language blocks extracted from the text content and the language blocks in the whole index; the higher the sum of the first overlap index and the second overlap index, the higher the matching degree.
7. An information retrieval system based on text recognition, characterized in that the system comprises:
a capturing device, configured to capture text content and send the captured photo to the cloud;
a retrieval system, located in the cloud, configured to perform optical character recognition on the captured photo to convert the captured photo into text content, retrieve related content based on the text content, and push to the capturing device a retrieval result containing the related content.
8. The information retrieval system based on text recognition according to claim 7, characterized in that
the capturing device is a wearable device;
the retrieval system is configured to retrieve a related audio file or video file based on the text content, and send the audio file or video file to the wearable device;
the wearable device is further configured to play the audio file or video file.
9. The information retrieval system based on text recognition according to claim 7, characterized in that
the capturing device is a mobile terminal;
the retrieval system is configured to retrieve a related audio file or video file based on the text content, and send the audio file or video file to the mobile terminal;
the mobile terminal is further configured to play the audio file or video file.
10. The information retrieval system based on text recognition according to claim 7, characterized in that
the retrieval system is further configured to, in advance, split chapter-level text into character strings using symbols as delimiters, extract language linear structures and language blocks from the resulting character strings, arrange the extracted language linear structures and language blocks respectively, create a language linear structure sub-index and a language block sub-index, and merge the language linear structure sub-index and the language block sub-index to form a whole index; retrieving related content based on the text content and pushing the retrieval result containing the related content comprises: extracting the language linear structures and language blocks of the text content, and pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content;
wherein splitting the chapter-level text into character strings using symbols as delimiters comprises:
using a chapter-based local substring statistics table as a temporary auxiliary dictionary and a segmentation path tree based on multi-path planning as the segmentation model, uniformly converting the character encoding of the chapter-level text into the UTF-8 encoding format; and splitting the chapter-level text, after conversion into the UTF-8 encoding format, into character strings using symbols as delimiters;
pushing, according to the whole index, information that matches the language linear structures and language blocks extracted from the text content comprises: pushing the information that matches the language linear structures and language blocks extracted from the text content in descending order of the matching degree of the language linear structures and language blocks, wherein the more characters a language linear structure extracted from the text content has in common with a language linear structure in the whole index, the higher the matching degree.
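The indexing and ranking steps recited in claims 4 to 6 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the claimed implementation: every punctuation-delimited string is treated as a "language linear structure" and every whitespace-delimited word as a "language block", the statistics-table dictionary and multi-path segmentation tree are omitted, the repeated-character count is taken as a multiset intersection of characters, and the names `LINEAR_WEIGHT`, `BLOCK_WEIGHT`, `build_whole_index` and `rank` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical preset repetition weights (the claims do not give values).
LINEAR_WEIGHT = 2.0  # weight for language linear structure overlap
BLOCK_WEIGHT = 1.0   # weight for language block overlap


def split_by_symbols(text):
    """Split text into character strings at common punctuation marks."""
    parts = re.split(r"[,.;:!?，。；：！？]", text)
    return [p.strip() for p in parts if p.strip()]


def extract_features(text):
    """Toy extraction: strings act as linear structures, words as blocks."""
    strings = split_by_symbols(text)
    linear = set(strings)
    blocks = {word for s in strings for word in s.split()}
    return linear, blocks


def build_whole_index(documents):
    """Build the two sub-indexes from {doc_id: text} and merge them."""
    linear_index = defaultdict(set)
    block_index = defaultdict(set)
    for doc_id, text in documents.items():
        linear, blocks = extract_features(text)
        for s in linear:
            linear_index[s].add(doc_id)
        for b in blocks:
            block_index[b].add(doc_id)
    return {"linear": dict(linear_index), "block": dict(block_index)}


def repeated_chars(a, b):
    """Count characters shared by two strings, with multiplicity, ignoring spaces."""
    ca = Counter(a.replace(" ", ""))
    cb = Counter(b.replace(" ", ""))
    return sum((ca & cb).values())


def rank(query_text, index):
    """Return doc ids in descending order of matching degree."""
    q_linear, q_blocks = extract_features(query_text)
    scores = defaultdict(float)
    # First overlap index: weighted repeated characters between linear structures.
    for structure, doc_ids in index["linear"].items():
        overlap = sum(repeated_chars(q, structure) for q in q_linear)
        for doc_id in doc_ids:
            scores[doc_id] += LINEAR_WEIGHT * overlap
    # Second overlap index: weighted count of shared language blocks.
    for block in q_blocks:
        for doc_id in index["block"].get(block, ()):
            scores[doc_id] += BLOCK_WEIGHT
    # Matching degree is the sum of both indexes; ties are broken by doc id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": "the bright moon rises, frost on the ground",
    "b": "the river flows east",
}
whole_index = build_whole_index(docs)
ranking = rank("bright moon rises", whole_index)
```

In this example document `a` ranks first because it shares both more characters and more words with the query. Summing character overlap over all stored structures favors long documents, so a practical system would likely normalize the score; that refinement is outside what the claims state.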
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710251901.1A CN108733687A (en) | 2017-04-18 | 2017-04-18 | A kind of information retrieval method and system based on Text region |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710251901.1A CN108733687A (en) | 2017-04-18 | 2017-04-18 | A kind of information retrieval method and system based on Text region |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108733687A (en) | 2018-11-02 |
Family
ID=63924739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710251901.1A Pending CN108733687A (en) | 2017-04-18 | 2017-04-18 | A kind of information retrieval method and system based on Text region |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733687A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101668071A (en) * | 2009-08-14 | 2010-03-10 | 惠州Tcl移动通信有限公司 | Mobile communication terminal with scanning function and implement method thereof |
CN102789464A (en) * | 2011-05-20 | 2012-11-21 | 陈伯妤 | Natural language processing method, device and system based on semanteme recognition |
CN104199834A (en) * | 2014-08-04 | 2014-12-10 | 徐�明 | Method and system for interactively obtaining and outputting remote resources on surface of information carrier |
CN104217197A (en) * | 2014-08-27 | 2014-12-17 | 华南理工大学 | Touch reading method and device based on visual gestures |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111489595A (en) * | 2020-04-16 | 2020-08-04 | 广东小天才科技有限公司 | Method and device for feeding back test information in live broadcast teaching process |
CN111489596A (en) * | 2020-04-16 | 2020-08-04 | 广东小天才科技有限公司 | Method and device for information feedback in live broadcast teaching process |
CN111489595B (en) * | 2020-04-16 | 2022-06-24 | 广东小天才科技有限公司 | Method and device for feeding back test information in live broadcast teaching process |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5746286B2 (en) | High-performance data metatagging and data indexing method and system using a coprocessor | |
CN101452470B (en) | Summary-style network search engine system and search method and uses | |
Huston et al. | Evaluating verbose query processing techniques | |
US8577882B2 (en) | Method and system for searching multilingual documents | |
CN102789464B (en) | Natural language processing methods, devices and systems based on semantics identity | |
CN109902288A (en) | Intelligent clause analysis method, device, computer equipment and storage medium | |
US20080201314A1 (en) | Method and apparatus for using multiple channels of disseminated data content in responding to information requests | |
CA2774278A1 (en) | Methods and systems for extracting keyphrases from natural text for search engine indexing | |
Jabbar et al. | A survey on Urdu and Urdu like language stemmers and stemming techniques | |
AU2013290306A1 (en) | Weight-based stemming for improving search quality | |
CN101000611A (en) | Method for providing and inquiry information for public by interconnection network | |
CN110807326A (en) | Short text keyword extraction method combining GPU-DMM and text features | |
CN102117285B (en) | Search method based on semantic indexing | |
Subhashini et al. | Shallow NLP techniques for noun phrase extraction | |
WO2017000659A1 (en) | Enriched uniform resource locator (url) identification method and apparatus | |
CN108733687A (en) | A kind of information retrieval method and system based on Text region | |
Modi et al. | Multimodal web content mining to filter non-learning sites using NLP | |
CN113934910A (en) | Automatic optimization and updating theme library construction method and hot event real-time updating method | |
Li et al. | CLC-RS: a Chinese legal case retrieval system with masked language ranking | |
Chen et al. | Chinese named entity abbreviation generation using first-order logic | |
KR20020006223A (en) | Automatic Indexing Robot System And A Method | |
KR20130142192A (en) | Assistance for video content searches over a communication network | |
CN105488035A (en) | Conversational natural language processing method and device | |
Sheng et al. | Cross-Language Text Search Algorithm Based on Context-Compatible Algorithms | |
Zuo et al. | Cross-Genre Retrieval for Information Integrity: A COVID-19 Case Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181102 |