CN109002423A

CN109002423A - Text search method and device

Info

Publication number: CN109002423A
Application number: CN201710417318.3A
Authority: CN
Inventors: 刘珅珅
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2017-06-06
Filing date: 2017-06-06
Publication date: 2018-12-14

Abstract

The present invention provides a text search method and device. The method comprises: encoding a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched; dividing the second coding by a preset dimension to obtain multiple code sets; for each code set, performing a matching search in the code set according to the first coding to obtain a coding section that matches the first coding; and obtaining a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched. By encoding the keyword and the text to be searched in a unified way and searching in units of the preset dimension, the present invention improves search efficiency and reduces search errors.

Description

Text search method and device

Technical field

The present invention relates to the field of text processing, and more particularly to a text search method and device.

Background art

Data search is a key function in a document reader: it helps users efficiently obtain the content and information they need. Existing text search techniques search by comparing text, but because the encoding of the text is not unified they are inefficient, and because they cannot distinguish paragraphs they are prone to search errors.

Summary of the invention

The present invention provides a text search method and device which, by encoding the keyword and the text to be searched in a unified way and searching in units of a preset dimension, improve search efficiency and reduce search errors.

One aspect of the present invention provides a text search method, comprising: encoding a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched; dividing the second coding by a preset dimension to obtain multiple code sets; for each code set, performing a matching search in the code set according to the first coding to obtain a coding section that matches the first coding; and obtaining a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

Another aspect of the present invention provides a text search device, comprising:

a conversion module, configured to encode a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched;

a division module, configured to divide the second coding by a preset dimension to obtain multiple code sets;

a search module, configured to, for each code set, perform a matching search in the code set according to the first coding to obtain a coding section that matches the first coding;

an obtaining module, configured to obtain a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

The present invention provides a text search method and device. By encoding the search keyword and the text to be searched in a unified way, and searching in units of a preset dimension as needed, they improve the efficiency of text search and reduce search errors.

Brief description of the drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a text search method provided by Embodiment 1 of the present invention;

Fig. 2 is a schematic flowchart of another text search method provided by Embodiment 1 of the present invention;

Fig. 3 is a schematic flowchart of yet another text search method provided by Embodiment 1 of the present invention;

Fig. 4 is a schematic structural diagram of a text search device provided by Embodiment 2 of the present invention;

Fig. 5 is a schematic structural diagram of another text search device provided by Embodiment 2 of the present invention.

Detailed description of the embodiments

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 is a schematic flowchart of a text search method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises:

101. Encode a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched.

Specifically, the coding modes of the keyword and of the text to be searched can be learned from the platform on which the text to be searched resides; common ones are GBK, UTF-8 and the like. The keyword and the text to be searched are then encoded using the same cross-platform standard character set.

102. Divide the second coding by a preset dimension to obtain multiple code sets.

The dimension can be determined according to the required search precision. For example, the page can be used as the dimension, in which case each page is searched separately; alternatively, the paragraph can be used as the dimension, in which case each paragraph is searched separately. Specifically, the user can define the preset dimension according to the hardware condition of the platform, the user's own needs and the number of pages of the text to be searched. For example, when the hardware performance of the platform is poor, the text to be searched has few pages, and partial search results are wanted as early as possible, the preset dimension can be defined as the paragraph.
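The division in step 102 can be illustrated with a minimal sketch. It assumes the paragraph is the preset dimension and that paragraphs are separated by a newline; the function name `divide_into_code_sets` and the `delimiter` parameter are hypothetical and do not come from the patent. A page dimension could be handled the same way with a different separator.

```python
def divide_into_code_sets(second_coding, delimiter="\n"):
    """Split the second coding into code sets, one per paragraph.

    Returns (offset, code_set) pairs, where offset is the index of the
    code set's first character in the undivided second coding.
    The newline delimiter is an assumption made for illustration.
    """
    code_sets = []
    offset = 0
    for part in second_coding.split(delimiter):
        if part:  # skip empty paragraphs
            code_sets.append((offset, part))
        # Advance past this part and the delimiter that followed it.
        offset += len(part) + len(delimiter)
    return code_sets
```

Keeping the offset of each code set makes it possible to map a match inside a code set back to a position in the whole text.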

103. For each code set, perform a matching search in the code set according to the first coding to obtain a coding section that matches the first coding.

Specifically, the total number of characters in each code set should be greater than or equal to the total number of characters in the first coding.

104. Obtain a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

Specifically, from the positions of the starting character and the termination character of the coding section in the text to be searched, the coordinates in the text to be searched of the starting text and the ending text of the text that matches the keyword can be obtained.

Taking an actual scenario as an example: suppose the current search term is "文本" ("text"), and the text to be searched contains two adjacent paragraphs, the first ending with "...上下文" ("...context") and the second beginning with "本文..." ("herein..."). With this solution, for example when the search is carried out with the paragraph as the dimension, the matching search is performed within the text of each paragraph, which avoids returning the "文" at the end of the first paragraph together with the "本" at the start of the second paragraph as a search result, thereby improving the accuracy and reliability of the text search.
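The boundary effect described in this scenario can be reproduced with a small sketch. The strings below are made-up placeholders (keyword "ab", paragraphs "xxa" and "bxx"), and both function names are hypothetical; `str.find` stands in for whatever matching search is actually used, and only the first hit per paragraph is reported.

```python
def search_whole_text(keyword, paragraphs):
    """Search the paragraphs joined together, ignoring paragraph boundaries."""
    return "".join(paragraphs).find(keyword)


def search_per_paragraph(keyword, paragraphs):
    """Search each paragraph separately; return (paragraph_index, position) hits."""
    hits = []
    for index, paragraph in enumerate(paragraphs):
        position = paragraph.find(keyword)
        if position != -1:
            hits.append((index, position))
    return hits


# Placeholder data: the keyword only appears across the paragraph boundary.
paragraphs = ["xxa", "bxx"]
spurious = search_whole_text("ab", paragraphs)     # finds a false hit at index 2
per_paragraph = search_per_paragraph("ab", paragraphs)  # finds nothing
```

Searching the joined text reports a hit that straddles the two paragraphs, while searching paragraph by paragraph does not.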

To improve the reliability of the text search, further, on the basis of Embodiment 1 of the present invention, Fig. 2 is a schematic flowchart of another text search method provided by Embodiment 1. As shown in Fig. 2, step 101 specifically comprises:

201. According to the coding mode of the keyword, convert the keyword into a corresponding first char-type string; perform Unicode encoding on the first char-type string to obtain the first coding.

202. According to the coding mode of the text to be searched, convert the text to be searched into a corresponding second char-type string; perform Unicode encoding on the second char-type string to obtain the second coding.

Specifically, in order to eliminate the problems that different coding modes cause for text search, the keyword and the text to be searched are each converted into a corresponding char-type string, and the resulting char-type strings are then encoded according to the standard Unicode code set, yielding the first coding corresponding to the keyword and the second coding corresponding to the text to be searched. This improves the reliability of searching text under different coding modes.
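As a rough illustration of steps 201 and 202, the sketch below decodes byte strings from their source coding mode and represents them as lists of Unicode code points. The function name `to_unified_coding` is hypothetical, and representing a coding as a list of code points is an assumption made for the example; the patent does not prescribe a data structure.

```python
def to_unified_coding(raw_bytes, coding_mode):
    """Decode bytes from their source coding mode and return Unicode code points."""
    text = raw_bytes.decode(coding_mode)  # e.g. "gbk" or "utf-8"
    return [ord(ch) for ch in text]


# The same keyword stored under two coding modes has different bytes...
keyword_gbk = "文本".encode("gbk")
keyword_utf8 = "文本".encode("utf-8")

# ...but the same unified coding once converted.
first_coding_a = to_unified_coding(keyword_gbk, "gbk")
first_coding_b = to_unified_coding(keyword_utf8, "utf-8")
```

Once both the keyword and the text are expressed as code points, they can be compared directly regardless of how they were originally stored.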

To further improve the efficiency of the text search, on the basis of Embodiment 1 of the present invention, Fig. 3 is a schematic flowchart of yet another text search method provided by Embodiment 1. As shown in Fig. 3, step 103 specifically comprises:

301. Take the starting character of the code set as the current starting point; compare the first coding with a coding to be compared, where the coding to be compared is the character string formed by the current starting point and the N-1 consecutive characters after it, and N is the length of the first coding.

Specifically, the coding to be compared has the same length as the first coding, and N is a positive integer greater than or equal to 1.

302. If they are consistent, determine that the coding to be compared belongs to a coding section that matches the first coding, take the character adjacent to and following the coding to be compared as the current starting point, and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared.

Specifically, the comparison checks whether the Unicode codes corresponding to the first coding and to the coding to be compared are identical.

303. If they are inconsistent, detect whether the first character adjacent to and following the current starting point belongs to the first coding. If it does, take the M-th character before the first character as the current starting point and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared, where M is the length between the character in the first coding that matches the first character and the starting character of the first coding. If it does not, take the character adjacent to and following the current starting point as the current starting point and return to the step of detecting whether the first character adjacent to and following the current starting point belongs to the first coding, until all characters in the code set have been compared.
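Steps 301 to 303 can be read in more than one way. The sketch below follows one reading, in which the "first character" examined after a mismatch is the character immediately after the coding to be compared (the wording used for the processing unit in Embodiment 2), and in which M is taken from the rightmost occurrence of that character in the first coding so that no match is skipped. Both choices are assumptions, as is the function name `find_coding_sections`; this is an illustrative sketch, not the patented implementation.

```python
def find_coding_sections(first_coding, code_set):
    """Return (start, end) index pairs of non-overlapping matches in code_set."""
    n = len(first_coding)
    # Assumption: M is the index of the rightmost occurrence of a character.
    last_index = {ch: i for i, ch in enumerate(first_coding)}
    sections = []
    start = 0
    while n > 0 and start + n <= len(code_set):
        if code_set[start:start + n] == first_coding:
            # Step 302: record the match and continue right after it.
            sections.append((start, start + n - 1))
            start += n
            continue
        # Step 303: look at the character just past the compared window,
        # moving forward while that character is not in the first coding.
        nxt = start + n
        while nxt < len(code_set) and code_set[nxt] not in last_index:
            nxt += 1
        if nxt >= len(code_set):
            break  # no remaining character can be part of a match
        # Align the matching character of the first coding with code_set[nxt].
        start = nxt - last_index[code_set[nxt]]
    return sections


print(find_coding_sections("EDH", "EDHCFEDH"))  # [(0, 2), (5, 7)]
```

Because the new starting point is always at least one position past the old one, the loop terminates, and a window is only skipped when it contains a character that does not occur in the first coding or would misalign the examined character.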

Step 104 specifically comprises:

304. Detect whether the starting character and the termination character of the coding section are in the same row of the text to be searched. If so, take the text between the position of the starting character and the position of the termination character as the search result. If not, take the text from the position of the starting character to the end position of the row containing the starting character, together with the text from the start position of the row containing the termination character to the position of the termination character, as the search result.

Specifically, let the coordinates of the starting character and the termination character in the text to be searched be (X1, Y1) and (X2, Y2), and let W be the width of the text to be searched. Comparing the ordinate Y1 of the starting character with the ordinate Y2 of the termination character shows whether all characters of the coding section are in the same row. If they are, the text whose coordinates lie between (X1, Y1) and (X2, Y2) is taken as the search result. If they are not, the text whose coordinates lie between (X1, Y1) and (W, Y1), plus the text whose coordinates lie between (0, Y2) and (X2, Y2), is taken as the search result.
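The coordinate rule of step 304 can be written down directly. In the sketch below, the function name `result_spans` and the representation of a span as a pair of (x, y) points are hypothetical; like the text above, it only covers the row of the starting character and the row of the termination character.

```python
def result_spans(start, end, width):
    """Return the spans that make up the search result.

    start is (X1, Y1), end is (X2, Y2), width is W.
    Each span is a pair of (x, y) points lying on a single row.
    """
    (x1, y1), (x2, y2) = start, end
    if y1 == y2:
        # Same row: one span from the starting to the termination character.
        return [((x1, y1), (x2, y2))]
    # Different rows: the rest of the starting row, plus the beginning
    # of the termination row up to the termination character.
    return [((x1, y1), (width, y1)), ((0, y2), (x2, y2))]
```

A caller could then pass each span to whatever routine extracts or marks the text on that row.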

Taking an actual scenario as an example: suppose the current first coding is "EDH" and the code set is "EDHCFEDH". The starting character "E" of the code set is taken as the current starting point, and the coding to be compared is "EDH". The step of comparing the first coding with the coding to be compared is executed; the result is consistent, so the coding to be compared "EDH" is determined to belong to a coding section that matches the first coding "EDH", and the adjacent character "C" after the coding to be compared is taken as the current starting point. The method returns to the step of comparing the first coding with the coding to be compared; "CFE" in the code set is now the coding to be compared, and the comparison result is inconsistent. It is then detected whether the first character "F" adjacent to and following the current starting point belongs to the first coding; it does not, so the first character "F" is taken as the current starting point. The method returns to the step of detecting whether the first character adjacent to and following the current starting point belongs to the first coding, until the result is that it does, at which point the first character is "D". The length between the character "D" and the starting character in the first coding is calculated to be 1, and the 1st character before the first character, "E", is taken as the current starting point. The method then returns to the step of comparing the first coding with the coding to be compared, and the result is consistent. Once all characters in the code set have been compared, two "EDH" coding sections have been obtained in total. It is then detected, for each "EDH" coding section, whether its starting character "E" and termination character "H" are in the same row of the text to be searched "EDHCFEDH". If so, the text from the starting character "E" to the termination character "H" is highlighted. If not, the text from the starting character "E" to the last character of its row, and the text from the first character of the next row to the termination character "H", are highlighted.

In order to make the search result stand out in the text to be searched so that the user can view it, further, on the basis of any of the foregoing implementations, after step 104 the method may also comprise:

305. Perform highlight processing on the search result.

Specifically, the user can set highlight attributes such as color, transparency and font as needed.

This embodiment provides a text search method in which the search keyword and the text to be searched are encoded in a unified way and the user can search in units of a preset dimension as needed, which improves the efficiency of text search and reduces search errors.

Fig. 4 is a schematic structural diagram of a text search device provided by Embodiment 2 of the present invention. As shown in Fig. 4, the device comprises:

a conversion module 41, configured to encode a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched;

a division module 42, configured to divide the second coding by a preset dimension to obtain multiple code sets;

a search module 43, configured to, for each code set, perform a matching search in the code set according to the first coding to obtain a coding section that matches the first coding;

an obtaining module 44, configured to obtain a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

Specifically, the conversion module 41 can learn the coding modes of the keyword and of the text to be searched from the platform on which the text to be searched resides, then encode the keyword and the text to be searched using the same cross-platform standard character set, and send the resulting codings to the division module 42. In the division module 42, the preset dimension can be determined according to the search precision, the total number of characters in each code set should be greater than or equal to the total number of characters in the first coding, and the obtained code sets are sent to the search module 43. In the obtaining module 44, from the positions of the starting character and the termination character of the coding section in the text to be searched, the coordinates in the text to be searched of the starting text and the ending text of the text that matches the keyword can be obtained.
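The data flow between modules 41 to 44 can be sketched as a single class. Everything here is a hypothetical arrangement for illustration: the class name `TextSearchDevice`, the newline paragraph separator, the use of character indices instead of page coordinates, and the use of `str.find` as a stand-in for the matching search are all assumptions.

```python
class TextSearchDevice:
    """Hypothetical composition of the four modules of Embodiment 2."""

    def __init__(self, delimiter="\n"):
        self.delimiter = delimiter  # assumed paragraph separator

    def convert(self, raw_bytes, coding_mode):
        # Conversion module 41: decode to a Unicode string.
        return raw_bytes.decode(coding_mode)

    def divide(self, second_coding):
        # Division module 42: yield (offset, code_set) per paragraph.
        offset = 0
        for code_set in second_coding.split(self.delimiter):
            yield offset, code_set
            offset += len(code_set) + len(self.delimiter)

    def search(self, first_coding, code_set):
        # Search module 43: str.find stands in for the matching search.
        sections = []
        position = code_set.find(first_coding)
        while position != -1:
            sections.append((position, position + len(first_coding) - 1))
            position = code_set.find(first_coding, position + len(first_coding))
        return sections

    def run(self, keyword_bytes, keyword_mode, text_bytes, text_mode):
        first_coding = self.convert(keyword_bytes, keyword_mode)
        second_coding = self.convert(text_bytes, text_mode)
        results = []
        if not first_coding:
            return results
        for offset, code_set in self.divide(second_coding):
            if len(code_set) < len(first_coding):
                continue  # code set shorter than the first coding
            for start, end in self.search(first_coding, code_set):
                # Obtaining module 44: map positions back to the whole text.
                results.append((offset + start, offset + end))
        return results
```

For instance, `TextSearchDevice().run("ab".encode("utf-8"), "utf-8", "xab\nab".encode("gbk"), "gbk")` searches a GBK-encoded text with a UTF-8-encoded keyword and reports positions relative to the whole decoded text.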

On the basis of the embodiment shown in Fig. 4, Fig. 5 is a schematic structural diagram of another text search device provided by Embodiment 2 of the present invention. As shown in Fig. 5, the conversion module 41 comprises:

a first coding unit 411, configured to convert the keyword into a corresponding first char-type string according to the coding mode of the keyword;

a second coding unit 412, configured to perform Unicode encoding on the first char-type string to obtain the first coding;

the first coding unit 411 is further configured to convert the text to be searched into a corresponding second char-type string according to the coding mode of the text to be searched;

the second coding unit 412 is further configured to perform Unicode encoding on the second char-type string to obtain the second coding.

Specifically, in order to eliminate the problems that different coding modes cause for text search, the first coding unit 411 converts the keyword and the text to be searched into corresponding char-type strings and sends the resulting char-type strings to the second coding unit 412. The second coding unit 412 encodes them according to the standard Unicode code set, yielding the first coding corresponding to the keyword and the second coding corresponding to the text to be searched. This improves the reliability of searching text under different coding modes.

In practical applications, there are many ways to carry out the matching search. Optionally, on the basis of any of the foregoing implementations, the search module 43 comprises:

a selecting unit 431, configured to take the starting character of the code set as the current starting point;

a comparing unit 432, configured to compare the first coding with a coding to be compared, where the coding to be compared is the character string formed by the current starting point and the N-1 consecutive characters after it, and N is the length of the first coding;

a processing unit 433, configured to, if the comparison result is consistent, determine that the coding to be compared belongs to a coding section that matches the first coding, take the character adjacent to and following the coding to be compared as the current starting point, and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared. The processing unit is further configured to, if the comparison result is inconsistent, detect whether the first character adjacent to and following the current starting point belongs to the first coding; if it does, take the M-th character before the first character as the current starting point and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared, where M is the length between the character in the first coding that matches the first character and the starting character of the first coding; if it does not, take the character adjacent to and following the current starting point as the current starting point and return to the step of detecting whether the first character adjacent to and following the coding to be compared belongs to the first coding, until all characters in the code set have been compared.

Specifically, the selecting unit 431 obtains the first character of the code set and sets it as the current starting point. The comparing unit 432 obtains, as the coding to be compared, a character string of the same length as the first coding, compares it with the first coding, and sends the comparison result to the processing unit 433. When the comparison result obtained by the processing unit 433 is consistent, the processing unit takes the character adjacent to and following the coding to be compared as the current starting point and sends it to the comparing unit 432. When the comparison result obtained by the processing unit 433 is inconsistent, and the first character adjacent to and following the current starting point belongs to the first coding, the processing unit obtains the position of the new current starting point from the position of the first character and sends it to the comparing unit 432, until all characters in the code set have been compared.

In addition, in order to avoid inaccurate search results caused by the start position and the end position of a coding section obtained by the matching search not being in the same row, optionally, on the basis of any of the foregoing implementations, the obtaining module 44 comprises:

a detection unit 441, configured to detect whether the starting character and the termination character of the coding section obtained from the matching module are in the same row of the text to be searched;

an acquiring unit 442, configured to, if the detection result is yes, take the text between the position of the starting character and the position of the termination character as the search result;

the acquiring unit 442 is further configured to, if the detection result is no, take the text from the position of the starting character to the end position of the row containing the starting character, together with the text from the start position of the row containing the termination character to the position of the termination character, as the search result.

Specifically, the detection unit 441 judges, from the coordinates of the starting character and the termination character of the coding section in the text to be searched, whether the starting character and the termination character are in the same row, and sends the detection result to the acquiring unit 442. The acquiring unit 442 takes the corresponding text as the search result according to the detection result it receives.

Further optionally, on the basis of any of the foregoing implementations, the device further comprises:

a labeling module 45, configured to perform highlight processing on the search result obtained from the obtaining module.

Specifically, the labeling module 45 highlights the search result, and the attributes of the highlight processing can be set so as to optimize the user experience.

This embodiment provides a text search device with which the search keyword and the text to be searched are encoded in a unified way and the user can search in units of a preset dimension as needed, which improves the efficiency of text search and reduces search errors.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above can be found in the corresponding process of the foregoing method embodiments and is not repeated here.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical discs.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some or all of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A text search method, characterized by comprising:

encoding a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched;

dividing the second coding by a preset dimension to obtain multiple code sets;

for each code set, performing a matching search in the code set according to the first coding to obtain a coding section that matches the first coding;

obtaining a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

2. The method according to claim 1, characterized in that the encoding of the keyword to obtain the first coding corresponding to the keyword comprises:

converting the keyword into a corresponding first char-type string according to the coding mode of the keyword;

performing Unicode encoding on the first char-type string to obtain the first coding;

and the converting of the text to be searched into the corresponding second code set comprises:

converting the text to be searched into a corresponding second char-type string according to the coding mode of the text to be searched;

performing Unicode encoding on the second char-type string to obtain the second coding.

3. The method according to claim 1, characterized in that the performing, for each code set, of a matching search in the code set according to the first coding to obtain a coding section that matches the first coding comprises:

taking the starting character of the code set as the current starting point;

comparing the first coding with a coding to be compared, where the coding to be compared is the character string formed by the current starting point and the N-1 consecutive characters after it, and N is the length of the first coding;

if they are consistent, determining that the coding to be compared belongs to a coding section that matches the first coding, taking the character adjacent to and following the coding to be compared as the current starting point, and returning to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared;

if they are inconsistent, detecting whether the first character adjacent to and following the current starting point belongs to the first coding; if it does, taking the M-th character before the first character as the current starting point and returning to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared, where M is the length between the character in the first coding that matches the first character and the starting character of the first coding; if it does not, taking the character adjacent to and following the current starting point as the current starting point and returning to the step of detecting whether the first character adjacent to and following the current starting point belongs to the first coding, until all characters in the code set have been compared.

4. The method according to claim 1, characterized in that the obtaining of a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched comprises:

detecting whether the starting character and the termination character of the coding section are in the same row of the text to be searched;

if so, taking the text between the position of the starting character and the position of the termination character as the search result; if not, taking the text from the position of the starting character to the end position of the row containing the starting character, together with the text from the start position of the row containing the termination character to the position of the termination character, as the search result.

5. The method according to any one of claims 1-4, characterized in that, after the obtaining of a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched, the method further comprises:

performing highlight processing on the search result.

6. A text search device, characterized in that the device comprises:

a conversion module, configured to encode a keyword and a text to be searched to obtain a first coding corresponding to the keyword and a second coding corresponding to the text to be searched;

a search module, configured to, for each code set, perform a matching search in the code set according to the first coding to obtain a coding section that matches the first coding;

an obtaining module, configured to obtain a search result according to the positions of the starting character and the termination character of the coding section in the text to be searched.

7. The device according to claim 6, characterized in that the conversion module comprises:

a first coding unit, configured to convert the keyword into a corresponding first char-type string according to the coding mode of the keyword;

a second coding unit, configured to perform Unicode encoding on the first char-type string to obtain the first coding;

the first coding unit is further configured to convert the text to be searched into a corresponding second char-type string according to the coding mode of the text to be searched;

the second coding unit is further configured to perform Unicode encoding on the second char-type string to obtain the second coding.

8. The device according to claim 7, characterized in that the search module comprises:

a selecting unit, configured to take the starting character of the code set as the current starting point;

a comparing unit, configured to compare the first coding with a coding to be compared, where the coding to be compared is the character string formed by the current starting point and the N-1 consecutive characters after it, and N is the length of the first coding;

a processing unit, configured to, if the comparison result is consistent, determine that the coding to be compared belongs to a coding section that matches the first coding, take the character adjacent to and following the coding to be compared as the current starting point, and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared;

the processing unit is further configured to, if the comparison result is inconsistent, detect whether the first character adjacent to and following the current starting point belongs to the first coding; if it does, take the M-th character before the first character as the current starting point and return to the step of comparing the first coding with the coding to be compared, until all characters in the code set have been compared, where M is the length between the character in the first coding that matches the first character and the starting character of the first coding; if it does not, take the character adjacent to and following the current starting point as the current starting point and return to the step of detecting whether the first character adjacent to and following the coding to be compared belongs to the first coding, until all characters in the code set have been compared.

9. The device according to claim 8, characterized in that the obtaining module comprises:

a detection unit, configured to detect whether the starting character and the termination character of the coding section obtained from the matching module are in the same row of the text to be searched;

an acquiring unit, configured to, if the detection result is yes, take the text between the position of the starting character and the position of the termination character as the search result;

the acquiring unit is further configured to, if the detection result is no, take the text from the position of the starting character to the end position of the row containing the starting character, together with the text from the start position of the row containing the termination character to the position of the termination character, as the search result.

10. The device according to any one of claims 6-9, characterized in that the device further comprises:

a labeling module, configured to perform highlight processing on the search result obtained from the obtaining module.