CN110245215A - Text retrieval method and device - Google Patents

Publication number
CN110245215A
Authority
CN
China
Prior art keywords
text
component
material library
text component
inverted index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910488395.7A
Other languages
Chinese (zh)
Other versions
CN110245215B (en
Inventor
陈若田
刘弘一
熊军
李若鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910488395.7A
Publication of CN110245215A
Application granted
Publication of CN110245215B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a text retrieval method and device. The method operates on a text material library obtained in advance, which contains multiple texts for a specific business. A predetermined number of text components are preset for these texts, each text component corresponding to content of a respective type in the texts. The method comprises: obtaining user input; identifying the text components contained in the user input; and, based on the identified text components, retrieving multiple texts from the text material library through a pre-built inverted index table of the library that uses text components as retrieval keys.

Description

Text retrieval method and device
Technical field
Embodiments of this specification relate to the field of language processing technology, and more particularly to a text retrieval method and device.
Background
Copy (that is, a text written for a specific business) generally refers to the wording used to present a formulated business strategy. For example, marketing copy for a marketing business is a brief text that describes a marketing strategy and is delivered to users to drive growth. In the prior art, when writing copy, copywriters usually use a keyword search function to retrieve copy material from an existing library of historical copy as a reference, in order to improve the efficiency and quality of their writing. However, keyword search can only recall copy that literally matches the query. In particular, when the material library contains very little (or no) copy with the same keywords, very little (or no) copy may be recalled for reference.
Therefore, a more effective scheme for retrieving text is needed.
Summary of the invention
Embodiments of this specification aim to provide a more effective text retrieval scheme that addresses the deficiencies of the prior art.
To achieve the above object, one aspect of this specification provides a text retrieval method. The method operates on a text material library obtained in advance, which contains multiple texts for a specific business. A predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The method comprises:
obtaining user input;
identifying the text components contained in the user input; and
based on the identified text components, retrieving multiple texts from the text material library through a pre-built inverted index table of the library that uses text components as retrieval keys.
In one embodiment, identifying the text components contained in the user input comprises identifying them with a pre-trained sequence labeling model.
In one embodiment, retrieving multiple texts from the text material library based on the identified text components, through the pre-built inverted index table that uses text components as retrieval keys, comprises:
obtaining all non-empty subsets of the set of identified text components; and
for each subset, performing retrieval through the inverted index table to obtain a corresponding retrieval result, where the retrieval result for a subset is the intersection of the results obtained by retrieving with each text component in the subset as the retrieval key, and each retrieval result is a list of text identifiers of the corresponding texts.
In one embodiment, retrieving multiple texts from the text material library based on the identified text components, through the pre-built inverted index table that uses text components as retrieval keys, further comprises:
after the corresponding retrieval result has been obtained for each subset through the inverted index table, sorting all text identifiers contained in all retrieval results based on at least one of the following: the number of text components in the subset corresponding to each text identifier, and the similarity between the text corresponding to each text identifier and the user input.
In one embodiment, sorting all the text identifiers comprises first performing a first-level sort based on the number of text components in the subset corresponding to each text identifier, and then performing a second-level sort within each first-level group based on the similarity between the text corresponding to each text identifier and the user input.
In one embodiment, the method further comprises, after retrieving multiple texts from the text material library, displaying the retrieved texts to the user based on the sorting of all the text identifiers.
In one embodiment, displaying the retrieved texts to the user based on the sorting of all the text identifiers comprises, on the display page, in addition to displaying the retrieved texts based on the sorting, also displaying, mixed in with them, multiple texts obtained by keyword retrieval on the user input.
Another aspect of this specification provides a method for building an inverted index table of a text material library. The text material library contains multiple texts for a specific business, and a predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The method comprises:
for each text in the text material library, identifying the text components contained in the text; and
building the inverted index table of the text material library based on the text components contained in each text, where a first retrieval key of the inverted index table is a first text component among the predetermined number of text components, and the retrieval value corresponding to the first retrieval key is the text identifiers of the texts that contain the first text component.
In one embodiment, building the inverted index table based on the text components contained in each text comprises building it based on both the text components and the keywords contained in each text, where a second retrieval key of the inverted index table is a first keyword, and the retrieval value corresponding to the second retrieval key is the text identifiers of the texts that contain the first keyword. The first keyword is a keyword contained in the multiple texts, and in the inverted index table a predetermined marker indicates that the first retrieval key corresponds to a text component.
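One way to picture the predetermined marker is as a prefix that keeps component keys apart from keyword keys inside a single table. The Python sketch below is only an illustration under that assumption: the prefix string `COMPONENT_MARK`, the function names, and the sample data are all hypothetical and do not come from the specification.

```python
# Hypothetical marker prefix; the specification only says a predetermined marker is used.
COMPONENT_MARK = "__component__:"


def build_combined_index(docs):
    """docs maps a text identifier to (components, keywords).

    Returns one dict that holds both kinds of retrieval keys.
    """
    index = {}
    for text_id, (components, keywords) in docs.items():
        for component in components:
            # Component keys carry the marker so they cannot collide with keywords.
            index.setdefault(COMPONENT_MARK + component, set()).add(text_id)
        for keyword in keywords:
            index.setdefault(keyword, set()).add(text_id)
    return index


def lookup_component(index, component):
    """Return the text identifiers stored under a component key."""
    return index.get(COMPONENT_MARK + component, set())


def lookup_keyword(index, keyword):
    """Return the text identifiers stored under a keyword key."""
    return index.get(keyword, set())


# Made-up sample data: the word "brand" occurs both as a component type and as a keyword.
SAMPLE_DOCS = {
    4: ({"brand", "amount"}, {"alipay", "brand"}),
    5: ({"amount"}, {"brand"}),
}
SAMPLE_INDEX = build_combined_index(SAMPLE_DOCS)
```

In this sample, looking up the component "brand" and the keyword "brand" returns different identifier sets, which is the ambiguity the marker is meant to resolve.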
Another aspect of this specification provides a text retrieval device. The device is implemented based on a text material library obtained in advance, which contains multiple texts for a specific business. A predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The device comprises:
an acquiring unit, configured to obtain user input;
an identification unit, configured to identify the text components contained in the user input; and
a retrieval unit, configured to retrieve, based on the identified text components, multiple texts from the text material library through a pre-built inverted index table of the library that uses text components as retrieval keys.
In one embodiment, the identification unit is further configured to identify the text components contained in the user input with a pre-trained sequence labeling model.
In one embodiment, the retrieval unit further comprises:
an obtaining subunit, configured to obtain all non-empty subsets of the set of identified text components; and
a retrieval subunit, configured to perform, for each subset, retrieval through the inverted index table to obtain a corresponding retrieval result, where the retrieval result for a subset is the intersection of the results obtained by retrieving with each text component in the subset as the retrieval key, and each retrieval result is a list of text identifiers of the corresponding texts.
In one embodiment, the retrieval unit further comprises:
a sorting subunit, configured to sort, after the corresponding retrieval result has been obtained for each subset through the inverted index table, all text identifiers contained in all retrieval results based on at least one of the following: the number of text components in the subset corresponding to each text identifier, and the similarity between the text corresponding to each text identifier and the user input.
In one embodiment, the sorting subunit is further configured to first perform a first-level sort on all the text identifiers based on the number of text components in the subset corresponding to each text identifier, and then perform a second-level sort within each first-level group based on the similarity between the text corresponding to each text identifier and the user input.
In one embodiment, the device further comprises a display unit, configured to display the retrieved texts to the user, after multiple texts have been retrieved from the text material library, based on the sorting of all the text identifiers.
In one embodiment, the display unit is further configured to, on the display page, in addition to displaying the retrieved texts to the user based on the sorting, also display, mixed in with them, multiple texts obtained by keyword retrieval on the user input.
Another aspect of this specification provides a device for building an inverted index table of a text material library. The text material library contains multiple texts for a specific business, and a predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The device comprises:
an identification unit, configured to identify, for each text in the text material library, the text components contained in the text; and
a building unit, configured to build the inverted index table of the text material library based on the text components contained in each text, where a first retrieval key of the inverted index table is a first text component among the predetermined number of text components, and the retrieval value corresponding to the first retrieval key is the text identifiers of the texts that contain the first text component.
In one embodiment, the building unit is further configured to build the inverted index table of the text material library based on both the text components and the keywords contained in each text, where a second retrieval key of the inverted index table is a first keyword, and the retrieval value corresponding to the second retrieval key is the text identifiers of the texts that contain the first keyword. The first keyword is a keyword contained in the multiple texts, and in the inverted index table a predetermined marker indicates that the first retrieval key corresponds to a text component.
Another aspect of this specification provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform any of the above methods.
Another aspect of this specification provides a computing device comprising a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements any of the above methods.
With the text retrieval scheme according to the embodiments of this specification, retrieval is performed based on text components, or based on both text components and text keywords, which satisfies both the accuracy and the diversity of retrieval.
Brief description of the drawings
The embodiments of this specification can be made clearer by describing them with reference to the accompanying drawings:
Fig. 1 shows a text retrieval system 100 according to an embodiment of this specification;
Fig. 2 shows a text retrieval method according to an embodiment of this specification;
Fig. 3 schematically shows the process by which a search engine performs retrieval based on two kinds of retrieval keys (keywords and text components);
Fig. 4 shows a method for building an inverted index table of a text material library according to an embodiment of this specification;
Fig. 5 shows the process of building an inverted index table that contains two kinds of retrieval keys according to an embodiment of this specification;
Fig. 6 shows a text retrieval device 600 according to an embodiment of this specification;
Fig. 7 shows a device 700 for building an inverted index table of a text material library according to an embodiment of this specification.
Detailed description
The embodiments of this specification are described below with reference to the drawings.
Fig. 1 shows a text retrieval system 100 according to an embodiment of this specification. As shown, the system 100 includes an index building unit 11, a retrieval unit 12, a sorting unit 13, and a display unit 14. The index building unit 11 builds the inverted index of the text material library. In the embodiments of this specification, the inverted index may be based on text components alone, or on both text components and keywords. When the text component inverted index is built, the text components of each text in the material library can be extracted with a pre-trained sequence labeling model, and an inverted index table is then built with text components as keys and the identifiers of the texts containing each component as values. The retrieval unit 12 performs retrieval based on user input. The user is typically a copywriter who, when writing a text for a specific business, enters keywords related to the text to be written into the retrieval unit 12 in order to retrieve at least one text from the text material library for that business as a reference. After receiving the user input, the retrieval unit 12 can perform retrieval with text components as retrieval keys, or perform retrieval with text components and with keywords as retrieval keys at the same time, to obtain retrieval results. The retrieval unit 12 then sends the retrieval results to the sorting unit 13, which sorts them according to a predetermined ranking criterion to obtain ranked retrieval results and can send the top-ranked texts to the display unit 14 for display to the user.
Each of these processes is described below.
Fig. 2 shows a text retrieval method according to an embodiment of this specification. The method operates on a text material library obtained in advance, which contains multiple texts for a specific business. A predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The method comprises:
Step S202: obtaining user input;
Step S204: identifying the text components contained in the user input; and
Step S206: based on the identified text components, retrieving multiple texts from the text material library through a pre-built inverted index table of the library that uses text components as retrieval keys.
The text retrieval method can be used, for example, on an intelligent creative platform to help users write copy, and can be executed by a copy search engine on that platform. After receiving the search keywords entered by the user, the search engine searches the text material library and returns the results. As described above, the texts in the text material library are texts for a specific business, such as a marketing, advertising, or publicity business. The key elements of such texts are therefore relatively fixed, that is, the texts usually share similar content or structural features, so the content they contain can be abstracted into components. For example, when the specific business is a marketing business, the texts in the library are pieces of marketing copy, which may have been accumulated by the creative platform itself or obtained through external channels. Typical marketing copy includes, for example, "** supermarket, Spring Festival promotion, 50% off drinks" and "Pay with Alipay, 1 yuan cash back on every order". Abstracting such copy into components, its content can be grouped into, for example, eight components: brand (for example, "** supermarket", "Alipay"), action (for example, "scan a code", "register", "pay"), amount (for example, a discount amount or cash-back amount, such as "1 yuan" above), discount (for example, "50% off" above), gift (for example, "iPhone X"), holiday (for example, Christmas, Spring Festival), activity scene (for example, online, offline), and activity location (for example, India, China). It will be appreciated that the specific business targeted by the text material library is not limited to marketing, advertising, or publicity, and may be any of various other businesses. The library can then be abstracted into components based on the features of the texts for that business, to obtain the predetermined number of text components.
Each step shown in Fig. 2 is described in detail below.
In step S202, user input is obtained.
The user can enter search keywords into the search engine, which thereby obtains the user input and executes the method based on it. The user input can be any text. For example, it can be a list of keywords related to the current operational campaign, with keywords separated by spaces, such as "Alipay cash back 1 yuan".
In step S204, the text components contained in the user input are identified.
The text components are the components preset for the text material library as described above, such as the eight components preset for the marketing text material library. The user input can be fed into a pre-trained sequence labeling model, which outputs the text components contained in it. The sequence labeling model can be a BiLSTM+CRF model, an HMM or CRF model, or even a model based on rules and dictionaries. The model can be trained on multiple texts that also belong to the specific business: multiple texts of the specific business are obtained and the text components in each are annotated to produce training samples, which are then used to train the model. Once the model is trained, the user input above, "Alipay cash back 1 yuan", can be fed into it in sequence, and the model outputs, for example, the text component set {brand, amount}. When the user input contains more content, so that the identified text components include repeats, the identified components are also de-duplicated to obtain the final set. For example, if the user input contains several amount-related items, such as "1 yuan cash back, plus a 5 yuan red envelope", the model identifies two "amount" components, and one of them is removed.
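As a rough illustration of the rule-and-dictionary option mentioned above (not of the BiLSTM+CRF model), the following Python sketch tags whitespace-separated tokens of the user input with a small hand-written lexicon and de-duplicates the resulting components. The lexicon entries, the amount pattern, the English component names, and the function name `identify_components` are assumptions made for this example only.

```python
import re

# Hypothetical lexicon mapping lower-cased tokens to component types (assumption).
LEXICON = {
    "alipay": "brand",
    "pay": "action",
    "register": "action",
    "christmas": "holiday",
}

# Hypothetical rule: digits immediately followed by "yuan" count as an amount.
AMOUNT_RE = re.compile(r"^\d+(\.\d+)?yuan$")


def identify_components(user_input: str) -> list[str]:
    """Return de-duplicated component types in order of first appearance."""
    found: list[str] = []
    for token in user_input.lower().split():
        if AMOUNT_RE.match(token):
            label = "amount"
        else:
            label = LEXICON.get(token)  # None when the token is not in the lexicon
        if label is not None and label not in found:
            found.append(label)  # repeated components are kept only once
    return found
```

For instance, `identify_components("Alipay cashback 1yuan 5yuan")` finds one brand and two amounts and keeps a single "amount" entry, mirroring the de-duplication step described in the text.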
In step S206, based on the identified text components, multiple texts are retrieved from the text material library through a pre-built inverted index table of the library that uses text components as retrieval keys.
The building of the inverted index table that uses text components as retrieval keys is described in detail further below. In the inverted index table, each of the preset text components serves as a retrieval key (key), and the text identifiers of the texts containing that component serve as the retrieval value (value) corresponding to that key.
Multiple texts can be retrieved from the text material library through the inverted index table, based on the identified text components, in a variety of ways. One specific retrieval scheme is described below as an example.
For example, all non-empty subsets of the text component set {brand, amount} can first be obtained. In general, a set of n elements has 2^n - 1 non-empty subsets, so the set {brand, amount} yields three: {brand, amount}, {brand}, and {amount}. Retrieval is then performed for each subset to obtain its retrieval result. For the subset {brand, amount}, retrieval is performed with "brand" and with "amount" as retrieval keys, and the intersection of the two results is taken as the retrieval result for the subset. For the subset {brand}, retrieval is performed with "brand" as the retrieval key, and the result is taken as the retrieval result for the subset. Each retrieval result consists of the text identifiers associated with the corresponding text components. For example, the retrieval results for the subsets may be as shown in Table 1 below:
Table 1
In Table 1, the numbers 1 to 8 are text identifiers corresponding to eight different texts in the text material library. For example, identifier 4 corresponds to the text "Pay with Alipay, 1 yuan cash back on every order" in the material library. Because the components of this text are {brand, action, amount}, which include both "brand" and "amount", identifier 4 appears in the retrieval results of all three subsets: {brand, amount}, {brand}, and {amount}.
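The subset-based retrieval can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the index contents are hypothetical (in particular, which identifiers belong to "brand" and which to "amount" is an assumption chosen so that only text 4 contains both), and the function names are invented for the example.

```python
from itertools import combinations

# Hypothetical inverted index: component -> set of text identifiers (assumed data).
INDEX = {
    "brand": {1, 2, 3, 4, 5},
    "amount": {4, 6, 7, 8},
}


def nonempty_subsets(components):
    """Yield every non-empty subset of the components as a sorted tuple, largest first."""
    items = sorted(components)
    for size in range(len(items), 0, -1):
        for combo in combinations(items, size):
            yield combo


def retrieve_by_subsets(components, index):
    """Map each non-empty subset to the intersection of its components' identifier sets."""
    results = {}
    for subset in nonempty_subsets(components):
        postings = [index.get(component, set()) for component in subset]
        results[subset] = set.intersection(*postings)
    return results


RESULTS = retrieve_by_subsets({"brand", "amount"}, INDEX)
```

With n identified components the loop visits 2^n - 1 subsets, which stays small here because the number of preset component types is fixed (eight in the marketing example).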
After the retrieval result for each subset has been obtained, a triple (i, p_i, s_i) can be formed for each text identifier in the results. Here i is the text identifier. p_i is the number of elements in the subset corresponding to the text identifier and serves as the ranking priority value. For example, text identifiers in the result for the subset {brand, amount} have a priority value of 2, and those in the result for the subset {brand} have a priority value of 1. For a text identifier that is repeated in Table 1, such as 4, the triples with smaller p_i are removed and only the triple with the largest p_i is kept. s_i is the similarity between the text corresponding to the text identifier and the user input, which can be computed, for example, with the Jaccard coefficient.
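A minimal Python sketch of forming the triples follows. It assumes a Jaccard coefficient over whitespace-separated tokens, which is only one possible tokenization (Chinese text would need a different one), and the function names and sample texts are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over whitespace-separated, lower-cased tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_triples(subset_results, texts, user_input):
    """Return {text_id: (text_id, p, s)}, keeping only the largest p for repeated identifiers."""
    best = {}
    for subset, text_ids in subset_results.items():
        priority = len(subset)  # p_i: number of components in the matching subset
        for text_id in text_ids:
            if text_id not in best or priority > best[text_id][1]:
                similarity = jaccard(texts[text_id], user_input)  # s_i
                best[text_id] = (text_id, priority, similarity)
    return best


# Made-up sample data for illustration.
SAMPLE_RESULTS = {("amount", "brand"): {4}, ("brand",): {1, 4}, ("amount",): {4}}
SAMPLE_TEXTS = {1: "alipay pay", 4: "alipay cashback 1yuan"}
TRIPLES = build_triples(SAMPLE_RESULTS, SAMPLE_TEXTS, "alipay cashback 1yuan")
```

Keeping one entry per identifier in a dictionary performs the duplicate removal in the same pass, so identifier 4 ends up with priority 2 only.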
Thus, for example, the triples shown in Table 2 can be obtained from Table 1:
(4,2,0.8)
(1,1,0.7)
(2,1,0.6)
(3,1,0.5)
(4,1,0.8) (removed)
(5,1,0.2)
(4,1,0.8) (removed)
(6,1,0.9)
(7,1,0.4)
(8,1,0.3)
Table 2
The text identifiers can then be sorted based on Table 2. First, a first-level sort is performed on the identifiers by priority value: the triple (4,2,0.8), whose priority value is 2, is placed first, and the triples whose priority value is 1 are placed after it. Then a second-level sort by similarity is performed on the triples whose priority value is 1, yielding the ranked triples shown in Table 3:
(4,2,0.8)
(6,1,0.9)
(1,1,0.7)
(2,1,0.6)
(3,1,0.5)
(7,1,0.4)
(8,1,0.3)
(5,1,0.2)
Table 3
Once Table 3, that is, the ranked result list, has been obtained, the search engine can show the corresponding texts on the display page based on the ranking. For example, if the page is preset to show 5 texts, the five texts are shown on the page in the order 4, 6, 1, 2, 3.
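The two-level ordering can be expressed as a single sort on a composite key. The Python sketch below uses the triples of Table 2 (after duplicate removal) as input; the function name `rank` and the `page_size` parameter are assumptions for illustration.

```python
# Triples (i, p_i, s_i) from Table 2 after the duplicate entries for text 4 are removed.
TABLE_2 = [
    (4, 2, 0.8), (1, 1, 0.7), (2, 1, 0.6), (3, 1, 0.5),
    (5, 1, 0.2), (6, 1, 0.9), (7, 1, 0.4), (8, 1, 0.3),
]


def rank(triples, page_size=5):
    """Sort by priority p (descending), then similarity s (descending); return the top identifiers."""
    ordered = sorted(triples, key=lambda t: (-t[1], -t[2]))
    return [text_id for text_id, _, _ in ordered[:page_size]]


TOP_FIVE = rank(TABLE_2)
```

Here `rank(TABLE_2)` gives `[4, 6, 1, 2, 3]`, matching the display order stated above, and a `page_size` of 8 reproduces the full order of Table 3.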
It will be appreciated that the retrieval scheme above is only illustrative. The embodiments of this specification are not limited to it, and any retrieval scheme that a person skilled in the art may conceive can be used. For example, after the text component set {brand, amount} has been obtained, it does not have to be split into three subsets that are retrieved separately. Instead, retrieval can be performed directly with "brand" and with "amount" as retrieval keys, and the texts in the intersection of the two results can be sorted by similarity to obtain ranked retrieval results. Other variants are not enumerated here.
In one embodiment, in addition to retrieval with text component retrieval keys as described above, retrieval is also performed on the user input based on the keywords it contains. Specifically, keywords are first extracted from the user input, and a keyword set is obtained through processing such as stop-word filtering and de-duplication. Then, based on the keyword set and a pre-built index table of the text material library that uses keywords as retrieval keys, a ranked result list for the user input is retrieved in the same way as in the retrieval scheme described above. The two kinds of retrieval results can then be shown mixed on the display page in a predetermined ratio. For example, if the display page has 10 preset slots, the two result lists can be set to be mixed at a ratio of 5:5, for example by taking the top 5 texts from each list and showing them mixed on the display page.
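The specification fixes the ratio but not the order in which the two lists are mixed. The Python sketch below assumes a simple alternating interleave with duplicate removal. That choice, together with the function name `mix_results` and its parameters, is an assumption for illustration only.

```python
def mix_results(component_ranked, keyword_ranked, slots=10, component_share=0.5):
    """Take the top share of each ranked list and interleave them, skipping duplicates."""
    n_component = round(slots * component_share)
    n_keyword = slots - n_component
    top_component = component_ranked[:n_component]
    top_keyword = keyword_ranked[:n_keyword]

    mixed, seen = [], set()
    for position in range(max(len(top_component), len(top_keyword))):
        # Alternate: one component-based result, then one keyword-based result.
        for source in (top_component, top_keyword):
            if position < len(source) and source[position] not in seen:
                seen.add(source[position])
                mixed.append(source[position])
    return mixed


# Made-up ranked identifier lists; text 4 appears in both.
MIXED = mix_results([4, 6, 1, 2, 3, 7], [9, 4, 10, 11, 12, 13])
```

Because a text found by both retrievals is shown only once, the mixed list can be shorter than the number of slots. A real system might backfill from the remaining results, which this sketch does not attempt.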
Fig. 3 schematically shows the process by which the search engine performs retrieval based on the two kinds of retrieval keys described above. As shown in Fig. 3, the left side corresponds to keyword-based retrieval and the right side to text-component-based retrieval. Specifically, in step S31, user input is obtained. In step S32, keywords are extracted from the user input; in step S34, keyword subsets are generated from the extracted keywords; in step S36, retrieval is performed through the keyword index table based on the keyword subsets, and the retrieval results are sorted to obtain a first result list. In step S33, text components are identified from the user input; in step S35, text component subsets are generated from the identified components; in step S37, retrieval is performed through the text component index table based on the text component subsets, and the retrieval results are sorted to obtain a second result list. The dashed box in the figure indicates that steps S36 and S37 can be carried out through the same inverted index table, which contains both retrieval keys that are text components and retrieval keys that are keywords. In step S38, the top-ranked texts of the two result lists are mixed at a predetermined ratio, and in step S39 the mixed texts are output for display on the display page.
Fig. 4 shows a method for building an inverted index table of a text material library according to an embodiment of this specification. The text material library includes multiple texts for a specific business, and a predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The method comprises:
Step S402: for each text in the text material library, identifying the text components included in that text; and
Step S404: building the inverted index table of the text material library based on the text components included in each text, wherein a first index key of the inverted index table is a first text component among the predetermined number of text components, and the index value corresponding to the first index key is the text identifiers of the texts that include the first text component.
Specifically, in step S402, for each text in the text material library, the text components included in that text are identified. For this step, reference can be made to the detailed description of step S204 above: each text in the text material library can be input into a pre-trained sequence labelling model to identify the set of text components corresponding to that text. For example, for the material-library text "Pay with Alipay and get 1 yuan cash back on every order", inputting it into the sequence labelling model yields the set of text components {brand, action, amount}, where "Alipay" corresponds to "brand", "pay" corresponds to "action", and "1 yuan" corresponds to "amount".
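As a rough illustration of how a labelled sequence becomes a set of text components, the following Python sketch assumes the sequence labelling model emits one BIO-style tag per token (for example "B-brand", "I-brand", "O"). The tag format, the tokenisation and the function name `components_from_tags` are assumptions made for this example; the specification does not state the model's output format.

```python
def components_from_tags(tags):
    """Collect the component names that occur in a BIO-style tag sequence."""
    # "O" marks tokens outside any component; other tags look like "B-brand" or "I-brand".
    return {tag.split("-", 1)[1] for tag in tags if tag != "O" and "-" in tag}


# Hypothetical model output for a text mentioning a brand, an action and an amount.
tags = ["O", "B-brand", "I-brand", "B-action", "O", "O", "B-amount", "I-amount"]
components = components_from_tags(tags)  # {"brand", "action", "amount"}
```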
In step S404, the inverted index table of the text material library is built based on the text components included in each text, wherein a first index key of the inverted index table is a first text component among the predetermined number of text components, and the index value corresponding to the first index key is the text identifiers of the texts that include the first text component.
The inverted index table is a key-value table with a map structure, in which each key is a text component and each value is the text identifiers of the texts that include that text component. For example, when the text material library is a library of marketing copy, the keys of the inverted index table include, for example, the eight kinds of text components mentioned above. For one such key, "brand", the corresponding value is the copy identifiers of all copy in the copy material library that includes a "brand" component.
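The map structure can be sketched in Python as a dictionary from component name to a set of text identifiers. This is a minimal illustration that assumes the components of each text have already been identified in step S402; the identifiers `t1` and `t2` and the function name `build_component_index` are made up for the example.

```python
from collections import defaultdict


def build_component_index(labelled_texts):
    """Map each text component to the identifiers of the texts that contain it."""
    index = defaultdict(set)
    for text_id, components in labelled_texts.items():
        for component in components:
            index[component].add(text_id)
    return dict(index)


# Hypothetical output of step S402: text identifier -> identified components.
labelled_texts = {
    "t1": {"brand", "action", "amount"},
    "t2": {"brand", "action"},
}
index = build_component_index(labelled_texts)
# index["brand"] contains both texts; index["amount"] contains only "t1".
```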
In one embodiment, as shown in Fig. 3, retrieval is performed with the two kinds of index keys in combination. In this case, the search engine may build, based on the copy material library, two index tables corresponding respectively to the two kinds of index keys, or may include both kinds of index keys in a single index table. In the latter case, the two kinds of index keys can be distinguished by a predetermined marker. For example, the index key "brand" as a text component may be written as "#brand#" to distinguish it from the index key "brand" as a keyword. When retrieving with the keyword "brand" as the index key, the search engine obtains the text identifiers corresponding to "brand" in the index table; when retrieving with the text component "brand" as the index key, it obtains the text identifiers corresponding to "#brand#".
Fig. 5 shows the process of building an inverted index table that includes both kinds of index keys according to an embodiment of this specification. As shown in Fig. 5, in step S51, keywords are extracted from each text in the copy material library; in step S52, text components are identified from each text in the copy material library; in step S53, the inverted index table is built based on the keywords and text components of each text, yielding the inverted index table shown in the figure, which includes both keyword index keys and text-component index keys.
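A single table holding both kinds of index keys might look like the following Python sketch. It assumes that keyword extraction (S51) and component identification (S52) have already been done for each text; the sample keywords, the text identifiers, and the helper names `component_key` and `build_combined_index` are hypothetical. Only the "#...#" marker comes from the description.

```python
from collections import defaultdict


def component_key(component):
    """Wrap a component name in the '#' marker so it cannot collide with a keyword."""
    return f"#{component}#"


def build_combined_index(library):
    """library maps text_id -> (keywords, components); returns index key -> text ids."""
    index = defaultdict(set)
    for text_id, (keywords, components) in library.items():
        for keyword in keywords:        # keyword keys are stored unchanged
            index[keyword].add(text_id)
        for component in components:    # component keys carry the marker
            index[component_key(component)].add(text_id)
    return dict(index)


library = {
    "t1": ({"Alipay", "pay"}, {"brand", "action", "amount"}),
    "t2": ({"brand", "discount"}, {"action"}),
}
index = build_combined_index(library)
keyword_hits = index["brand"]                     # texts containing the word "brand"
component_hits = index[component_key("brand")]    # texts containing a brand component
```

Here `keyword_hits` and `component_hits` differ, which is the reason the marker is needed when both key types share one table.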
Fig. 6 shows a text retrieval device 600 according to an embodiment of this specification. The device is implemented based on a text material library obtained in advance, the text material library including multiple texts for a specific business. A predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The device comprises:
an acquiring unit 61, configured to obtain a user input;
a recognition unit 62, configured to identify, from the user input, the text components included therein; and
a retrieval unit 63, configured to retrieve multiple texts from the text material library, based on the identified text components, through a pre-built inverted index table of the text material library that uses text components as index keys.
In one embodiment, the recognition unit 62 is further configured to identify the text components included in the user input by means of a pre-trained sequence labelling model.
In one embodiment, the retrieval unit 63 further comprises:
an obtaining subunit 631, configured to obtain all non-empty subsets of the set of the identified text components; and
a retrieval subunit 632, configured to, for each subset, perform retrieval through the inverted index table to obtain a corresponding retrieval result, wherein the retrieval result corresponding to a subset is the intersection of the retrieval results obtained by retrieving with each text component in the subset as an index key, each retrieval result being a list of text identifiers of the corresponding texts.
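The subset enumeration and per-subset intersection can be sketched as follows. This is a minimal Python illustration assuming the inverted index is a dictionary from component name to a set of text identifiers; the function names and sample data are assumptions for the example.

```python
from itertools import combinations


def nonempty_subsets(components):
    """Yield every non-empty subset of the identified components, largest first."""
    items = sorted(components)
    for size in range(len(items), 0, -1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def retrieve_by_subsets(components, index):
    """For each subset, intersect the posting sets of all its components."""
    results = {}
    for subset in nonempty_subsets(components):
        postings = [index.get(component, set()) for component in subset]
        results[subset] = set(postings[0]).intersection(*postings[1:])
    return results


index = {"brand": {"t1", "t2"}, "action": {"t1", "t3"}}
results = retrieve_by_subsets({"brand", "action"}, index)
# The subset {brand, action} yields only "t1"; {brand} alone yields "t1" and "t2".
```

A set of n components has 2^n - 1 non-empty subsets, so this enumeration is practical only because the number of preset text components is small.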
In one embodiment, the retrieval unit 63 further comprises:
a sorting subunit 633, configured to, after retrieval has been performed through the inverted index table for each subset and the corresponding retrieval results have been obtained, sort all the text identifiers included in all the retrieval results based on at least one of: the number of text components included in the subset corresponding to each text identifier, and the similarity between the text corresponding to each text identifier and the user input.
In one embodiment, the sorting subunit 633 is further configured to, for all the text identifiers, first perform a first-level sort based on the number of text components included in the subset corresponding to each text identifier, and then perform a second-level sort within each first-level group based on the similarity between the text corresponding to each text identifier and the user input.
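The two-level ordering can be sketched in Python with a composite sort key. The sketch makes several assumptions that the specification leaves open: when a text appears under several subsets, the largest one is taken as its corresponding subset; similarity scores are supplied as a precomputed dictionary; and ties are broken by identifier to keep the output deterministic.

```python
def rank_texts(subset_results, similarity):
    """Order text ids by matched-subset size, then by similarity to the user input."""
    best_size = {}
    for subset, text_ids in subset_results.items():
        for text_id in text_ids:
            # Assumption: the largest subset that retrieved a text counts for that text.
            best_size[text_id] = max(best_size.get(text_id, 0), len(subset))
    # First level: more matched components first. Second level: higher similarity first.
    return sorted(best_size, key=lambda t: (-best_size[t], -similarity[t], t))


subset_results = {
    frozenset({"brand", "action"}): {"t1"},
    frozenset({"brand"}): {"t1", "t2", "t3"},
}
similarity = {"t1": 0.2, "t2": 0.9, "t3": 0.5}  # hypothetical scores
ranking = rank_texts(subset_results, similarity)
```

In this example `t1` ranks first despite its low similarity because it matches two components, and `t2` precedes `t3` within the one-component group.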
In one embodiment, the device 600 further comprises a display unit 64, configured to, after multiple texts have been retrieved from the text material library, show the retrieved texts to the user based on the sorting of all the text identifiers.
In one embodiment, the display unit 64 is further configured to, on the display page, in addition to showing the retrieved texts to the user based on the sorting, also show, mixed with them, multiple texts obtained by keyword retrieval on the user input.
Fig. 7 shows a device 700 for building an inverted index table of a text material library according to an embodiment of this specification. The text material library includes multiple texts for a specific business, and a predetermined number of text components are preset for the multiple texts, each text component corresponding to content of a respective type in the multiple texts. The device comprises:
a recognition unit 71, configured to, for each text in the text material library, identify the text components included in that text; and
a construction unit 72, configured to build the inverted index table of the text material library based on the text components included in each text, wherein a first index key of the inverted index table is a first text component among the predetermined number of text components, and the index value corresponding to the first index key is the text identifiers of the texts that include the first text component.
In one embodiment, the construction unit is further configured to build the inverted index table of the text material library based on the text components included in each text and the keywords included in each text, wherein a second index key of the inverted index table is a first keyword, the index value corresponding to the second index key is the text identifiers of the texts that include the first keyword, the first keyword is a keyword included in the multiple texts, and, in the inverted index table, a predetermined marker indicates that the first index key corresponds to a text component.
In another aspect, this specification provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
In another aspect, this specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements any of the methods described above.
With the text retrieval scheme according to the embodiments of this specification, retrieval is based on text components, or on both text components and text keywords. This recalls texts that are literally the same as the input as well as texts that differ literally but share the same text components. The scheme thus serves both the accuracy and the diversity of retrieval, meets the need for varied and rich copywriting, and avoids the user fatigue caused by repeated delivery of the same copy.
It should be understood that terms such as "first" and "second" herein are used only to distinguish similar concepts for simplicity of description and have no other limiting effect.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Those of ordinary skill in the art should further appreciate that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The specific embodiments described above further explain in detail the purpose, technical solutions and beneficial effects of the present invention. It should be understood that the foregoing is merely a description of specific embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (20)

1. A text retrieval method, performed based on a text material library obtained in advance, the text material library including multiple texts for a specific business, wherein a predetermined number of text components are preset for the multiple texts and each text component corresponds to content of a respective type in the multiple texts, the method comprising:
obtaining a user input;
identifying, from the user input, the text components included therein; and
retrieving multiple texts from the text material library, based on the identified text components, through a pre-built inverted index table of the text material library that uses text components as index keys.
2. The method according to claim 1, wherein identifying, from the user input, the text components included therein comprises identifying the text components included in the user input by means of a pre-trained sequence labelling model.
3. The method according to claim 1, wherein retrieving multiple texts from the text material library, based on the identified text components, through the pre-built inverted index table of the text material library that uses text components as index keys comprises:
obtaining all non-empty subsets of the set of the identified text components; and
for each subset, performing retrieval through the inverted index table to obtain a corresponding retrieval result, wherein the retrieval result corresponding to a subset is the intersection of the retrieval results obtained by retrieving with each text component in the subset as an index key, each retrieval result being a list of text identifiers of the corresponding texts.
4. The method according to claim 3, wherein retrieving multiple texts from the text material library, based on the identified text components, through the pre-built inverted index table of the text material library that uses text components as index keys further comprises:
after retrieval has been performed through the inverted index table for each subset and the corresponding retrieval results have been obtained, sorting all the text identifiers included in all the retrieval results based on at least one of: the number of text components included in the subset corresponding to each text identifier, and the similarity between the text corresponding to each text identifier and the user input.
5. The method according to claim 4, wherein sorting all the text identifiers comprises, for all the text identifiers, first performing a first-level sort based on the number of text components included in the subset corresponding to each text identifier, and then performing a second-level sort within each first-level group based on the similarity between the text corresponding to each text identifier and the user input.
6. The method according to claim 4, further comprising, after multiple texts have been retrieved from the text material library, showing the retrieved texts to the user based on the sorting of all the text identifiers.
7. The method according to claim 6, wherein showing the retrieved texts to the user based on the sorting of all the text identifiers comprises, on a display page, in addition to showing the retrieved texts to the user based on the sorting, also showing, mixed with them, multiple texts obtained by keyword retrieval on the user input.
8. A method for building an inverted index table of a text material library, the text material library including multiple texts for a specific business, wherein a predetermined number of text components are preset for the multiple texts and each text component corresponds to content of a respective type in the multiple texts, the method comprising:
for each text in the text material library, identifying the text components included in that text; and
building the inverted index table of the text material library based on the text components included in each text, wherein a first index key of the inverted index table is a first text component among the predetermined number of text components, and the index value corresponding to the first index key is the text identifiers of the texts that include the first text component.
9. The method according to claim 8, wherein building the inverted index table of the text material library based on the text components included in each text comprises building the inverted index table of the text material library based on the text components included in each text and the keywords included in each text, wherein a second index key of the inverted index table is a first keyword, the index value corresponding to the second index key is the text identifiers of the texts that include the first keyword, the first keyword is a keyword included in the multiple texts, and, in the inverted index table, a predetermined marker indicates that the first index key corresponds to a text component.
10. A text retrieval device, implemented based on a text material library obtained in advance, the text material library including multiple texts for a specific business, wherein a predetermined number of text components are preset for the multiple texts and each text component corresponds to content of a respective type in the multiple texts, the device comprising:
an acquiring unit, configured to obtain a user input;
a recognition unit, configured to identify, from the user input, the text components included therein; and
a retrieval unit, configured to retrieve multiple texts from the text material library, based on the identified text components, through a pre-built inverted index table of the text material library that uses text components as index keys.
11. The device according to claim 10, wherein the recognition unit is further configured to identify the text components included in the user input by means of a pre-trained sequence labelling model.
12. The device according to claim 10, wherein the retrieval unit further comprises:
an obtaining subunit, configured to obtain all non-empty subsets of the set of the identified text components; and
a retrieval subunit, configured to, for each subset, perform retrieval through the inverted index table to obtain a corresponding retrieval result, wherein the retrieval result corresponding to a subset is the intersection of the retrieval results obtained by retrieving with each text component in the subset as an index key, each retrieval result being a list of text identifiers of the corresponding texts.
13. The device according to claim 12, wherein the retrieval unit further comprises:
a sorting subunit, configured to, after retrieval has been performed through the inverted index table for each subset and the corresponding retrieval results have been obtained, sort all the text identifiers included in all the retrieval results based on at least one of: the number of text components included in the subset corresponding to each text identifier, and the similarity between the text corresponding to each text identifier and the user input.
14. The device according to claim 13, wherein the sorting subunit is further configured to, for all the text identifiers, first perform a first-level sort based on the number of text components included in the subset corresponding to each text identifier, and then perform a second-level sort within each first-level group based on the similarity between the text corresponding to each text identifier and the user input.
15. The device according to claim 13, further comprising a display unit, configured to, after multiple texts have been retrieved from the text material library, show the retrieved texts to the user based on the sorting of all the text identifiers.
16. The device according to claim 15, wherein the display unit is further configured to, on a display page, in addition to showing the retrieved texts to the user based on the sorting, also show, mixed with them, multiple texts obtained by keyword retrieval on the user input.
17. A device for building an inverted index table of a text material library, the text material library including multiple texts for a specific business, wherein a predetermined number of text components are preset for the multiple texts and each text component corresponds to content of a respective type in the multiple texts, the device comprising:
a recognition unit, configured to, for each text in the text material library, identify the text components included in that text; and
a construction unit, configured to build the inverted index table of the text material library based on the text components included in each text, wherein a first index key of the inverted index table is a first text component among the predetermined number of text components, and the index value corresponding to the first index key is the text identifiers of the texts that include the first text component.
18. The device according to claim 17, wherein the construction unit is further configured to build the inverted index table of the text material library based on the text components included in each text and the keywords included in each text, wherein a second index key of the inverted index table is a first keyword, the index value corresponding to the second index key is the text identifiers of the texts that include the first keyword, the first keyword is a keyword included in the multiple texts, and, in the inverted index table, a predetermined marker indicates that the first index key corresponds to a text component.
19. A computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-9.
20. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-9.
CN201910488395.7A 2019-06-05 2019-06-05 Text retrieval method and device Active CN110245215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910488395.7A CN110245215B (en) 2019-06-05 2019-06-05 Text retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910488395.7A CN110245215B (en) 2019-06-05 2019-06-05 Text retrieval method and device

Publications (2)

Publication Number Publication Date
CN110245215A true CN110245215A (en) 2019-09-17
CN110245215B CN110245215B (en) 2023-10-20

Family

ID=67886342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910488395.7A Active CN110245215B (en) 2019-06-05 2019-06-05 Text retrieval method and device

Country Status (1)

Country Link
CN (1) CN110245215B (en)

Also Published As

Publication number Publication date
CN110245215B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant