CN112800316A - Search keyword extraction system based on double-array dictionary tree

Info

Publication number: CN112800316A
Application number: CN202110151716.1A
Authority: CN
Inventors: 张凤超
Original assignee: Beijing Yiche Interconnection Information Technology Co ltd
Current assignee: Beijing Yiche Interconnection Information Technology Co ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-05-14

Abstract

The application discloses a search keyword extraction system based on a double-array dictionary tree, comprising a user interface, a query operation module, a retrieval module, a sorting module, a text operation module, an index module, a database management module, a text database module, a first word segmentation module and a second word segmentation module. The first word segmentation module is arranged inside the retrieval module, and the second word segmentation module is arranged inside the index module. The user interface is connected with the query operation module, the query operation module with the retrieval module, the retrieval module with the sorting module, and the user interface with the database management module. The system makes full use of an AC state machine (Aho-Corasick automaton) to complete pattern matching at high speed, so that automobile-related words in phrase text can be recognized quickly, the user's detailed intention can be obtained and passed on to the subsequent search process, and the retrieval results better match the user's expectations.

Description

Search keyword extraction system based on double-array dictionary tree

Technical Field

The application relates to a keyword extraction system, in particular to a search keyword extraction system based on a double-array dictionary tree.

Background

When a user searches with a search engine, queries containing long-tail words often return poor results, and the top results may not be what the user wants. The reason is that all the words in a long-tail query are treated without distinction, whereas semantic analysis shows that some words are incidental and only a few are keywords. For a search engine in the automobile industry, the automobile-related words in the information entered by the user need to be extracted to facilitate better analysis and processing later.

The basic industry solutions to this problem are as follows:

Scheme one: the TF-IDF algorithm. TF-IDF is a numerical statistical method used to reflect the importance of a word to a document in a corpus. Its main idea is: if a word appears frequently in a document (TF is high) and rarely in other documents (IDF is high), the word is considered to have good category discrimination ability.
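As a rough illustration of scheme one, the sketch below computes TF-IDF scores in plain Python. The function name tf_idf, the sample documents and the unsmoothed log(N/df) form of IDF are choices made for this example and are not taken from the application.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (docs is a list of token lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # TF = relative frequency in this document; IDF = log(N / df).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already segmented documents.
docs = [
    ["aston", "martin", "price", "price"],
    ["odyssey", "price"],
    ["aston", "martin", "review"],
]
scores = tf_idf(docs)
```

In the second sample document, "odyssey" scores higher than "price" because "price" also occurs in another document, which is the discrimination effect the scheme relies on.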

Scheme two: the TextRank algorithm. Its important characteristic is that it does not depend on a background corpus: the keywords of a document can be extracted by analyzing that single document alone.
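A minimal sketch of the TextRank idea, assuming a word co-occurrence graph and a PageRank-style update, is given below. The window size, damping factor, iteration count and the name textrank_keywords are illustrative assumptions; the application does not specify them.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank the words of a single document, most important first."""
    # Undirected co-occurrence graph: two words are linked when they
    # appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # PageRank-style iteration: each word shares its rank equally
    # among its neighbors.
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        rank = {
            word: (1 - damping) + damping * sum(
                rank[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(rank, key=rank.get, reverse=True)

# Hypothetical segmented document.
tokens = ["aston", "martin", "price", "aston", "martin", "review", "aston", "engine"]
ranked = textrank_keywords(tokens)
```

Only the single token list is needed as input, which matches the corpus-free property described for this scheme.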

Scheme three: candidate word matching. Candidates are obtained by multi-pattern matching against a keyword lexicon. The most important work is constructing the lexicon, which fuses several sources: proper nouns from vertical sites, encyclopedia entries, input-method cell lexicons, and words purchased by advertisers.

Problem one: the disadvantage of scheme one is that it measures the importance of a word in an article by word frequency alone, which is sometimes insufficient because an important word may not appear often, and the calculation cannot reflect position information or the importance of the word in its context.

Problem two: scheme two is based on PageRank, and the PageRank data must be prepared in advance, so real-time recognition is poor and old pages rank higher than new pages, since even a very good new page will not have many inbound links unless it is a sub-page of an existing site.

Problem three: scheme three depends too heavily on the timeliness of the dictionary and lexicon construction, which must be refreshed frequently to meet demand. Therefore, a search keyword extraction system based on a double-array dictionary tree is proposed to solve the above problems.

Disclosure of Invention

A search keyword extraction system based on a double-array dictionary tree comprises a user interface, a query operation module, a retrieval module, a sorting module, a text operation module, an indexing module, an index module, a database management module, a text database module, a first word segmentation module and a second word segmentation module, wherein the first word segmentation module is arranged in the retrieval module, and the second word segmentation module is arranged in the index module;

the user interface is connected with the query operation module, the query operation module is connected with the retrieval module, and the retrieval module is connected with the sorting module.

Further, the user interface is interconnected with the database management module.

Further, the text operation module and the database management module are connected with each other.

Further, the text operation module and the indexing module are connected with each other.

Further, the indexing module and the index module are connected with each other.

Further, the index module and the retrieval module are connected with each other.

Further, the indexing module and the database management module are connected with each other.

Further, the database management module and the text database module are connected with each other.

Further, the user interface takes the form of third-party packaging and the HTTP protocol.

Further, the steps for extracting keywords in the index module are as follows:

(1) Let the array subscript be i. If base[i] and check[i] are both 0, the position is empty.

(2) If base[i] is negative (leaf node), the state may be an end state.

Suppose state A has the direct child states Ab, Ac, Ad, ..., An, the subscript of A in base[] is i, and base[i] = j.

(3) To ensure that the direct children of A can be placed into the arrays, j must satisfy:

base[j+b] == 0, base[j+c] == 0, base[j+d] == 0, ..., base[j+n] == 0;

check[j+b] == 0, check[j+c] == 0, check[j+d] == 0, ..., check[j+n] == 0.

Once the value of j is determined, the subscripts of Ab, Ac, Ad, ..., An are also determined, namely j+b, j+c, j+d, ..., j+n.

At the same time, set:

check[j+b] = i, check[j+c] = i, check[j+d] = i, ..., check[j+n] = i.

(4) Query.
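Steps (1) to (4) can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the root is placed at index 1, an intermediate trie is built first to enumerate children, a state that ends a word but still has children stores its base as a negative number, and the names build_double_array and contains are hypothetical. Words are written as tuples of the translated syllables from the code table in the description, so the indices produced need not match FIG. 5.

```python
from collections import deque

ROOT = 1  # assumed root index; index 0 stays unused so check == 0 can mean "empty"

def build_double_array(words, codes):
    """Build base[]/check[]; `codes` maps each symbol to an integer code >= 1."""
    # Build an ordinary trie first, only to enumerate each state's children.
    trie = {"end": False, "next": {}}
    for word in words:
        node = trie
        for symbol in word:
            node = node["next"].setdefault(codes[symbol], {"end": False, "next": {}})
        node["end"] = True

    base, check = [0, 0], [0, 0]

    def is_empty(t):
        # Step (1): a slot is empty when base and check are both 0.
        return t >= len(base) or (base[t] == 0 and check[t] == 0)

    queue = deque([(trie, ROOT)])
    while queue:
        node, i = queue.popleft()
        children = sorted(node["next"])
        if not children:
            if node["end"]:
                base[i] = -1  # step (2): a negative base marks an end state
            continue
        # Step (3): smallest j >= 1 such that every slot j + c is empty.
        j = 1
        while not all(is_empty(j + c) for c in children):
            j += 1
        base[i] = -j if node["end"] else j  # assumed sign convention for end states
        top = j + children[-1]
        if top >= len(base):
            grow = top + 1 - len(base)
            base.extend([0] * grow)
            check.extend([0] * grow)
        for c in children:
            check[j + c] = i  # each child records the subscript of its parent
            queue.append((node["next"][c], j + c))
    return base, check

def contains(base, check, codes, word):
    """Step (4): follow t = |base[s]| + c while check[t] == s."""
    s = ROOT
    for symbol in word:
        c = codes.get(symbol)
        if c is None:
            return False
        t = abs(base[s]) + c
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0

codes = {"A": 1, "O": 2, "Si": 3, "tri": 4, "lin": 5,
         "ton": 6, "ma": 7, "d": 8, "de": 9, "sai": 10}
words = [("A", "Si", "lin"), ("A", "Si", "ton", "ma", "d"),
         ("A", "Si", "tri"), ("O", "de", "sai")]
base, check = build_double_array(words, codes)
```

Searching for the smallest free j keeps the arrays compact at the cost of a linear scan per state; production double-array implementations usually track free slots to avoid that scan.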

The beneficial effect of this application is: the search keyword extraction system based on the double-array dictionary tree can achieve the purpose of rapidly recognizing automobile related words in phrase texts.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1 is a schematic overall structure diagram of an embodiment of the present application;

FIG. 2 is a schematic diagram of an internal structure of a retrieval module according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an internal structure of an index module according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a keyword extraction process according to an embodiment of the present application;

FIG. 5 is a diagram of a dictionary according to an embodiment of the present application.

In the figures: 1. user interface; 2. query operation module; 3. retrieval module; 4. sorting module; 5. text operation module; 6. indexing module; 7. index module; 8. database management module; 9. text database module; 10. first word segmentation module; 11. second word segmentation module.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.

Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.

Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Referring to FIGS. 1-5, a search keyword extraction system based on a double-array dictionary tree includes a user interface 1, a query operation module 2, a retrieval module 3, a sorting module 4, a text operation module 5, an indexing module 6, an index module 7, a database management module 8, a text database module 9, a first word segmentation module 10 and a second word segmentation module 11, wherein the first word segmentation module 10 is arranged inside the retrieval module 3, and the second word segmentation module 11 is arranged inside the index module 7;

the user interface 1 is connected with the query operation module 2, the query operation module 2 is connected with the retrieval module 3, and the retrieval module 3 is connected with the sorting module 4;

the user interface 1 and the database management module 8 are connected with each other; the text operation module 5 and the database management module 8 are connected with each other; the text operation module 5 and the indexing module 6 are connected with each other; the indexing module 6 and the index module 7 are connected with each other; the index module 7 and the retrieval module 3 are connected with each other; the indexing module 6 is connected with the database management module 8; the database management module 8 and the text database module 9 are connected with each other; the user interface 1 takes the form of third-party packaging and the HTTP protocol;

the user interface 1 provides an interface for callers and may take the form of third-party packaging and the HTTP (Hypertext Transfer Protocol) protocol; the query operation module 2 performs matching searches according to the user's keywords; the retrieval module 3 extracts keywords for the query; the index module 7 segments and merges the source data and then builds the inverted and forward indexes; the text operation module 5 filters the text and performs word segmentation and cleaning; and the indexing module 6 calculates the article classification quality score.
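The index-building step described for the index module can be sketched as follows, reading the two indexes as a forward index (document to terms) and an inverted index (term to documents); that reading is an interpretation of the translated text. The function build_indexes, the sample documents and the use of whitespace splitting in place of the second word segmentation module are assumptions made for illustration.

```python
from collections import defaultdict

def build_indexes(docs, segment):
    """Return (forward, inverted) indexes for docs, a {doc_id: text} mapping."""
    forward = {}                  # doc_id -> list of terms, in order
    inverted = defaultdict(set)   # term -> set of doc_ids containing it
    for doc_id, text in docs.items():
        terms = segment(text)     # stand-in for the word segmentation module
        forward[doc_id] = terms
        for term in terms:
            inverted[term].add(doc_id)
    return forward, dict(inverted)

# Hypothetical source data; str.split stands in for a real segmenter.
docs = {1: "aston martin price", 2: "odyssey price"}
forward, inverted = build_indexes(docs, str.split)
```

A query for a term then reduces to a dictionary lookup in the inverted index, while the forward index returns the terms of a given document.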

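The steps for extracting keywords in the index module 7 are as follows, where i is the subscript of a state A in base[], base[i] = j, and j+b, j+c, j+d, ..., j+n are the subscripts of the direct children of A:

base[j+b] == 0, base[j+c] == 0, base[j+d] == 0, ..., base[j+n] == 0;

check[j+b] == 0, check[j+c] == 0, check[j+d] == 0, ..., check[j+n] == 0.

The subscripts of the children are then j+b, j+c, j+d, ..., j+n.

At the same time, set:

check[j+b] = i, check[j+c] = i, check[j+d] = i, ..., check[j+n] = i.

(4) Query.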

In use, the system is provided to callers through the user interface 1, which may take the form of third-party packaging and the HTTP protocol. According to the user's needs, data is passed through the user interface 1 to the query operation module 2, which performs a matching search according to the keywords provided by the user. In the retrieval module 3, the first word segmentation module 10 segments and merges the source data, on which inverted and forward indexing is then performed; the results are returned to the user interface 1 through the sorting module 4, and the query operation can be continued based on user feedback;

The database management module 8 extracts the data stored in the text database module 9 and delivers it to the indexing module 6, which calculates the classification quality score of the article data. The database management module 8 also delivers text data to the text operation module 5, which filters the text and performs word segmentation and cleaning. The processed documents are delivered through the indexing module 6 to the index module 7, and the text is delivered to the user interface 1 through the database management module 8. In the index module 7, the source data is segmented and merged, and the inverted and forward indexes are then built. The steps for extracting keywords in the index module 7 are:

base[j+b] == 0, base[j+c] == 0, base[j+d] == 0, ..., base[j+n] == 0;

check[j+b] == 0, check[j+c] == 0, check[j+d] == 0, ..., check[j+n] == 0.

The subscripts of the children are then j+b, j+c, j+d, ..., j+n.

At the same time, set:

check[j+b] = i, check[j+c] = i, check[j+d] = i, ..., check[j+n] = i.

After all states have been traversed and set, the construction of the double array is complete.

DAT (double-array trie) queries are extremely convenient. For a word of several characters, the Chinese characters are converted into their corresponding sequence codes, and the word can then be found with that many additions, with no binary search required. Since the average length of Chinese words does not exceed 4 characters, the DAT query algorithm is extremely efficient.

(4) Query

This stage is summarized as follows:

1. There are two arrays: base[] and check[].

2. Each element of base[] corresponds to a node of the trie; its value is the base value used for the transition to the next state.

3. Each element of check[] records the previous state of the current state and is used to check whether the state exists.

4. A transition from state s to state t must satisfy:

base[s] + c = t and check[base[s] + c] = s, where c is the input variable.

Query flow:

Suppose the phrases are "Aslin, Aston Martin, Astri, Odysi", with the following character codes:

1-A, 2-O, 3-Si, 4-tri, 5-lin, 6-ton, 7-ma, 8-d, 9-de, 10-sai

These are processed to generate the dictionary (as shown in FIG. 5).

Query:

Input "Aston".

'A' has code 1, and base[1] = 1. The next input is 'Si', with code 3:

base[1] + 3 = 4, check[4] = 1

The check matches, so 'As' is a state, and since base[4] > 0 the query can continue.

The next input is 'ton', with code 6:

base[4] + 6 = 8

check[8] = 4

The check matches and base[8] < 0, so this is an end state and 'Aston' is a word.
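The lookup of "Aston" can be replayed with only the array entries that the description states. In this sketch the arrays are sparse dictionaries, base[4] = 2 is the value implied by base[4] + 6 = 8, and base[8] = -1 is a placeholder because the text only says that base[8] is negative; the helper name step is hypothetical.

```python
# Only the entries mentioned in the worked example are filled in.
base = {1: 1, 4: 2, 8: -1}   # base[8] is only known to be negative
check = {4: 1, 8: 4}
codes = {"Si": 3, "ton": 6}

def step(s, c):
    """Return t = base[s] + c if check[t] == s, otherwise None."""
    t = base[s] + c
    return t if check.get(t) == s else None

s = 1        # state reached after reading 'A' (code 1)
path = [s]
for symbol in ("Si", "ton"):
    s = step(s, codes[symbol])
    if s is None:
        break
    path.append(s)

# The input is a word if every transition succeeded and the final base is negative.
is_word = s is not None and base[s] < 0
```

Each input symbol costs one addition and one comparison, so the lookup time depends on the length of the query and not on the size of the dictionary.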

A dictionary tree uses the common prefixes of strings to save storage space and offers fast lookup. It works as a finite automaton: each node represents a state, and state transitions are made according to the input variable. The query is complete when an end state is reached or no transition is possible. The tree has three properties: the root node contains no character, and every node other than the root contains exactly one character; the characters along the path from the root to a node, joined together, form the string corresponding to that node; and all child nodes of a node contain different characters.
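These properties can be seen in a plain pointer-based trie, sketched below before any double-array compression. The class and function names and the romanized sample words are illustrative assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one entry per distinct next character
        self.is_end = False   # True when the path from the root spells a word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def search(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:      # no transition is possible
            return False
    return node.is_end        # an end state was reached

def count_nodes(node):
    """Number of nodes below `node`, i.e. excluding the root itself."""
    return sum(1 + count_nodes(child) for child in node.children.values())

root = TrieNode()             # the root holds no character
for w in ("aston", "aslin", "astri"):
    insert(root, w)
```

The three sample words contain 15 characters in total but occupy only 10 nodes, because the prefixes "as" and "ast" are stored once.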

The advantages of the application are: the AC state machine (Aho-Corasick automaton) is fully utilized to complete pattern matching at high speed, so that automobile-related words in phrase texts are recognized quickly, the user's detailed intention is obtained and passed on to the subsequent search process, and the retrieval results better match the user's expectations.
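The multi-pattern matching performed by an AC state machine can be sketched as below. The application does not describe how the automaton is laid out on the double array, so this example uses ordinary Python dictionaries for the transitions; the function names, sample lexicon and sample query are assumptions.

```python
from collections import deque

def build_automaton(patterns):
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states fail to the root; deeper states
    # follow their parent's failure chain until a matching edge is found.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_keywords(text, automaton):
    """Return (start_index, pattern) for every lexicon word found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits

# Hypothetical automobile lexicon and user query.
automaton = build_automaton(["aston martin", "odyssey", "martin"])
hits = find_keywords("used aston martin vs odyssey", automaton)
```

The text is scanned once regardless of how many lexicon entries there are, and overlapping entries such as "aston martin" and "martin" are both reported.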

The circuits, electronic components and modules involved are all prior art and can be implemented by those skilled in the art without undue experimentation; the content protected by the present application does not involve improvements to software or methods.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A search keyword extraction system based on a double-array dictionary tree, characterized in that: the system comprises a user interface (1), a query operation module (2), a retrieval module (3), a sorting module (4), a text operation module (5), an indexing module (6), an index module (7), a database management module (8), a text database module (9), a first word segmentation module (10) and a second word segmentation module (11), wherein the first word segmentation module (10) is arranged in the retrieval module (3), and the second word segmentation module (11) is arranged in the index module (7);

the user interface (1) is connected with the query operation module (2), the query operation module (2) is connected with the retrieval module (3), and the retrieval module (3) is connected with the sorting module (4).

2. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the user interface (1) and the database management module (8) are connected with each other.

3. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the text operation module (5) and the database management module (8) are connected with each other.

4. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the text operation module (5) and the indexing module (6) are connected with each other.

5. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the indexing module (6) and the index module (7) are connected with each other.

6. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the index module (7) and the retrieval module (3) are connected with each other.

7. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the indexing module (6) is connected with the database management module (8).

8. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the database management module (8) and the text database module (9) are connected with each other.

9. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the user interface (1) takes the form of third-party packaging and the HTTP protocol.

10. The system for extracting search keywords based on the double-array dictionary tree according to claim 1, wherein: the steps for extracting keywords in the index module (7) are as follows:

base[j+b] == 0, base[j+c] == 0, base[j+d] == 0, ..., base[j+n] == 0;

check[j+b] == 0, check[j+c] == 0, check[j+d] == 0, ..., check[j+n] == 0;

the subscripts are j+b, j+c, j+d, ..., j+n;

at the same time, set:

check[j+b] = i, check[j+c] = i, check[j+d] = i, ..., check[j+n] = i;

(4) query.