CN103020311B - A kind of processing method of user search word and system - Google Patents


Publication number
CN103020311B
Authority
CN
China
Prior art keywords
user
term
word
entity
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310005804.6A
Other languages
Chinese (zh)
Other versions
CN103020311A (en
Inventor
车天文
雷大伟
石志伟
周步恋
杨振东
王更生
王喜民
何宏靖
徐忆苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Easou World Polytron Technologies Inc
Original Assignee
Shenzhen Yisou Science & Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yisou Science & Technology Development Co Ltd filed Critical Shenzhen Yisou Science & Technology Development Co Ltd
Priority to CN201310005804.6A priority Critical patent/CN103020311B/en
Publication of CN103020311A publication Critical patent/CN103020311A/en
Application granted granted Critical
Publication of CN103020311B publication Critical patent/CN103020311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of information retrieval and provides a method for processing user search terms, comprising: establishing a resource library related to identifying the core words of a user's search; performing basic layering on the search terms input by the user; performing entity introduction on the search terms after the basic layering; and outputting the hierarchical structure of the identified search terms. The present invention also provides a system for processing user search terms. The technical solution of the invention ensures the accuracy of entity extraction, and avoids both the local-optimum problem caused by assessing levels from the vocabulary alone and the insufficient recognition of particular entities caused by relying only on analysis of the overall sentence structure. Finally, the core words of the retrieval sentence are further optimized by means of the subordination relationship, and the core words of the user's sentence are identified, so as to provide the search engine with as much supporting information as possible. At the same time, the method does not depend entirely on the result information of the online search engine, and is easier to operate and implement.

Description

User search word processing method and system
Technical Field
The invention relates to the field of information retrieval, in particular to a method and a system for processing a user retrieval word.
Background
The advent of search engines has given users a tool for searching and obtaining information from large amounts of data. However, not every user understands how a search engine works, so most users compose their own query sentences and assume that the more query words they enter, and the more detailed those words are, the more satisfactory the results will be. In fact, on the one hand, for performance reasons the search engine imposes a maximum length on the query sentence; a query beyond that length is truncated and only part of it is searched. On the other hand, a result is returned as long as it contains the search words, so accuracy is low and the user's real intention is missed.
Furthermore, current search engines display merchant advertisements based on user input as a source of revenue. Sometimes, however, the advertisement is completely unrelated to the information the user entered. The main reason is that the search engine fails to identify the user's core requirement and hits only some of the query words in the user's retrieval.
Therefore, to make search results better satisfy the user and come closer to the user's essential requirement, the search information input by the user needs to be understood. Given the complexity of real-world language, a search sentence input by a user contains many qualifying words that themselves have little practical meaning for the search. The search engine therefore needs to identify the core part, or stem, of the search, so that the core words and stem words of the user's search sentence are hit in the search results, while the words not hit are discarded words or modifiers of little significance. How to extract the corresponding core words from users' search requirements has become one of the problems urgently needing solution in the analysis of search terms (Query) in current search engines.
When a user inputs a retrieval sentence, the search engine should be able to analyze the sentence automatically and identify the core words of the user's retrieval, so that a search result is returned only when the core words are hit; the discarded words and modifiers input by the user are also identified, and whether or not these words are hit does not affect the result. The displayed retrieval results (including advertisements) can thus better meet the user's core requirements.
So far there are few schemes by which a search engine identifies the core words of a user's search. They fall into essentially two kinds: one extracts the corresponding core words based on click information on search results after the fact; the other analyzes Chinese semantics based on a word architecture.
For example, Chinese patent CN102043845A provides a method and apparatus for extracting core keywords based on query sequence clusters. It observes that when a large number of search requirements lead to the same search result being clicked by users, those search requirements tend to reflect the same subject. The method obtains a query sequence cluster of a plurality of query sequences, in which each query sequence corresponds to at least one same search result clicked by a user, and extracts the corresponding core keywords; it thereby obtains the search requirements of the users who input the query sequences in the cluster, and provides closer search suggestions or related search requirements according to the core keywords, so that the user obtains a better search experience. The disadvantages are: first, the demands on the search engine are high; its performance and effect must be stable and its results must basically meet users' needs, so that the click results obtained are reliable and the analysis based on them is consistent with users' actual requirements; second, search results are generally obtained after the user's query has been processed, for example by Query expansion or Query synonyms, so the search results do not necessarily contain the user's search words, and the core words of the user's search therefore cannot be extracted directly.
As another example, Chinese patent CN102681982A provides a method for automatically recognizing the semantics of natural-language sentences in a form understandable by a computer. It abandons the conventional approach of selecting characters and words and instead starts from the linguistic features of Chinese, using a word architecture to let the computer accurately know the language content input by an operator and to analyze precisely the meaning of a Chinese sentence. First, an ontology base for a certain field is established by putting together all accurately described, unambiguous words of that field (comprising a domain knowledge ontology base and a general word topic base); then a semantic framework knowledge base is established based on the understanding of natural-language sentences and the domain ontology; finally, natural-language sentences are visually matched to semantic structures based on the ontology mapping of the semantic framework. The disadvantages are: first, information on the Internet grows enormously every day, new words keep appearing and common words gradually acquire new meanings, and whether such a word serves as a core word or as a modifying auxiliary word depends on the user's retrieval sentence and cannot be generalized; second, the semantic framework knowledge base resembles a set of regular rules, which are huge in number and cannot be induced quickly, and its effect needs further investigation and improvement.
Core word recognition based on after-the-fact search results, first, places high demands on the search engine and is supported only when system performance is stable and the effect is good; second, it depends excessively on the search results and on user reactions, so unnecessary noise (such as advertisements and other information) is easily introduced, and because the search results are obtained through various transformations, they do not necessarily contain the user's search terms, and those terms do not necessarily correspond directly to the retrieval sentence. Third, the results obtained offline can serve only as a reference when a later user inputs the same or a similar Query, so the recall rate is low.
Core word recognition based on establishing a semantic framework knowledge base handles special entities inadequately and does not distinguish well those entity words that also have an ordinary word meaning; moreover, the semantic framework knowledge base consists of rules composed of various words, which take a long time to sort out and induce, and its effect can only be improved gradually.
Disclosure of Invention
The invention provides a method and a system for processing user search terms, to solve the problem that the core words of a user's search currently cannot be identified.
To solve the above problem, an embodiment of the present invention provides a method for processing user search terms, comprising:
establishing a resource library related to identifying the core words of a user's search;
performing basic layering on the search terms input by the user;
performing entity introduction on the search terms after the basic layering;
and outputting the hierarchical structure of the identified search terms.
In the above method, the resource library related to identifying the core words of a user's search comprises a series of word lists, including a stop word list, a modifier word list and an entity resource dictionary.
In the above method, performing basic layering on the search terms input by the user comprises:
segmenting the user's retrieval sentence to obtain a series of query words term and parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], wherein term[i] is the i-th word and pos[i] is its part of speech;
and performing basic layering of the query words input by the user using the stop word list and the modifier word list of the resource library together with the part of speech of each word, specifically as follows:
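\[
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \ \lor\ \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{otherwise}
\end{cases}
\qquad i = 1, 2, \dots, n
\]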
wherein term[i] denotes the i-th word, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] belongs to the modifier word list, level[i] is 1; otherwise level[i] is 2.
In the above method, performing entity introduction on the search terms after the basic layering comprises:
extracting the actual entity word set entityList according to the entity dictionary in combination with the user's retrieval sentence, and updating the levels as follows:
\[
\mathrm{level}[i] =
\begin{cases}
2, & \mathrm{term}[i] \in \mathrm{entityList} \\
\mathrm{level}[i], & \text{otherwise}
\end{cases}
\qquad i = 1, 2, \dots, n
\]
wherein term [ i ] represents the ith term, level [ i ] is the corresponding hierarchy, and entityList is the extracted entity set.
In the above method, extracting the actual entity word set entityList according to the entity dictionary in combination with the user's retrieval sentence comprises:
considering the relevance of the user's retrieval classification, and extracting an entity word when the category of the entity is related to the classification information; or,
extracting the entity word by using sentence rules.
The method further comprises, before outputting the hierarchical structure of the identified user search terms:
performing sentence-pattern syntactic analysis on the user search terms; and/or,
performing subordination relationship identification on the user search terms.
An embodiment of the invention also provides a system for processing user search terms, comprising:
a resource library establishing module, configured to establish a resource library related to identifying the core words of a user's search;
a basic layering module, configured to perform basic layering on the search terms input by the user;
an entity introducing module, configured to perform entity introduction on the search terms after the basic layering;
and an output module, configured to output the hierarchical structure of the identified search terms.
In the above system, the resource library related to identifying the core words of a user's search comprises a series of word lists, including a stop word list, a modifier word list and an entity resource dictionary.
In the above system, performing basic layering on the search terms input by the user specifically comprises:
the basic layering module is configured to segment the user's retrieval sentence to obtain a series of query words term and parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], wherein term[i] is the i-th word and pos[i] is its part of speech;
and to perform basic layering of the query words input by the user using the stop word list and the modifier word list of the resource library together with the part of speech of each word, as follows:
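\[
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \ \lor\ \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{otherwise}
\end{cases}
\qquad i = 1, 2, \dots, n
\]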
wherein term[i] denotes the i-th word, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] belongs to the modifier word list, level[i] is 1; otherwise level[i] is 2.
The above system further comprises:
a sentence-pattern syntactic analysis module, configured to perform sentence-pattern syntactic analysis on the user search terms;
and a subordination relationship identification module, configured to identify subordination relationships in the user search terms.
By adopting the technical solution of the invention, both the lexical characteristics of the retrieval sentence and the special role of entity words are taken into account: entities are introduced with entity disambiguation, which ensures the accuracy of entity extraction, and the user's retrieval sentence is analyzed as a whole by sentence-pattern syntactic analysis. This avoids both the local-optimum problem caused by assessing levels from the vocabulary alone and the insufficient recognition of special entities caused by studying only the overall sentence structure. Finally, the core words of the retrieval sentence are further optimized by means of the subordination relationship, and the core words of the user's sentence are identified, so as to provide the search engine with as much supporting information as possible. At the same time, the method does not depend entirely on the result information of the online search engine, and is easier to operate and implement.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a first embodiment of the present invention;
FIG. 2 is a structural diagram of a second embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
When searching, a user inputs a retrieval sentence as needed, and the sentence generally consists of several search words. Given the richness and complexity of the Chinese language, the sentences users input take many forms and use many words to explain the user's needs in detail. In fact, however, many of these words merely assist the expression and make its meaning clearer, and have little practical significance for retrieval. In the embodiment of the invention, the search words contained in the user's retrieval sentence are divided into four levels:
discarded words: words without actual meaning, such as stop words and punctuation marks, which can be discarded directly and need not take part in the search query, so that retrieval efficiency is improved without loss of retrieval effect;
modifiers: words the user employs for modification when expressing meaning; they do not play a decisive role and only enrich the semantics, and may or may not be hit in the search results;
core words: the core of the user's retrieval sentence, the words that best express the user's search requirement; a search result can be returned to the user only if it hits them;
demand words: attributes describing what the user actually needs, generally a supplement to or emphasis of the user's requirement, such as "download", "song", "lyrics" and "movie"; a result that hits them is preferable, since it indicates that the resource is available, and is ranked higher.
As shown in FIG. 1, which is a flowchart of a first embodiment of the present invention, a method for processing user search terms is provided, which specifically comprises:
Step S101, establishing a resource library related to identifying the core words of a user's search;
The resource library is a series of word lists related to identifying the core words of a user's search, including a stop word list (stopwordList), a modifier word list (modifywordList) and an entity resource dictionary (dicResource).
The stop word list contains a series of stop words common in Chinese, such as "in" and "what"; the modifier word list contains common modifiers, such as "beautiful"; the entity resource dictionary contains the names of various current resources, such as channel resources like novel names, software names and movie names, together with their corresponding categories. The required information can be mined from retrieval logs or crawled and extracted from vertical websites, so that the resource information in the library is as complete as possible.
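The resource library of step S101 can be pictured as a small container of word lists. The following is a minimal sketch only: the class name `ResourceLibrary`, the method `categories_of` and all entries are hypothetical and chosen for illustration; only the three list names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceLibrary:
    """Word lists used when identifying the core words of a query."""
    stopword_list: set = field(default_factory=set)    # stopwordList in the text
    modifyword_list: set = field(default_factory=set)  # modifywordList in the text
    # dicResource in the text: entity name -> set of categories it may belong to
    dic_resource: dict = field(default_factory=dict)

    def categories_of(self, term):
        """Return the entity categories recorded for a term (empty set if none)."""
        return self.dic_resource.get(term, set())


# Toy entries (assumptions); a real library would be mined from retrieval
# logs or crawled from vertical websites, as the text describes.
library = ResourceLibrary(
    stopword_list={"in", "what", "why"},
    modifyword_list={"beautiful"},
    dic_resource={"why": {"song"}, "kungfu": {"movie"}},
)
```

Storing a set of categories per entity name is a design assumption that lets the same name (for example a title shared by a song and a movie) carry several categories.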
Step S102, performing basic layering on the search terms input by the user;
When a user inputs a retrieval sentence, the sentence is segmented to obtain a series of query words term and parts of speech pos: term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], where term[i] is the i-th word and pos[i] is its part of speech.
Basic layering of the query words input by the user is performed using the stop word list and the modifier word list of the resource library together with the part of speech of each word, specifically as follows:
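\[
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \ \lor\ \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{otherwise}
\end{cases}
\qquad i = 1, 2, \dots, n
\]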
wherein term[i] denotes the i-th word, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions, symbols, and the like.
If term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0 (a discarded word); if term[i] belongs to the modifier word list, level[i] is 1 (a modifier); otherwise level[i] is 2 (a core word).
Through this step, every word in the user's retrieval is preliminarily assigned a level.
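The basic layering rule of step S102 can be sketched in a few lines. This is an illustrative sketch, not the patented implementation: the part-of-speech tag names and the word list entries are assumptions, and the input is assumed to be already segmented and tagged.

```python
# Hypothetical word lists and part-of-speech tags, for illustration only.
STOPWORD_LIST = {"why", "of", "the"}
MODIFYWORD_LIST = {"beautiful", "classic"}
CPOS_LIST = {"adj", "adv", "prep", "interj", "aux", "modal", "conj", "punct"}


def basic_levels(tagged_terms):
    """Assign level 0 (discarded), 1 (modifier) or 2 (core) to each term.

    tagged_terms is a list of (term, pos) pairs from a word segmenter.
    The checks run in the order given in the text, so a stop word or an
    unimportant part of speech wins over membership in the modifier list.
    """
    levels = []
    for term, pos in tagged_terms:
        if term in STOPWORD_LIST or pos in CPOS_LIST:
            levels.append(0)
        elif term in MODIFYWORD_LIST:
            levels.append(1)
        else:
            levels.append(2)
    return levels


tagged = [("why", "r"), ("classic", "n"), ("kungfu", "n"), ("!", "punct")]
print(basic_levels(tagged))  # [0, 1, 2, 0]
```

Note that because adjectives are listed in cposList, a modifier tagged as an adjective would receive level 0 under this reading; the modifier list only takes effect for words whose part of speech is not in cposList.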
Step S103, performing entity introduction on the search terms after the basic layering;
The words in a retrieval sentence differ in importance and level, and distinguishing the more important, representative words matters most. Entity words are among the more important words and best reveal the user's intent. If the retrieval sentence contains entity words, their role should be highlighted.
Entity introduction mainly salvages important words that were classified as modifiers or discarded words during basic layering and assigns them an important level again.
Given the importance and complexity of entities, whether a word is an entity must be determined in combination with the user's input. For example, "why" is one of the most common words, but it may also appear in the entity dictionary with the category song. Distinguishing such words, especially ambiguous entity words, is the most important part of this step and may be called entity disambiguation.
Two methods are considered for extracting entities. 1) Category relevance: the classification of the user's retrieval is taken into account, and an entity is extracted when its category is related to the classification information; otherwise it is not used.
Specifically, the first method uses external information such as the Query classification (the classification of the user's retrieval sentence), which is common in search engines. For example, for the query "fun downloading of funny movies on a planet" (a request to download a funny movie), the Query category is the download class; for "why songs in May days were listened to", the Query category is the song class; and for "why the mobile phone is not connected to the computer", the Query category is the question-answer class.
Entity extraction from the retrieval sentence makes use of this category information. For example, "kungfu" is an ordinary word, but in the user retrieval above it is actually a movie name and therefore an entity name, and its entity category is movie (candidate entity words in the user's sentence and their entity categories can be obtained by looking up the entity resource dictionary described above); because the Query category (download class) is associated with the entity category (movie), it is extracted. Similarly, "why" is a stop word and was assigned the discarded-word level by basic layering in the previous step; it is nevertheless proposed as a candidate entity by the entity resource dictionary, with the entity category song (there is a song named "Why"), and because the Query category (song class) is associated with the entity category (song), it is considered an entity. In "why the mobile phone is not connected to the computer", although "why" also appears as a candidate entity, the Query category (question-answer class) is not associated with the entity category (song), so it is not considered an entity.
The association can be configured manually and flexibly through an association table that indicates which entity categories each Query category may be associated with, such as "download class: songs, movies, dramas, games, software"; "song class: songs"; "video class: movies, television shows, animations"; and so on.
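The category-relevance method can be sketched as a lookup in the association table followed by a set intersection. The table below copies the three example rows from the text; the dictionary contents, the category labels and the function name `extract_entities_by_category` are assumptions made for illustration.

```python
# Association table: Query category -> entity categories it may relate to.
ASSOCIATION = {
    "download": {"song", "movie", "drama", "game", "software"},
    "song": {"song"},
    "video": {"movie", "tv show", "animation"},
}

# Toy entity resource dictionary (hypothetical entries).
DIC_RESOURCE = {"kungfu": {"movie"}, "why": {"song"}}


def extract_entities_by_category(terms, query_category):
    """Keep candidate entities whose category is associated with the Query category."""
    allowed = ASSOCIATION.get(query_category, set())
    return [t for t in terms if DIC_RESOURCE.get(t, set()) & allowed]


# "kungfu" is a movie name and the Query is a download request: extracted.
print(extract_entities_by_category(["funny", "movie", "kungfu", "download"], "download"))
# "why" is a song title, but a question-answer Query is not associated with songs.
print(extract_entities_by_category(["why", "phone", "computer"], "question-answer"))
```

A Query category missing from the table yields an empty allowed set, so nothing is extracted; this matches the text's rule that an unrelated candidate is simply not used.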
Of course, in practice not every Query has a category. What should be done if the user's retrieval sentence has no category? Experience shows that a Query containing an obvious entity word can basically be assigned a category; if no category is assigned, the preferred category can be selected directly according to the length of the candidate entities and the number of segmented words, so as to guarantee accuracy.
The main purpose of entity introduction is to "salvage" core words. Basic layering yields an approximate layering based on literal meaning, but common words may be assigned to the discarded-word or modifier level; when closer analysis shows that such a word is in fact an important entity word, it is salvaged and given the core-word level. For example, "because of love" is segmented into "because" and "love"; "because" is too common, so it is treated as a discarded word in basic layering. However, it is part of an entity (the song "Because of Love"), so it is assigned the core-word level at this step. As mentioned above, entity introduction is mainly a matter of entity disambiguation, that is, of extracting the truly useful entities while introducing as little noise as possible and guaranteeing recall and accuracy; the two methods are designed for this purpose.
Of course, the first method relies on external Query classification, and its accuracy is high.
2) Sentence rule extraction: if a word T appears in the entity dictionary, it is extracted when the sentence matches the pattern (person name | demand word) + T, or (person name) + T + (demand word). Examples are user retrievals such as "why a zeita song" and "why a song", in which "why" can be considered an entity.
The second method starts directly from rules, such as the observation that entity words, especially those that also have an ordinary meaning, generally appear together with person names or demand words (song, movie, and so on). In the examples above "why" is an entity, whereas in "why the mobile phone is not connected to the computer" it is not. This method is easy to implement.
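The sentence-rule method can be sketched as a check on the neighbours of each dictionary word. The sketch below rests on one reading of the two patterns, stated here as an assumption: a word T is accepted when the word before it is a person name or a demand word, or when the word after it is a demand word. The function name and all word lists are hypothetical.

```python
def extract_entities_by_rule(terms, entity_dict, person_names, demand_words):
    """Accept a dictionary word T when its neighbours match a sentence rule.

    Assumed reading of the rules in the text:
      (person name | demand word) + T   -> previous word is a name or demand word
      (person name) + T + (demand word) -> next word is a demand word
    """
    entities = []
    for i, term in enumerate(terms):
        if term not in entity_dict:
            continue
        prev_word = terms[i - 1] if i > 0 else None
        next_word = terms[i + 1] if i + 1 < len(terms) else None
        before_ok = prev_word in person_names or prev_word in demand_words
        after_ok = next_word in demand_words
        if before_ok or after_ok:
            entities.append(term)
    return entities


ENTITY_DICT = {"why"}          # toy entity dictionary
PERSON_NAMES = {"singer_x"}    # placeholder person name
DEMAND_WORDS = {"song", "movie"}

print(extract_entities_by_rule(["why", "song"], ENTITY_DICT, PERSON_NAMES, DEMAND_WORDS))
print(extract_entities_by_rule(["why", "phone", "not", "connect", "computer"],
                               ENTITY_DICT, PERSON_NAMES, DEMAND_WORDS))
```

Only immediate neighbours are inspected here; a looser window would be an equally plausible reading of the rule.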
The actual entity word set entityList is extracted according to the entity dictionary in combination with the user's retrieval sentence, and the levels are updated as follows:
\[
\mathrm{level}[i] =
\begin{cases}
2, & \mathrm{term}[i] \in \mathrm{entityList} \\
\mathrm{level}[i], & \text{otherwise}
\end{cases}
\qquad i = 1, 2, \dots, n
\]
wherein term[i] denotes the i-th word, level[i] is its level, and entityList is the extracted entity set.
The purpose of this step is to raise the level of the entity words contained in the user's retrieval sentence (which basic layering may have classed as discarded words or modifiers), so as to highlight the user's intent.
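The level update of entity introduction is a single pass over the terms. In this minimal sketch the function name `introduce_entities` is hypothetical, and entityList is assumed to hold the individual segmented words that make up each extracted entity.

```python
def introduce_entities(terms, levels, entity_list):
    """Raise every term found in entity_list to level 2; keep other levels."""
    return [2 if term in entity_list else level
            for term, level in zip(terms, levels)]


# "because" was discarded (level 0) by basic layering, but it is part of the
# song title "Because of Love", so both of its words are treated as entity words.
terms = ["because", "love"]
levels = [0, 2]
print(introduce_entities(terms, levels, {"because", "love"}))  # [2, 2]
```

The function returns a new list rather than modifying `levels`, so the result of basic layering stays available for comparison.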
Step S106, outputting the hierarchical structure of the identified user search terms.
Through the above steps, the level of every word in each retrieval sentence is finally obtained, namely whether the word is a demand word, a core word, a modifier or a discarded word.
The above steps basically complete the recognition of the search terms input by the user. To achieve a better effect, the embodiment of the present invention may further include the following steps S104 and S105, which have no fixed order and either or both of which may be used:
Step S104, performing sentence-pattern syntactic analysis on the user search terms;
The two steps above, basic layering and entity introduction, layer the user's input from the perspective of individual words. Retrieval sentences input by users also contain many fixed sentence patterns, and some sentence-pattern rules can be used to assist layering, such as (from) $ Adress., $ Adress; (. mobile phone) download; (discussion.) and (relationship); the words in parentheses may be assigned the modifier level.
In addition, dependency syntax analysis can be performed on the user's retrieval sentence to analyze its structure and obtain the dependency relationships between the words and phrases it contains; special sentence structures are then used to adjust the levels of the words from the perspective of the sentence.
This step considers the user's input sentence as a whole and adjusts the levels of the words accordingly.
Step S105, performing subordination relationship identification on the user search terms.
As an example, embodiments of the present invention divide subordination relationships into two categories: regional subordination and domain subordination.
Regional subordination: when two place names are in a superior-subordinate relationship, the superior place name is adjusted to a modifier in order to highlight the core place name. For example, in "Beijing Haidian", Haidian belongs to Beijing, so "Haidian" is more likely to be the core word than "Beijing", and "Beijing" is adjusted to a modifier. Regional subordination can be identified by means of place-name codes.
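The regional adjustment can be sketched with a child-to-parent place table. The text says the relationship can be identified from place-name codes; the explicit table `PLACE_PARENT` below stands in for such a code lookup and, like the function name and the example entries (a district of Beijing), is an assumption for illustration.

```python
# Hypothetical child -> parent table standing in for a place-name code lookup.
PLACE_PARENT = {"Haidian": "Beijing"}

MODIFIER = 1  # modifier level, as defined by the basic layering rule


def demote_parent_places(terms, levels):
    """If a place and its superior place both occur, make the superior a modifier."""
    new_levels = list(levels)
    present = set(terms)
    for term in terms:
        parent = PLACE_PARENT.get(term)
        if parent in present:
            new_levels[terms.index(parent)] = MODIFIER
    return new_levels


print(demote_parent_places(["Beijing", "Haidian", "hotel"], [2, 2, 2]))  # [1, 2, 2]
print(demote_parent_places(["Beijing", "hotel"], [2, 2]))                # [2, 2]
```

When the superior place appears alone it keeps its level, since there is no more specific place name to highlight.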
Domain subordination concerns the category domain to which an entity name belongs, such as TV drama, movie or song; this information comes from the entity dictionary. After entity identification in step S103, if words related to the entity's category appear before or after the entity, those words are adjusted to demand words. In essence, a demand word is an attribute indicating what the user is searching for; it is therefore related to a specific entity and generally appears together with it. Hence, once an entity has been identified, the subordination relationship determines whether a demand word is present. For example, in "the song Forgetting Water of Liu Dehua", "Forgetting Water" belongs to the category song, so the word "song" is adjusted to a demand word, and the core words are "Liu Dehua" and "Forgetting Water". In this way the core words of the user's sentence are made clear, and so is the user's essential requirement (a song), which can be used to optimize search ranking. If the user inputs "movie of Liu Dehua", no subordination relationship is identified, so the word "movie" remains a core word and is not identified as a demand word; otherwise the retrieval results might be unrelated to movies.
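The demand-word adjustment can be sketched as follows. Several details are assumptions rather than statements of the text: the numeric code `DEMAND = 3` (the text gives no number for the demand-word level), the table `CATEGORY_WORDS`, the function name, and the choice to inspect only the words immediately before and after an entity.

```python
DEMAND = 3  # assumed code for the demand-word level; the text assigns no number

# Hypothetical table: entity category -> words that name that category.
CATEGORY_WORDS = {"song": {"song", "lyrics"}, "movie": {"movie"}}


def mark_demand_words(terms, levels, entity_categories):
    """Mark category words adjacent to a recognised entity as demand words.

    entity_categories maps each recognised entity term to its category.
    """
    new_levels = list(levels)
    for i, term in enumerate(terms):
        category = entity_categories.get(term)
        if category is None:
            continue
        related = CATEGORY_WORDS.get(category, set())
        for j in (i - 1, i + 1):
            if 0 <= j < len(terms) and terms[j] in related:
                new_levels[j] = DEMAND
    return new_levels


# "Forgetting Water" is a recognised song, so the neighbouring "song" becomes a demand word.
print(mark_demand_words(["Liu Dehua", "song", "Forgetting Water"], [2, 2, 2],
                        {"Forgetting Water": "song"}))  # [2, 3, 2]
# No entity with a matching category: "movie" stays a core word.
print(mark_demand_words(["Liu Dehua", "movie"], [2, 2], {}))  # [2, 2]
```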
As shown in Fig. 2, a block diagram of the second embodiment of the present invention, a system for processing user search terms is provided, comprising:
a resource library establishing module 201, configured to establish a resource library for identifying the core words of a user search;
a basic layering module 202, configured to perform basic layering on the search terms input by the user;
an entity introducing module 203, configured to perform entity introduction on the search terms after the basic layering;
an output module 204, configured to output the hierarchical structure of the identified search terms.
Further, the resource library for identifying the core words of a user search comprises a series of word lists related to identifying those core words, including a stop word list, a modifier word list, and an entity resource dictionary.
Further, the basic layering module being configured to perform basic layering on the search terms input by the user specifically comprises:
the basic layering module is configured to segment the user's retrieval sentence to obtain a series of query terms term with parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], where term[i] is the i-th term and pos[i] is its part of speech;
and to perform basic layering of the query terms input by the user using the stop word list and the modifier word list of the resource library together with the parts of speech of the terms, as follows:
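$$
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \;\|\; \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{other}
\end{cases}
\qquad i = 1, 2, \ldots, n
$$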
where term[i] denotes the i-th term, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions, and symbols.
If term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] belongs to the modifier word list, level[i] is 1; otherwise, level[i] is 2.
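The layering rule can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the contents of the word lists, the part-of-speech tag names, and the function names are hypothetical, and the input is assumed to be already segmented and tagged. The sketch tests the cases in the order listed, so a term that matches both the first and the second case receives level 0; the source does not state a precedence explicitly.

```python
# Hypothetical resource library contents.
STOPWORD_LIST = {"the", "of", "a"}
MODIFYWORD_LIST = {"download", "free"}
# Hypothetical tags for the unimportant parts of speech (cposList).
CPOS_LIST = {"adj", "adv", "prep", "interj", "aux", "modal", "conj", "sym"}


def basic_level(term, pos):
    """Return level[i] for one (term[i], pos[i]) pair."""
    if term in STOPWORD_LIST or pos in CPOS_LIST:
        return 0
    if term in MODIFYWORD_LIST:
        return 1
    return 2


def basic_layering(tagged_terms):
    """Apply the rule to a segmented query given as (term, pos) pairs."""
    return [basic_level(term, pos) for term, pos in tagged_terms]


query = [("the", "det"), ("download", "v"), ("song", "n"), ("quickly", "adv")]
print(basic_layering(query))
```

For the sample query, "the" is a stop word (level 0), "download" is a modifier (level 1), "song" falls into the default case (level 2), and "quickly" is demoted to level 0 because of its part of speech.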
Further, the system also includes:
a sentence pattern syntactic analysis module, configured to perform sentence pattern syntactic analysis on the user search terms; and/or
a dependency relationship identification module, configured to identify dependency relationships in the user search terms.
The foregoing description shows and describes preferred embodiments of the invention. As noted above, however, the invention is not limited to the forms disclosed herein; the description should not be regarded as excluding other embodiments, and the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the inventive concept described herein in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection of the appended claims.

Claims (8)

1. A method for processing user search terms, characterized by comprising the following steps:
establishing a resource library for identifying the core words of a user search;
performing basic layering on the search terms input by the user;
performing entity introduction on the search terms after the basic layering;
outputting the hierarchical structure of the identified search terms;
wherein performing basic layering on the search terms input by the user specifically comprises:
segmenting the user's retrieval sentence to obtain a series of query terms term with parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], where term[i] is the i-th term and pos[i] is its part of speech;
performing basic layering of the query terms input by the user using the stop word list and the modifier word list of the resource library together with the parts of speech of the terms, specifically as follows:
$$
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \;\|\; \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{other}
\end{cases}
\qquad i = 1, 2, \ldots, n
$$
where term[i] denotes the i-th term, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions, and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] belongs to the modifier word list, level[i] is 1; otherwise, level[i] is 2.
2. The method of claim 1, wherein the resource library for identifying the core words of a user search comprises a series of word lists related to identifying those core words, including a stop word list, a modifier word list, and an entity resource dictionary.
3. The method of claim 2, wherein performing entity introduction on the search terms after the basic layering comprises:
extracting an actual entity word set entityList according to the entity dictionary in combination with the user's retrieval sentence;
$$
\mathrm{level}[i] =
\begin{cases}
2, & \mathrm{term}[i] \in \mathrm{entityList} \\
\mathrm{level}[i], & \text{other}
\end{cases}
\qquad i = 1, 2, \ldots, n
$$
where term[i] denotes the i-th term, level[i] is its level, and entityList is the extracted entity set.
4. The method of claim 3, wherein extracting the actual entity word set entityList according to the entity dictionary in combination with the user's retrieval sentence comprises:
considering the relevance of the user's retrieval classification, and extracting an entity word when the category of the entity is related to the classification information; or
extracting the entity words by using sentence pattern rules.
5. The method of any of claims 1 to 4, further comprising, prior to outputting the hierarchical structure of the identified search terms:
performing sentence pattern syntactic analysis on the user search terms; and/or
performing dependency relationship identification on the user search terms.
6. A system for processing user search terms, comprising:
a resource library establishing module, configured to establish a resource library for identifying the core words of a user search;
a basic layering module, configured to perform basic layering on the search terms input by the user;
an entity introducing module, configured to perform entity introduction on the search terms after the basic layering;
an output module, configured to output the hierarchical structure of the identified search terms;
wherein the basic layering module being configured to perform basic layering on the search terms input by the user specifically comprises: the basic layering module is configured to segment the user's retrieval sentence to obtain a series of query terms term with parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], …, term[n]_pos[n], where term[i] is the i-th term and pos[i] is its part of speech;
and to perform basic layering of the query terms input by the user using the stop word list and the modifier word list of the resource library together with the parts of speech of the terms, as follows:
$$
\mathrm{level}[i] =
\begin{cases}
0, & \mathrm{term}[i] \in \mathrm{stopwordList} \;\|\; \mathrm{pos}[i] \in \mathrm{cposList} \\
1, & \mathrm{term}[i] \in \mathrm{modifywordList} \\
2, & \text{other}
\end{cases}
\qquad i = 1, 2, \ldots, n
$$
where term[i] denotes the i-th term, level[i] is its level, stopwordList is the stop word list, modifywordList is the modifier word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions, and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] belongs to the modifier word list, level[i] is 1; otherwise, level[i] is 2.
7. The system of claim 6, wherein the resource library for identifying the core words of a user search comprises a series of word lists related to identifying those core words, including a stop word list, a modifier word list, and an entity resource dictionary.
8. The system of claim 7, further comprising:
a sentence pattern syntactic analysis module, configured to perform sentence pattern syntactic analysis on the user search terms; and/or
a dependency relationship identification module, configured to identify dependency relationships in the user search terms.
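The entity introduction rule recited in claims 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the entity dictionary is assumed to map entity names to categories, the query's classification is assumed to be available as a set of category labels, and all names and sample data are hypothetical. Only the category-relevance variant of claim 4 is shown; extraction by sentence pattern rules is not covered.

```python
def extract_entities(terms, entity_dict, query_categories):
    """Build entityList: terms found in the entity dictionary whose
    category is related to the classification of the query."""
    return {t for t in terms if entity_dict.get(t) in query_categories}


def introduce_entities(terms, levels, entity_list):
    """Raise level[i] to 2 for every term in entityList; keep other levels."""
    return [2 if term in entity_list else level
            for term, level in zip(terms, levels)]


# Hypothetical entity dictionary and a query classified as song-related.
entity_dict = {"Forgetting Water": "song", "Free": "movie"}
terms = ["Free", "Forgetting Water"]
levels = [1, 0]  # assumed output of the basic layering step

entity_list = extract_entities(terms, entity_dict, {"song"})
print(introduce_entities(terms, levels, entity_list))
```

Here "Forgetting Water" is promoted to level 2 because its category matches the query's classification, while "Free" keeps its basic level because a movie entity is not relevant to a song query.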
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310005804.6A CN103020311B (en) 2013-01-08 2013-01-08 A kind of processing method of user search word and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310005804.6A CN103020311B (en) 2013-01-08 2013-01-08 A kind of processing method of user search word and system

Publications (2)

Publication Number Publication Date
CN103020311A CN103020311A (en) 2013-04-03
CN103020311B (en) 2016-05-18

Family

ID=47968914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310005804.6A Active CN103020311B (en) 2013-01-08 2013-01-08 A kind of processing method of user search word and system

Country Status (1)

Country Link
CN (1) CN103020311B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491556A (en) * 2017-09-04 2017-12-19 湖北地信科技集团股份有限公司 Space-time total factor semantic query service system and its method
CN109492214B (en) * 2017-09-11 2023-09-19 苏州大学 Attribute word recognition and hierarchy construction method, device, equipment and storage medium
CN107992586A (en) * 2017-12-08 2018-05-04 成都谷问信息技术有限公司 Search method based on the intelligent meaning of one's words
CN112069801B (en) * 2020-09-14 2024-09-20 深圳前海微众银行股份有限公司 Sentence trunk extraction method, device and readable storage medium based on dependency syntax
CN112800175B (en) * 2020-11-03 2022-11-25 广东电网有限责任公司 Cross-document searching method for knowledge entities of power system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5721902A (en) * 1995-09-15 1998-02-24 Infonautics Corporation Restricted expansion of query terms using part of speech tagging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
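Research on a Natural Language Understanding Search Method Based on Phrase Recognition (基于短语识别的自然语言理解搜索方法研究); Qi Bo (齐波); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2008-05-15 (No. 5); pp. 18-42, Figs. 4.2 and 4.9 *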

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518057 C Building 5, Nanshan District software industry base, Shenzhen, Guangdong 403-409, China

Patentee after: Shenzhen easou world Polytron Technologies Inc

Address before: 518026 Guangdong city of Shenzhen province Futian District Binhe Road and CaiTian Road Interchange Union Square Tower A, A5501-A

Patentee before: Shenzhen Yisou Science & Technology Development Co., Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and system for processing user search terms

Effective date of registration: 20170918

Granted publication date: 20160518

Pledgee: Shenzhen SME financing Company limited by guarantee

Pledgor: Shenzhen easou world Polytron Technologies Inc

Registration number: 2017990000881

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20200428

Granted publication date: 20160518

Pledgee: Shenzhen SME financing Company limited by guarantee

Pledgor: Shenzhen easou world Polytron Technologies Inc

Registration number: 2017990000881