CN104850554B - Searching method and system - Google Patents

Publication number
CN104850554B
Authority
CN
China
Prior art keywords
semantic
words
entity
word string
query word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410051875.4A
Other languages
Chinese (zh)
Other versions
CN104850554A (en)
Inventor
张友书
张坤
张阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201410051875.4A
Publication of CN104850554A
Application granted
Publication of CN104850554B

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a searching method and a searching system, wherein the method comprises the following steps: when a query word string is received, performing semantic analysis on the query word string to obtain a semantic expression corresponding to the query word string; matching analysis is carried out by combining the semantic expression, and the semantic label of each word in the current query word string is determined; rewriting the query word string according to the semantic tag; and searching by using the rewritten query word string to obtain matched network information. According to the method and the device, semantic analysis is carried out on the query word string to obtain the semantic expression, the semantic label to which each word belongs in the semantic expression conforming to the current context is further determined, the query word string is rewritten based on the semantic label, the user intention is better met, the success rate of information matching during searching is high, and the searching quality and the searching efficiency are improved.

Description

Searching method and system
Technical Field
The present application relates to the field of computer technologies, and in particular, to a search method and a search system.
Background
Query rewriting means rewriting the original query terms entered by the user during the search engine's query process so that better search results are returned. In the prior art, query rewriting mainly corrects user input errors. For example, when the user inputs 'go-to-conclusion', 'zoujielilun' or 'zhoujielilun', the search engine has difficulty finding the correct web pages for the user. With query correction, an error correction model analyzes an input such as 'zoujielilun'. Because text matching results corresponding to 'Zhoujilun' make up a large proportion of the analyzed results, the input is corrected to the query word 'Zhoujilun', which matches the user's original intention. The search engine can then return web pages that match the user's intention without user intervention, which improves the user experience.
Existing web page search technology mainly performs queries based on keywords. When a user inputs a query, the search engine performs Chinese word segmentation on it, converts it into several keywords, looks these keywords up in an inverted index of web pages, and retrieves the web pages that hit the keywords. It then applies a ranking algorithm that sorts the hit pages by relevance, timeliness, user intention and similar factors, and returns the web page links to the user in order.
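The keyword-based flow described above can be sketched as follows. This is a minimal illustration only: whitespace splitting stands in for Chinese word segmentation, the number of keyword hits stands in for the ranking algorithm, and the page contents and names (`build_index`, `search`) are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index: keyword -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, keywords):
    """Return page ids that hit any keyword, most hits first."""
    scores = defaultdict(int)
    for keyword in keywords:
        for page_id in index.get(keyword.lower(), ()):
            scores[page_id] += 1
    # Sort by descending hit count, then by page id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical sample pages.
PAGES = {
    "p1": "xie tingfeng son lucas",
    "p2": "xie tingfeng father",
    "p3": "weather forecast",
}
INDEX = build_index(PAGES)
```

Because matching is purely textual, a page that merely contains more of the keywords ranks first, regardless of what the user actually meant.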
The existing keyword-based search technology, namely the 'query word -> keyword -> search' mode that depends on character string matching, simply segments the query word. It easily loses part of the information and deviates from the user's intention, so effective results cannot be obtained through the keywords.
For example, as shown in FIG. 1, when a search engine searches for the query word string "whose son is Xie Tingfeng", the keywords obtained after word segmentation are "Xie Tingfeng", "who", and "son", and the search is performed with these three keywords. Because "Lucas" appears on the network far more frequently than "Xie Xian", most of the web pages returned by text matching alone describe "Xie Tingfeng's son", that is, web pages related to Lucas. The matching success rate of search results obtained purely by text matching is therefore often low, and it is difficult to meet the user's needs.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a search method and a search system that address the problems in the prior art that search results have a low matching success rate and that user requirements are difficult to meet.
In order to solve the above problem, the present application discloses a search method, including:
when a query word string is received, performing semantic analysis on the query word string to obtain a semantic expression corresponding to the query word string;
matching analysis is carried out by combining the semantic expression, and the semantic label of each word in the current query word string is determined;
rewriting the query word string according to the semantic tag;
and searching by using the rewritten query word string to obtain matched network information.
Preferably, when a query word string is received, performing semantic analysis on the query word string to obtain a semantic expression corresponding to the query word string includes:
searching entity words corresponding to the query word string in an entity word list preset in a knowledge base;
and searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base.
Preferably, the step of determining the semantic label to which each word in the current query word string belongs includes:
extracting preset semantic tags of the attribute words;
marking one or more original semantic labels on the entity words;
respectively judging whether the entity words marked with the original semantic tags have a predefined association relationship with the attribute words marked with the semantic tags; if so, determining that the original semantic label with the predefined association relationship is the semantic label to which the entity word belongs.
Preferably, the step of rewriting the query word string according to the semantic tag includes:
searching preset identification entity words by adopting the semantic tags;
replacing the entity words with preset identification entity words;
and/or,
replacing the attribute words with preset identification attribute words;
and/or,
judging whether the query word string conforms to a syntactic rule of reverse expression; if so, acquiring a corresponding preset expression that is stored in the server and conforms to a syntactic rule of forward expression; the preset expression has a use frequency;
and when the use frequency of the preset expression is higher than a preset threshold value, rewriting the query word string according to the syntactic rule of forward expression.
Preferably, the identification entity word is the entity word that has the same semantic tag as the entity word and is used most frequently;
the identification attribute word is the attribute word that describes the same kind of entity word as the attribute word and is used most frequently.
Preferably, the step of determining whether the query word string conforms to a syntactic rule of a reverse expression includes:
performing syntactic analysis on the query word string to obtain a subject and a modifier, and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
and when the subject is the entity word, the modifier word is the attribute word, and the dependency relationship is the dependency relationship that the subject depends on the modifier word, the query word string conforms to the syntactic rule of reverse expression.
The present application also discloses a search system, comprising:
the part-of-speech analysis module is used for performing semantic analysis on the query word string when the query word string is received to obtain a semantic expression corresponding to the query word string;
the semantic tag determining module is used for performing matching analysis by combining the semantic expression and determining the semantic tag of each word in the current query word string;
the rewriting module is used for rewriting the query word string according to the semantic label;
and the query module is used for searching by using the rewritten query word string to obtain matched network information.
Preferably, the part of speech parsing module includes:
the entity word searching module is used for searching entity words corresponding to the query word string in an entity word list preset in a knowledge base;
and the attribute word searching module is used for searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base.
Preferably, the semantic tag determining module comprises:
the extraction submodule is used for extracting the preset semantic tags of the attribute words;
a marking submodule for marking the entity word with one or more original semantic tags;
the incidence relation judging module is used for respectively judging whether the entity words marked with the original semantic labels have predefined incidence relation with the attribute words marked with the semantic labels; if yes, calling a determining submodule;
and the determining submodule is used for determining that the original semantic label with the predefined association relationship is the semantic label to which the current entity word belongs.
Preferably, the rewriting module includes:
the identification entity word searching submodule is used for searching preset identification entity words by adopting the semantic tags;
the identification entity word replacing submodule is used for replacing the entity words with preset identification entity words;
and/or,
the identification attribute word replacing submodule is used for replacing the attribute words with preset identification attribute words;
and/or,
the reverse expression judging submodule is used for judging whether the query word string conforms to a syntactic rule of reverse expression; if so, calling a preset expression obtaining submodule;
the preset expression obtaining submodule is used for obtaining a corresponding preset expression that is stored in the server and conforms to a syntactic rule of forward expression; the preset expression has a use frequency;
and the forward expression rewriting submodule is used for rewriting the query word string according to the syntactic rule of forward expression when the use frequency of the preset expression is higher than a preset threshold value.
Preferably, the identification entity word is the entity word that has the same semantic tag as the entity word and is used most frequently;
the identification attribute word is the attribute word that describes the same kind of entity word as the attribute word and is used most frequently.
Preferably, the reverse expression judgment submodule includes:
the syntax analysis submodule is used for carrying out syntax analysis on the query word string to obtain a subject and a modifier and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
and the judging submodule is used for judging that the query word string conforms to the syntactic rule of the reverse expression when the subject is the entity word, the modifier word is the attribute word and the dependency relationship is the dependency relationship of the subject on the modifier word.
Compared with the prior art, the method has the following advantages:
according to the method and the device, semantic analysis is carried out on the query word string to obtain the semantic expression, the semantic label to which each word belongs in the semantic expression conforming to the current context is further determined, the query word string is rewritten based on the semantic label, the user intention is better met, the success rate of information matching during searching is high, and the searching quality and the searching efficiency are improved.
According to the method and the device, the entity words and the attribute words are rewritten into the entity identification words and the attribute identification words which are friendly to a search engine, the query word string which is not commonly used and is reversely expressed is rewritten into the query word string which is commonly used and is forwardly expressed, the coverage rate of search information of the search engine is improved, and the success rate of information matching is further improved.
Drawings
FIG. 1 is an exemplary diagram of a search result of the prior art;
FIG. 2 is a flow chart of the steps of one embodiment of a search method of the present application;
FIG. 3 is an exemplary diagram of a forward expression rewrite of the present application;
FIG. 4 is an exemplary diagram of a search result of the present application;
FIG. 5 is a block diagram of a search system embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
A knowledge base is a structured, easy-to-operate, easy-to-use, comprehensive and organized collection of knowledge in knowledge engineering. It is a set of interconnected knowledge pieces that are stored, organized, managed and used in computer memory using one or more knowledge representation schemes, according to the needs of solving problems in one or more fields. These knowledge pieces include theoretical knowledge related to a field, factual data, and heuristic knowledge derived from expert experience, such as definitions, theorems and algorithms related to a field, as well as common-sense knowledge.
One of the core ideas of the application is that the query word string is rewritten according to the grammar specification based on the knowledge base so as to obtain a search result which more comprehensively conforms to the intention of the user.
Referring to FIG. 2, a flow chart of steps of an embodiment of a search method of the present application is shown.
Step 201, when a query word string is received, performing semantic analysis on the query word string to obtain a semantic expression corresponding to the query word string;
the query word string may be a phrase or sentence input by a user at a client (e.g., a web page of a search engine, a search plug-in of a browser, etc.) for requesting a search for information related thereto.
For the query word string, semantic analysis is required, which may specifically include judging whether the query word string exceeds a preset length, performing word segmentation on the query word string, and the like, and then identifying entity words and attribute words in the query word string.
In a preferred embodiment of the present application, the step 201 may specifically include the following sub-steps:
substep S11, searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base;
and a substep S12, searching the entity words corresponding to the query word string in an entity word list preset in a knowledge base.
By applying the embodiment of the application, the knowledge base can be analyzed and constructed in advance according to the data captured in the whole network. Specifically, the knowledge base may store an entity word list and an attribute word list.
In the entity word list, entity words collected in advance can be recorded; in the attribute word list, attribute words collected in advance may be recorded.
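A minimal sketch of sub-steps S11 and S12, assuming the two word lists are available as in-memory sets. The list contents and the function names (`find_words`, `analyze`) are hypothetical, and plain substring matching stands in for word segmentation, so short words could match inside longer ones in a real setting.

```python
# Hypothetical word lists standing in for the knowledge base.
ENTITY_WORDS = {"Liu Dehua", "Xie Tingfeng"}
ATTRIBUTE_WORDS = {"wife", "father", "son", "height"}


def find_words(query, vocabulary):
    """Return the vocabulary words found in the query, in order of appearance."""
    lowered = query.lower()
    hits = [(lowered.find(w.lower()), w) for w in vocabulary if w.lower() in lowered]
    return [word for _, word in sorted(hits)]


def analyze(query):
    """Identify the entity words and attribute words in a query word string."""
    return {
        "entities": find_words(query, ENTITY_WORDS),
        "attributes": find_words(query, ATTRIBUTE_WORDS),
    }
```

Calling `analyze("who is the wife of Liu Dehua")` separates the entity word from the attribute word, which is the input the later tagging step works on.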
Based on a Resource Description Framework (RDF), i.e. a data model of network Resource objects and relationships between them, triples in the form of "entity-attribute-value" may be used to describe various resources and relationships between them.
1. Entity: a specific individual in a category, for example individuals in the star category such as Liu Dehua, Zhang Baizhi, etc.; the term also covers categories that broadly represent individuals, such as people, movie stars, singers, etc.
2. Attribute: a property of an entity. In addition to an attribute name, each attribute has a type variable reflecting the type of the attribute value, such as [height: length], [age: integer], [date of birth].
3. Attribute value: the value corresponding to an attribute, such as 168cm (height) or 87kg (weight); attribute values are the knowledge in the knowledge base. The attribute value also records the source of the knowledge, which helps the user judge the reliability of the knowledge.
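One possible in-memory representation of such "entity-attribute-value" triples is sketched below. The class name, field names, sample entity and source string are assumptions made for illustration; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    entity: str                    # a specific individual or a category
    attribute: str                 # attribute name, e.g. "height"
    value: str                     # attribute value, e.g. "168cm"
    value_type: str                # type variable of the attribute, e.g. "length"
    source: Optional[str] = None   # where the knowledge came from


def lookup(triples, entity, attribute):
    """Return all triples recorded for the given entity and attribute."""
    return [t for t in triples if t.entity == entity and t.attribute == attribute]


# Hypothetical sample data.
TRIPLES = [
    Triple("Person A", "height", "168cm", "length", source="example-site"),
    Triple("Person A", "age", "30", "integer"),
]
```

Keeping the source next to the value lets a caller show where a fact was found, which matches the stated purpose of helping the user judge reliability.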
Wherein the attribute words can be obtained by mining web pages and search logs.
Based on RDF "entity-attribute-value" triples, the attribute words describing the "husband-wife relationship" can be found in the following way, taking as an example a triple whose entity is "Liu Dehua", whose attribute is "wife relationship", and whose value is "Zhu Liqian":
1. Mine web pages and search logs to obtain the text fragments between the entity and the value. For example, "Liu Dehua's laopo (wife) Zhu Liqian", "Liu Dehua's taitai (wife) Zhu Liqian", and "Feng Xiaogang's laopo (wife) Xu Fan".
2. Count the use frequency of the text fragments for each individual "entity-value" pair. For example, the use frequency of "Liu Dehua's laopo Zhu Liqian" is 2, the use frequency of "Liu Dehua's taitai Zhu Liqian" is 3, and the use frequency of "Feng Xiaogang's laopo Xu Fan" is 2.
3. Count the use frequency of the text fragments across "entity-value" pairs of the same type. For example, the use frequency of "<entity>'s laopo <value>" is 4, and the use frequency of "<entity>'s taitai <value>" is 3.
4. Extract the attribute words whose use frequency exceeds a preset count threshold from the text fragments. For example, if the count threshold is 2 and text fragments whose use frequency exceeds 2 are extracted as attribute words, the attribute words corresponding to the "wife relationship" are found to be "laopo" and "taitai".
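The four mining steps above can be sketched as a simple counting procedure. This is an illustration under stated assumptions: the sample sentences are romanized stand-ins for mined text, the function name `mine_attribute_words` is hypothetical, and a real system would mine web pages and search logs rather than a short list.

```python
from collections import Counter


def mine_attribute_words(texts, pairs, threshold):
    """Count the fragments between known (entity, value) pairs and keep
    those whose count across all pairs exceeds the threshold."""
    counts = Counter()
    for text in texts:
        for entity, value in pairs:
            start = text.find(entity)
            end = text.find(value)
            # The value must appear after the end of the entity.
            if start != -1 and end != -1 and start + len(entity) <= end:
                fragment = text[start + len(entity):end].strip()
                if fragment:
                    counts[fragment] += 1
    return {fragment for fragment, n in counts.items() if n > threshold}


# Hypothetical mined sentences reproducing the frequencies in the example.
TEXTS = (
    ["Liu Dehua laopo Zhu Liqian"] * 2
    + ["Liu Dehua taitai Zhu Liqian"] * 3
    + ["Feng Xiaogang laopo Xu Fan"] * 2
)
PAIRS = [("Liu Dehua", "Zhu Liqian"), ("Feng Xiaogang", "Xu Fan")]
```

With a threshold of 2 both fragments are kept (counts 4 and 3); raising the threshold to 3 keeps only the more frequent one.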
Step 202, performing matching analysis by combining the semantic expression, and determining semantic labels to which all words in the query word string belong;
Syntactic analysis is performed on the query word string, in which the entity words and attribute words have been identified, using a knowledge-base-based context-free grammar. This yields the association relationship between the entity words and the attribute words, from which the semantic tags of the entity words that fit the current context are identified.
A context-free grammar, also called a type-2 grammar, is a transformational grammar in formal language theory that is used to describe context-free languages. Specifically, a set of grammar rules is defined that can be used for syntactic analysis to obtain the sentence structure and the associations between sentence components. In particular, the grammar rules may be stored in the knowledge base.
In a preferred embodiment of the present application, the step 202 may specifically include the following sub-steps:
a substep S21 of extracting preset semantic tags of the attribute words;
the attribute words may have semantic tags with defined meanings, stored in a knowledge base.
Substep S22, labeling the entity word with one or more original semantic tags;
the original semantic tags may be information expressing the meaning of the entity words.
For example, for the query word string "on which day is Xiao Ao Jiang Hu released", the entity word "Xiao Ao Jiang Hu" may have many original semantic tags, such as movie, TV drama, novel, drama, game, etc.
Substep S23, respectively determining whether the entity words marked with original semantic tags have predefined association relationship with the attribute words marked with semantic tags; if yes, go to substep S24;
For example, suppose a grammar rule <entity_person><attribute_wife> is defined as having an association relationship. For the query word string "Liu Dehua's wife", the corresponding semantic expression may be "Liu Dehua<entity_person>'s wife<attribute_wife>". Checking shows that <entity_person><attribute_wife> satisfies the grammar rule, so the expression is legal, that is, it has a predefined association relationship, and it follows that <attribute_wife> "wife" depends on <entity_person> "Liu Dehua".
Further, assuming that <entity_person><attribute_height> is not predefined, the expression "Liu Dehua<entity_person>'s height<attribute_height>" identified from the query word string "Liu Dehua's height" is illegal and has no predefined association relationship.
And a substep S24, determining that the original semantic label having the predefined association relationship is the semantic label to which the entity word belongs currently.
For the query word string "on which day is Xiao Ao Jiang Hu released", syntactic analysis shows that "released on which day" modifies "Xiao Ao Jiang Hu", and the grammar rules show that "released on which day" is an attribute of entities in the "movie" category. It can therefore be determined that "Xiao Ao Jiang Hu" here is a movie rather than a TV drama, a novel, a game, etc.
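Sub-steps S21 to S24 can be sketched as a lookup against a set of predefined (entity tag, attribute tag) rules. All tags, rules and names below are hypothetical sample data; the patent stores such rules in the knowledge base but does not fix their encoding.

```python
# Hypothetical knowledge base content.
ENTITY_TAGS = {"Xiao Ao Jiang Hu": ["movie", "tv_drama", "novel", "game"]}
ATTRIBUTE_TAGS = {"released on which day": "release_date", "author": "author"}
# Predefined association rules: (entity tag, attribute tag).
RULES = {("movie", "release_date"), ("novel", "author")}


def resolve_tags(entity, attribute):
    """Return the original semantic tags of the entity word that have a
    predefined association with the attribute word's semantic tag."""
    attribute_tag = ATTRIBUTE_TAGS.get(attribute)      # sub-step S21
    candidates = ENTITY_TAGS.get(entity, [])           # sub-step S22
    return [tag for tag in candidates                  # sub-steps S23 and S24
            if (tag, attribute_tag) in RULES]
```

The same entity word resolves to different tags depending on the attribute word it appears with, which is how the context selects the meaning.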
Step 203, rewriting the query word string by using the semantic label;
in the embodiment of the application, the query word string with the entity attribute mark after the semantic tag is determined can be rewritten, and the natural language (query word string) input by the user is rewritten into the keyword friendly to the search engine, so that the search result is more matched with the semantic of the natural language corresponding to the query word string, the coverage rate of the search is improved, and the efficiency and the quality of the search are also improved.
Rewrites can be divided into two categories: one is entity word and attribute word replacement and rewriting, and the other is sentence pattern replacement and rewriting.
In a preferred embodiment of the present application, the step 203 may specifically include the following sub-steps:
substep S31, searching preset identification entity words by adopting the semantic tags;
a substep S32 of replacing the entity word with a preset identification entity word;
in the embodiment of the application, the corresponding relation between the natural language query and the search engine language is established in advance for the entity words and the attribute words in the knowledge base, the corresponding relation is recorded in the translation dictionary in advance, and the entity words friendly to the search engine can be obtained by searching the translation dictionary for replacement when the entity words and the attribute words are rewritten. In particular, the translation dictionary may be stored in a knowledge base.
Since the knowledge base is built from knowledge extracted from the Internet, the standard web page description of each entity word and attribute word can be determined statistically. Web pages are processed by standard description recognition, text extraction, Chinese word segmentation, entity word recognition, attribute word recognition and so on, and the number of times each entity word and attribute word occurs on the Internet is counted. Among the different expressions of the same entity or attribute, the expression that occurs most frequently on the Internet is the most search-engine-friendly and is defined as the identification entity word or identification attribute word, which improves coverage. For example, the entity words "Shi Hengxia", "Bingke" and "Sister Furong" refer to the same entity, namely Sister Furong. Counting how often each entity word appears in Internet text, taking context into account, shows that "Sister Furong" is used far more frequently than "Shi Hengxia" and "Bingke". The search-engine-friendly entity word for this entity is therefore taken to be "Sister Furong", and the entity words "Shi Hengxia" and "Bingke" in a user's natural language query are replaced with the identification entity word "Sister Furong".
That is, in the embodiment of the present application, the identification entity word may be the entity word that has the same semantic tag as the entity word and is used most frequently;
and/or,
a substep S33 of replacing the attribute word with a preset identification attribute word;
in the embodiment of the application, the corresponding relation between the natural language query and the search engine language can be established for the attribute words by adopting the same processing method as the entity words.
And obtaining corresponding search engine friendly keywords as identification attribute words through the use frequency of different descriptions (namely attribute words) of the same attribute corresponding to the same kind of entity in the Internet.
That is, for the embodiment of the present application, the identifying attribute word may be an attribute word that describes the same type of entity word as the attribute word and is used most frequently.
The rewriting process is a process of looking up the translation dictionary. For example, suppose the query word string is "where was Shi Hengxia born". After the semantic tag of the current entity word is determined, the semantic expression may be "Shi Hengxia<entity_person> where born<attribute_place of birth>". Querying the translation dictionary shows that the identification entity word corresponding to the entity word "Shi Hengxia" is "Sister Furong", and the identification attribute word corresponding to the attribute word "where born" is "place of birth".
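A minimal sketch of how such a translation dictionary might be built and applied, assuming usage counts are already available for each group of equivalent expressions. The counts, aliases and function names are hypothetical illustrations.

```python
def build_dictionary(alias_groups):
    """Map every alias in a group to the most frequently used alias.

    alias_groups: list of dicts, each mapping alias -> usage count for
    expressions that refer to the same entity or attribute.
    """
    dictionary = {}
    for group in alias_groups:
        identification_word = max(group, key=group.get)
        for alias in group:
            dictionary[alias] = identification_word
    return dictionary


def rewrite_words(words, dictionary):
    """Replace each word with its identification word, if one is known."""
    return [dictionary.get(word, word) for word in words]


# Hypothetical usage counts.
ENTITY_GROUPS = [{"Shi Hengxia": 120, "Bingke": 15, "Sister Furong": 9000}]
ATTRIBUTE_GROUPS = [{"where born": 40, "place of birth": 800}]
DICTIONARY = {**build_dictionary(ENTITY_GROUPS), **build_dictionary(ATTRIBUTE_GROUPS)}
```

Words without a dictionary entry pass through unchanged, so the rewrite never removes information from the query.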
and/or,
a substep S34 of determining whether the query word string conforms to a syntactic rule of reverse expression; if yes, go to substep S35;
A reverse expression is the counterpart of a forward expression; the two have the same semantics and describe the same thing from two opposite angles.
In a preferred embodiment of the present application, the sub-step S34 further includes the following sub-steps:
substep S341, performing syntactic analysis on the query word string to obtain a subject and a modifier, and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
Syntactic analysis derives the syntactic structure of a sentence according to a given grammar, and analyzes the syntactic units contained in the sentence and the relationships among them.
In a specific implementation, the syntactic analysis result can be obtained statistically, mainly in three steps:
1. performing syntactic analysis and labeling on each sentence in the collected corpus by adopting a manual labeling method, and further gathering the sentences into a sentence library;
2. on the basis of the sentence library, learning to obtain a PCFG (Probabilistic Context-free Grammar) model;
3. and analyzing the sentence by adopting a PCFG model to obtain corresponding sentence components (subject, predicate, object, modified component and the like) and the dependency relationship among the components. This dependency may include a dependency of a subject on a modifier, or a dependency of a modifier on a subject.
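Step 2 above, learning a PCFG from the annotated sentence library, is commonly done by relative-frequency estimation: the probability of a rule is its count divided by the count of all rules with the same left-hand side. The sketch below illustrates that estimate only. The rule occurrences are hypothetical, and the patent does not state which estimation method it uses.

```python
from collections import Counter


def estimate_pcfg(rule_occurrences):
    """Estimate rule probabilities from rule occurrences read off
    manually annotated parse trees.

    rule_occurrences: list of (lhs, rhs) pairs, where rhs is a tuple.
    Returns a dict mapping (lhs, rhs) -> probability.
    """
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}


# Hypothetical rule occurrences from a tiny sentence library.
OCCURRENCES = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("NP", ("N",)),
    ("NP", ("ADJ", "N")),
]
```

For each left-hand side the estimated probabilities sum to one, which is the defining property of a PCFG.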
In the substep S342, when the subject is the entity word, the modifier word is the attribute word, and the dependency relationship is a dependency relationship in which the subject depends on the modifier word, the query word string conforms to a syntactic rule of reverse expression.
At this time, the dependency relationship of the subject dependent modifier is the dependency relationship of the entity word dependent on the attribute word.
In addition, when the subject is the entity word, the modifier word is the attribute word, and the dependency relationship is a dependency relationship in which the modifier word depends on the subject, the query word string conforms to a forward expression syntax rule.
In this case, the dependency in which the modifier depends on the subject is the dependency of the attribute word on the entity word. For example, in the query word string "who is the father of Xie Tingfeng", the attribute word "father" depends on the entity word "Xie Tingfeng", so "who is the father of Xie Tingfeng" conforms to the syntactic rule of forward expression. In the query word string "whose son is Xie Tingfeng", the entity word "Xie Tingfeng" depends on the attribute word "son", so "whose son is Xie Tingfeng" conforms to the syntactic rule of reverse expression. In the PCFG model, a dependency means that the current object cannot exist independently of a certain other object. For example, in the query word string "who is the father of Xie Tingfeng", "father" cannot exist independently of "Xie Tingfeng", so "father" depends on "Xie Tingfeng"; conversely, "Xie Tingfeng" can exist independently of "father".
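The forward/reverse test in sub-step S342 can be sketched as a check on the direction of a single dependency. The sketch assumes the parser has already produced the dependency; the `Dependency` type, the word sets and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    dependent: str  # the word that depends on the head
    head: str       # the word being depended on


def classify_expression(dependency, entity_words, attribute_words):
    """Classify a query as a forward or reverse expression by the
    direction of the dependency between entity and attribute word."""
    if dependency.dependent in entity_words and dependency.head in attribute_words:
        return "reverse"   # entity word depends on attribute word
    if dependency.dependent in attribute_words and dependency.head in entity_words:
        return "forward"   # attribute word depends on entity word
    return "unknown"


# Hypothetical word sets.
ENTITIES = {"Xie Tingfeng"}
ATTRIBUTES = {"father", "son"}
```

Only queries classified as "reverse" go on to the sentence pattern rewrite; other queries keep their sentence pattern.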
Substep S35, obtaining a preset expression corresponding to the syntax rule which is stored in the server and accords with the forward expression; the preset expression has use frequency;
In a specific implementation, the correspondence between forward expressions and reverse expressions can be obtained by mining web pages on the Internet based on the knowledge base. Based on text pairs of knowledge base entities and attribute values, all forward expressions and reverse expressions of entity attributes on the Internet are mined through a machine translation model.
And a substep S36, rewriting the query word string according to the syntactic rule of forward expression when the use frequency of the preset expression is higher than a preset threshold value.
In the embodiment of the application, the use frequency of various forward expression expressions can be counted, and the forward expression with the use frequency higher than the predictive threshold value is used as a friendly sentence pattern of the search engine.
In a specific implementation, the dependency in which the entity word depends on the attribute word can be rewritten into the dependency in which the attribute word depends on the entity word, so that the query word string is rewritten into one conforming to the forward expression syntax rule.
For example, as shown in fig. 3, for the query word string "whose son is Xie Tingfeng", the entity word "Xie Tingfeng" depends on the attribute word "son". Syntax tree analysis reveals the relationship between the entity word and the attribute word, and the corresponding forward expression and its usage frequency are looked up in the pre-built correspondence table of reverse and forward expressions in the knowledge base. The syntax rule of the reverse expression in this example is "whose <attribute_person_son> is <entity_person>", and the syntax rule of the corresponding forward expression is "who is the <attribute_person_father> of <entity_person>". Furthermore, by looking up the translation dictionary, the identification entity word of the search engine corresponding to the entity word "Xie Tingfeng" is obtained as "Xie Tingfeng", and the search-engine-friendly word (namely, the identification attribute word) corresponding to the attribute word "<attribute_person_father>" is obtained in the same way. The identification entity word and the identification attribute word are then used to rewrite the query according to the forward expression syntax rule, giving the final rewritten query word string "who is the father of Xie Tingfeng", which replaces the original "whose son is Xie Tingfeng" in the search, so that web pages related to Xie Tingfeng are obtained.
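Sub-steps S35 and S36 can be sketched as a table lookup followed by a frequency check. The table contents, the frequency value, the threshold, and the placeholder entity name `PERSON_A` below are all hypothetical; the description does not specify how the correspondence table is stored, so a plain dictionary is assumed here.

```python
from typing import Dict, NamedTuple, Optional


class ForwardExpression(NamedTuple):
    template: str    # forward expression syntax rule
    frequency: int   # usage frequency counted for this expression


# Hypothetical correspondence table: reverse expression -> forward expression.
CORRESPONDENCE: Dict[str, ForwardExpression] = {
    "whose <attribute_person_son> is <entity_person>": ForwardExpression(
        "who is the <attribute_person_father> of <entity_person>", 950
    ),
}


def rewrite_to_forward(reverse_rule: str,
                       bindings: Dict[str, str],
                       threshold: int) -> Optional[str]:
    """Rewrite a reverse expression, or return None if no rewrite applies."""
    entry = CORRESPONDENCE.get(reverse_rule)
    # Rewrite only when the forward expression is used often enough.
    if entry is None or entry.frequency <= threshold:
        return None
    rewritten = entry.template
    # Fill each semantic tag with its search-engine-friendly word.
    for tag, word in bindings.items():
        rewritten = rewritten.replace(tag, word)
    return rewritten


bindings = {"<entity_person>": "PERSON_A", "<attribute_person_father>": "father"}
print(rewrite_to_forward("whose <attribute_person_son> is <entity_person>", bindings, 500))
```

When the lookup fails or the frequency does not exceed the threshold, the caller would keep the original query word string.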
Note that, the rewrite of entity words (corresponding to sub-step S31 and sub-step S32), the rewrite of attribute words (corresponding to sub-step S33), and the rewrite of sentence patterns (corresponding to sub-step S34, sub-step S35, and sub-step S36) may be used individually or in combination of two or three, and the embodiment of the present application is not limited thereto.
Step 204, searching with the rewritten query word string to obtain matched network information.
After rewriting of the query word string is completed, retrieval and matching of network information can be performed.
As shown in fig. 4, by applying the embodiment of the present application, the query word string "whose son is Xie Tingfeng" input by the user can be rewritten to "who is the father of Xie Tingfeng", and the search is then performed based on "who is the father of Xie Tingfeng". Compared with the search result shown in fig. 2, the information returned by the embodiment of the present application better matches the user's needs.
In the present application, semantic analysis is performed on the natural language in the query word string to obtain a semantic expression, the semantic label to which each word belongs in the current context is then determined, and the query word string is rewritten based on the semantic labels. The rewritten query better matches the user's intention and raises the success rate of information matching during search, which improves search quality and efficiency and, in turn, the user experience.
In the present application, entity words and attribute words can be rewritten into search-engine-friendly identification entity words and identification attribute words, and uncommon, reversely expressed query word strings can be rewritten into common, forwardly expressed query word strings. This improves the search engine's coverage of search information and further raises the success rate of information matching.
To help those skilled in the art better understand the application, an example is provided below to illustrate how the embodiments of the application are applied to the query word string "where is Anqiu".
1. Performing semantic analysis on the query word string "where is Anqiu" in combination with a knowledge base, which comprises the following steps:
Entity word analysis: by querying the entity word list in the knowledge base, "Anqiu" is identified as an entity word whose types (original semantic labels) are "person" and "place name", giving the semantic expression "Anqiu <entity_person> <entity_place>";
Attribute word analysis: by querying the attribute word list in the knowledge base, "where" is identified as an attribute word of type "place", and it is marked with the semantic label "where <attribute_place_position>";
The semantic expression corresponding to the query word string is "Anqiu <entity_person> <entity_place> where <attribute_place_position>".
3. Performing matching analysis in combination with the semantic expression: first, syntactic analysis is carried out, showing that the attribute word "where" depends on the entity word "Anqiu", which has two types: "person" and "place name". By checking type consistency between the entity word and the attribute word, the common type of the attribute word "where" and the entity word "Anqiu" is found to be <place>, so the semantic label of the current entity word "Anqiu" is determined to be "place". The result after semantic label analysis is "Anqiu <entity_place> where <attribute_place_position>";
4. Rewriting the query word string according to the semantic label:
a) querying the search-engine-friendly identification entity word and identification attribute word corresponding to the entity word and the attribute word. By looking up the translation dictionary, the identification entity word "Anqiu city" corresponding to the entity word "Anqiu" and the identification attribute word "geographical position" corresponding to the attribute word "where" are obtained;
b) replacing the entity and the attribute in the query word string with the search-engine-friendly words (namely, the identification entity word and the identification attribute word) to obtain the rewritten query word string "Anqiu city geographical position";
5. Searching with the rewritten query word string "Anqiu city geographical position" and returning the result to the user.
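The worked example above can be traced in a few lines of code. This is a minimal sketch under stated assumptions: the three dictionaries stand in for the knowledge base word lists and the translation dictionary, their layout is hypothetical, and syntactic analysis is assumed to have already paired the attribute word with the entity word.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical knowledge base: entity word -> original semantic labels.
ENTITY_TYPES: Dict[str, Set[str]] = {"Anqiu": {"person", "place"}}
# Hypothetical knowledge base: attribute word -> type of entity it describes.
ATTRIBUTE_TYPES: Dict[str, str] = {"where": "place"}
# Hypothetical translation dictionary keyed by (word, semantic label).
TRANSLATION: Dict[Tuple[str, str], str] = {
    ("Anqiu", "place"): "Anqiu city",
    ("where", "place"): "geographical position",
}


def resolve_entity_tag(entity: str, attribute: str) -> Optional[str]:
    """Type consistency check: keep the entity label the attribute shares."""
    attribute_type = ATTRIBUTE_TYPES.get(attribute)
    if attribute_type in ENTITY_TYPES.get(entity, set()):
        return attribute_type
    return None


def rewrite_query(entity: str, attribute: str) -> str:
    """Replace both words with their search-engine-friendly forms."""
    tag = resolve_entity_tag(entity, attribute)
    if tag is None:
        # No common type: leave the words unchanged.
        return f"{entity} {attribute}"
    new_entity = TRANSLATION.get((entity, tag), entity)
    new_attribute = TRANSLATION.get((attribute, tag), attribute)
    return f"{new_entity} {new_attribute}"


print(resolve_entity_tag("Anqiu", "where"))
print(rewrite_query("Anqiu", "where"))
```

Keying the translation dictionary by both word and label is a design assumption; it lets the same surface word map to different identification words depending on the resolved semantic label.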
It is to be appreciated that while for simplicity of explanation, certain example method embodiments are described as a series of acts, those skilled in the art will appreciate that the example embodiments are not limited by the order of acts described, as some steps may occur in other orders and concurrently depending on the example embodiments. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the embodiments of the application.
Referring to fig. 5, a block diagram of a search system according to an embodiment of the present application is shown, which may specifically include the following modules:
a part-of-speech analysis module 501, configured to perform semantic analysis on a query word string when the query word string is received, to obtain a semantic expression corresponding to the query word string;
a semantic tag determining module 502, configured to perform matching analysis in combination with the semantic expression, and determine a semantic tag to which each word in the current query word string belongs;
a rewriting module 503, configured to rewrite the query word string according to the semantic tag;
and the query module 504 is configured to search by using the rewritten query word string to obtain the matched network information.
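The way the four modules hand data to one another can be sketched as a simple pipeline. The class name, field names, and the data types passed between stages are assumptions made for illustration; the description only fixes the order of modules 501 to 504.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class SearchSystem:
    # Module 501: query word string -> semantic expression (here, a word list).
    analyze: Callable[[str], Sequence[str]]
    # Module 502: semantic expression -> semantic tag per word.
    determine_tags: Callable[[Sequence[str]], Dict[str, str]]
    # Module 503: query word string + tags -> rewritten query word string.
    rewrite: Callable[[str, Dict[str, str]], str]
    # Module 504: rewritten query word string -> matched network information.
    query: Callable[[str], List[str]]

    def search(self, query_string: str) -> List[str]:
        expression = self.analyze(query_string)
        tags = self.determine_tags(expression)
        rewritten = self.rewrite(query_string, tags)
        return self.query(rewritten)


# Stub modules, used only to show the call order.
system = SearchSystem(
    analyze=lambda q: q.split(),
    determine_tags=lambda words: {w: "untagged" for w in words},
    rewrite=lambda q, tags: q.upper(),
    query=lambda q: [f"result for {q}"],
)
print(system.search("where is Anqiu"))
```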
In a preferred embodiment of the present application, the part-of-speech analysis module 501 may include the following sub-modules:
the entity word searching module is used for searching entity words corresponding to the query word string in an entity word list preset in a knowledge base;
and the attribute word searching module is used for searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base.
In a preferred embodiment of the present application, the semantic tag determination module 502 may include the following sub-modules:
the extraction submodule is used for extracting the preset semantic tags of the attribute words;
a marking submodule for marking the entity word with one or more original semantic tags;
the association relationship judging submodule is used for respectively judging whether the entity words marked with the original semantic labels have a predefined association relationship with the attribute words marked with the semantic labels; if yes, calling a determining submodule;
and the determining submodule is used for determining that the original semantic label with the predefined association relationship is the semantic label to which the current entity word belongs.
In a preferred embodiment of the present application, the rewrite module 503 may include the following sub-modules:
the identification entity word searching submodule is used for searching preset identification entity words by adopting the semantic tags;
the identification entity word replacing submodule is used for replacing the entity words with preset identification entity words;
and/or,
the identification attribute word replacing submodule is used for replacing the attribute words with preset identification attribute words;
and/or,
the reverse expression judging submodule is used for judging whether the query word string accords with a reverse expression syntactic rule or not; if yes, calling a preset expression obtaining submodule;
the preset expression obtaining submodule is used for obtaining a corresponding preset expression which is stored in the server and accords with the forward expression syntax rule; the preset expression has use frequency;
and the forward expression rewriting submodule is used for rewriting the query word string according to a forward expression syntactic rule when the use frequency of the preset expression is higher than a preset threshold value.
In a preferred embodiment of the present application, the identified entity word may be an entity word that has the same semantic tag as the entity word and is used most frequently;
the identification attribute words may be attribute words which describe the same type of entity words and are used most frequently.
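Choosing the most frequently used word among candidates that share a semantic tag can be sketched as below. The candidate list and the usage counts are hypothetical values, and how candidates are grouped by semantic tag is assumed to be handled elsewhere.

```python
from collections import Counter
from typing import Iterable, Optional


def pick_identification_word(candidates: Iterable[str],
                             usage: Counter) -> Optional[str]:
    """Return the candidate with the highest usage count, if any."""
    words = list(candidates)
    if not words:
        return None
    # Counter returns 0 for words that were never counted.
    return max(words, key=lambda word: usage[word])


# Hypothetical usage counts for words sharing the semantic tag "place".
usage = Counter({"Anqiu": 120, "Anqiu city": 340})
print(pick_identification_word(["Anqiu", "Anqiu city"], usage))
```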
In a preferred embodiment of the present application, the reverse expression judging sub-module further includes the following sub-modules:
the syntax analysis submodule is used for carrying out syntax analysis on the query word string to obtain a subject and a modifier and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
and the judging submodule is used for judging that the query word string conforms to the syntactic rule of the reverse expression when the subject is the entity word, the modifier word is the attribute word and the dependency relationship is the dependency relationship of the subject on the modifier word.
For the system embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application is preferably applied to embedded systems.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
The above detailed description is provided for a search method and a search system, and the principles and embodiments of the present application are explained in detail by applying specific examples, and the descriptions of the above embodiments are only used to help understand the method and the core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (12)

1. A method of searching, comprising:
when a query word string is received, performing semantic analysis on the query word string to obtain a semantic expression corresponding to the query word string, wherein the semantic analysis identifies entity words and attribute words in the query word string;
performing matching analysis in combination with the semantic expression, and determining the semantic label of each word in the current query word string, wherein, based on a knowledge base, syntactic analysis is performed on the query word string in which the entity words and attribute words have been identified, to obtain the association relationship between the entity words and the attribute words, and the semantic label of the entity word that conforms to the current context is identified;
rewriting the query word string according to the semantic tag;
searching by the rewritten query word string to obtain matched network information;
the step of performing syntactic analysis on the query word string with the entity words and the attribute words identified based on the knowledge base to obtain the association relationship between the entity words and the attribute words comprises the following steps:
defining a grammar rule;
performing syntactic analysis on the semantic expression by using the grammar rule to obtain the association relation between entity words and attribute words in the semantic expression;
wherein the step of rewriting the query word string according to the semantic tag comprises:
judging whether the query word string conforms to the reverse expression syntax rule; if yes, obtaining the corresponding preset expression, stored in the server, that conforms to the forward expression syntax rule; the preset expression has a usage frequency;
and when the usage frequency of the preset expression is higher than a preset threshold, rewriting the query word string according to the forward expression syntax rule.
2. The method according to claim 1, wherein the step of performing semantic analysis on the query word string to obtain the semantic expression corresponding to the query word string when receiving the query word string comprises:
searching entity words corresponding to the query word string in an entity word list preset in a knowledge base;
and searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base.
3. The method of claim 2, wherein the step of determining the semantic label to which each word in the current query word string belongs comprises:
extracting preset semantic tags of the attribute words;
marking one or more original semantic labels on the entity words;
respectively judging whether the entity words marked with the original semantic tags have a predefined association relationship with the attribute words marked with the semantic tags; if so, determining that the original semantic label with the predefined association relationship is the semantic label to which the entity word belongs.
4. The method of claim 1, 2 or 3, wherein the step of rewriting the query word string according to the semantic tag further comprises:
searching preset identification entity words by adopting the semantic tags;
replacing the entity words with preset identification entity words;
and/or,
and replacing the attribute words with preset identification attribute words.
5. The method of claim 4, wherein the identifying entity words are entity words having the same semantic label as the entity words and used most frequently;
the identification attribute words are attribute words which describe the same kind of entity words and are used most frequently.
6. The method of claim 4, wherein said step of determining whether said query string complies with a syntactic rule of reverse expression comprises:
performing syntactic analysis on the query word string to obtain a subject and a modifier, and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
and when the subject is the entity word, the modifier is the attribute word, and the dependency relationship is the dependency relationship of the subject depending on the modifier, the query word string conforms to the syntactic rule of reverse expression.
7. A search system, comprising:
the part-of-speech analysis module is used for performing semantic analysis on the query word string when the query word string is received, to obtain a semantic expression corresponding to the query word string, wherein the semantic analysis identifies entity words and attribute words in the query word string;
the semantic tag determining module is used for performing matching analysis in combination with the semantic expression and determining the semantic tag of each word in the current query word string, wherein, based on a knowledge base, syntactic analysis is performed on the query word string in which the entity words and attribute words have been identified, to obtain the association relationship between the entity words and the attribute words, and the semantic label of the entity word that conforms to the current context is identified;
the rewriting module is used for rewriting the query word string according to the semantic label;
the query module is used for searching by using the rewritten query word string to obtain matched network information;
wherein the semantic tag determination module is further configured to:
defining a grammar rule;
performing syntactic analysis on the semantic expression by using the grammar rule to obtain the association relation between entity words and attribute words in the semantic expression;
wherein the rewrite module includes:
the reverse expression judging submodule is used for judging whether the query word string accords with a reverse expression syntactic rule or not; if yes, calling a preset expression obtaining submodule;
the preset expression obtaining submodule is used for obtaining a corresponding preset expression which is stored in the server and accords with the forward expression syntax rule; the preset expression has use frequency;
and the forward expression rewriting submodule is used for rewriting the query word string according to a forward expression syntactic rule when the use frequency of the preset expression is higher than a preset threshold value.
8. The system of claim 7, wherein the part of speech parsing module comprises:
the entity word searching module is used for searching entity words corresponding to the query word string in an entity word list preset in a knowledge base;
and the attribute word searching module is used for searching the attribute words corresponding to the query word string in an attribute word list preset in a knowledge base.
9. The system of claim 8, wherein the semantic tag determination module comprises:
the extraction submodule is used for extracting the preset semantic tags of the attribute words;
a marking submodule for marking the entity word with one or more original semantic tags;
the association relationship judging submodule is used for respectively judging whether the entity words marked with the original semantic labels have a predefined association relationship with the attribute words marked with the semantic labels; if yes, calling a determining submodule;
and the determining submodule is used for determining that the original semantic label with the predefined association relationship is the semantic label to which the current entity word belongs.
10. The system of claim 7, 8 or 9, wherein the rewrite module further comprises:
the identification entity word searching submodule is used for searching preset identification entity words by adopting the semantic tags;
the identification entity word replacing submodule is used for replacing the entity words with preset identification entity words;
and/or,
and the identification attribute word replacing submodule is used for replacing the attribute words with preset identification attribute words.
11. The system of claim 10, wherein the identifying entity words are entity words having the same semantic label as the entity word and used most frequently;
the identification attribute words are attribute words which describe the same kind of entity words and are used most frequently.
12. The system of claim 10, wherein the reverse expression decision sub-module comprises:
the syntax analysis submodule is used for carrying out syntax analysis on the query word string to obtain a subject and a modifier and a dependency relationship between the subject and the modifier; the dependency relationship comprises a dependency relationship that the subject depends on the modifier;
and the judging submodule is used for judging that the query word string conforms to the syntactic rule of the reverse expression when the subject is the entity word, the modifier is the attribute word and the dependency relationship is the dependency relationship of the subject on the modifier.
CN201410051875.4A 2014-02-14 2014-02-14 Searching method and system Active CN104850554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410051875.4A CN104850554B (en) 2014-02-14 2014-02-14 Searching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410051875.4A CN104850554B (en) 2014-02-14 2014-02-14 Searching method and system

Publications (2)

Publication Number Publication Date
CN104850554A CN104850554A (en) 2015-08-19
CN104850554B true CN104850554B (en) 2020-05-19

Family

ID=53850201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410051875.4A Active CN104850554B (en) 2014-02-14 2014-02-14 Searching method and system

Country Status (1)

Country Link
CN (1) CN104850554B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138631B (en) * 2015-08-20 2019-10-11 小米科技有限责任公司 The construction method and device of knowledge base
CN107203548A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Attribute acquisition methods and device
CN106294638B (en) * 2016-08-02 2020-01-14 百度在线网络技术(北京)有限公司 Auxiliary decision making method and device
CN106227876B (en) * 2016-08-02 2020-03-10 百度在线网络技术(北京)有限公司 Activity arrangement aided decision-making method and device
CN106528676B (en) * 2016-10-31 2019-09-03 北京百度网讯科技有限公司 Entity Semantics search processing method and device based on artificial intelligence
WO2018211670A1 (en) * 2017-05-18 2018-11-22 三菱電機株式会社 Search device, tag generator, query generator, secret search system, search program, tag generation program, and query generation program
US11132408B2 (en) * 2018-01-08 2021-09-28 International Business Machines Corporation Knowledge-graph based question correction
CN108256070B (en) * 2018-01-17 2022-07-15 北京百度网讯科技有限公司 Method and apparatus for generating information
CN108388650B (en) * 2018-02-28 2022-11-04 百度在线网络技术(北京)有限公司 Search processing method and device based on requirements and intelligent equipment
CN108959257B (en) * 2018-06-29 2022-11-22 北京百度网讯科技有限公司 Natural language parsing method, device, server and storage medium
CN109558479B (en) * 2018-11-29 2022-12-02 出门问问创新科技有限公司 Rule matching method, device, equipment and storage medium
CN109684448B (en) * 2018-12-17 2021-01-12 北京北大软件工程股份有限公司 Intelligent question and answer method
CN109684357B (en) * 2018-12-21 2021-03-19 上海智臻智能网络科技股份有限公司 Information processing method and device, storage medium and terminal
CN109857853B (en) * 2019-01-28 2021-09-14 掌阅科技股份有限公司 Searching method based on electronic book, electronic equipment and computer storage medium
CN111666479A (en) * 2019-03-06 2020-09-15 富士通株式会社 Method for searching web page and computer readable storage medium
CN113919360A (en) * 2020-07-09 2022-01-11 阿里巴巴集团控股有限公司 Semantic understanding method, voice interaction method, device, equipment and storage medium
CN113807102B (en) * 2021-08-20 2022-11-01 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for establishing semantic representation model
CN113868312A (en) * 2021-10-13 2021-12-31 上海市研发公共服务平台管理中心 Multi-method fused mechanism matching method, device, equipment and storage medium
CN115576435B (en) * 2022-12-12 2023-04-04 深圳市人马互动科技有限公司 Intention processing method and related device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1628298A (en) * 2002-05-28 2005-06-15 弗拉迪米尔·叶夫根尼耶维奇·涅博利辛 Method for synthesising self-learning system for knowledge acquistition for retrieval systems
US7840547B1 (en) * 2004-03-31 2010-11-23 Google Inc. Methods and systems for efficient query rewriting
CN102117285A (en) * 2009-12-30 2011-07-06 安世亚太科技(北京)有限公司 Search method based on semantic indexing
CN102236664A (en) * 2010-04-28 2011-11-09 百度在线网络技术(北京)有限公司 Retrieval system, retrieval method and information processing method based on semantic normalization
CN102622342A (en) * 2011-01-28 2012-08-01 上海肇通信息技术有限公司 Interlanguage system and interlanguage engine and interlanguage translation system and corresponding method
CN103425714A (en) * 2012-05-25 2013-12-04 北京搜狗信息服务有限公司 Query method and system

Also Published As

Publication number Publication date
CN104850554A (en) 2015-08-19


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant