CN110209765B - Method and device for searching keywords according to meanings - Google Patents

Info

Publication number
CN110209765B
Authority
CN
China
Prior art keywords
word
probability
matching result
searched
context information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910433774.6A
Other languages
Chinese (zh)
Other versions
CN110209765A (en)
Inventor
程波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd
Priority to CN201910433774.6A
Publication of CN110209765A
Application granted
Publication of CN110209765B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of semantic search and provides a method and a device for searching keywords according to their semantic meaning. The method comprises: splitting the context information content in the initial matching result according to a preset splitting rule to obtain at least two groups of entry objects; acquiring the corresponding word jump probability table according to the attribute information of the target object to be searched; looking up the word jump probability table in the order of the entries contained in each group of entry objects to obtain the establishment probability of each group; and screening the initial matching result according to the establishment probability of each group of entries to obtain the screened matching result. The semantic judgment method adopted by the invention has simple and clear logic and, after long-term verification, high accuracy.

Description

Method and device for searching keywords according to meanings
[ technical field ]
The invention relates to the technical field of semantic search, in particular to a method and a device for searching keywords according to the semantic meaning.
[ background of the invention ]
In internet applications and traffic monitoring projects, keyword search scenarios are common. For example, in financial news, if the content contains the name of a certain stock or fund, the current price is automatically displayed after the name; as another example, in a traffic monitoring project, web pages containing a certain keyword need to be blocked. All of these tasks require keyword search over content.
However, matching only on the binary representation of the characters, without regard to their meaning, can produce unexpected results. For example, when web pages containing the two-character keyword for "China" are to be blocked, a science fiction novel containing the sentence "the concept of the country in the constantan civilization does not exist at all" would also be blocked, because the characters of the keyword appear consecutively in that sentence even though the word itself is not meant; this is obviously not what the person who set the blocking policy wants. There is also the case that, because character search is essentially a comparison of binary data in the computer, the binary data corresponding to a keyword found in traffic may be a mere coincidence; for example, the hit bytes may be an integer value rather than text, and counting this as a hit can cause unexpected problems. As another example, an article on a general-interest website contains the phrase "continuously increasing the proportion of low-grade products in industrial and agricultural products"; the three characters of "agricultural products" in it are often highlighted and followed by the quote of a stock named "agricultural products", which is obviously inappropriate. There are many such examples, and the root cause is that the semantics of the keyword are not considered during the search. Of course, a mature word segmentation method could be used to segment the whole article and then search for the keyword among the resulting words; the semantics would then be correct, but the implementation is complex and the efficiency is extremely low.
In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.
[ summary of the invention ]
The technical problem to be solved by the invention is that keyword search methods in the prior art easily return results whose semantics do not match, while the improved semantics-based search methods are complex to implement and inefficient.
A further technical problem to be solved by the invention is how to identify target search results more effectively in a big data analysis environment.
The invention adopts the following technical scheme:
In a first aspect, the present invention provides a method for searching keywords semantically. A keyword to be searched and the traffic data of each target object to be searched are obtained, and an initial matching result is obtained by matching the keyword against the traffic data, wherein the initial matching result includes the context information content corresponding to the keyword in each piece of traffic data. The method includes:
splitting the context information content in the initial matching result according to a preset splitting rule to obtain at least two groups of entry objects;
acquiring the corresponding word jump probability table according to the attribute information of the target object to be searched;
looking up the word jump probability table in the order of the entries contained in each group of entry objects to obtain the establishment probability of each group of entry objects;
and screening the initial matching result according to the establishment probability of each group of entries to obtain the screened matching result.
Preferably, the keyword to be searched is X_1, X_2, …, X_{n-1}, X_n, where X_i represents a character and i ∈ [1, n]. The preset splitting rule specifically comprises:
splitting the context information content in the matching result according to at least two splitting modes to obtain at least two groups of entry objects; wherein the splitting modes include:
Splitting mode one: in the context information content, match against the lexicon the entry formed by X_1 together with the character(s) preceding it; if it matches, the entry is denoted W_2; if not, X_1 alone is treated as a word and denoted W_2. Continue to look for a word before W_2 in the context information content, denoted W_1. X_2, …, X_{n-1}, X_n is denoted W_3. In the context information content, look for a word after X_2, …, X_{n-1}, X_n, denoted W_4. A group of entry objects, denoted W_1 W_2 W_3 W_4, is thus obtained.
Splitting mode two: in the context information content, look for a word before X_1, X_2, …, X_{n-1}, denoted C_1. X_1, X_2, …, X_{n-1} is denoted C_2. Match X_n combined with the characters that follow it, and take the longest matched word, denoted C_3. Continue to look for a word after C_3, denoted C_4. A group of entry objects, denoted C_1 C_2 C_3 C_4, is thus obtained.
Splitting mode three: take X_1, X_2, …, X_{n-1}, X_n as one word, denoted M_2. In the context information content, look for a word before X_1, denoted M_1. In the context information content, look for two words after X_n, denoted M_3 and M_4. A group of entry objects, denoted M_1 M_2 M_3 M_4, is thus obtained.
Splitting mode four: take X_1, X_2, …, X_{n-1}, X_n as one word, denoted N_3. In the context information content, look for two words before X_1, denoted N_1 and N_2. In the context information content, look for a word after X_n, denoted N_4. A group of entry objects, denoted N_1 N_2 N_3 N_4, is thus obtained.
Preferably, looking for a word before X_1 or looking for a word after X_n is implemented as follows:
in the context information content, starting from the initial reference object of the search, the length of the run of consecutive characters is increased one character at a time and each run is matched against the lexicon; when a run fails to match, the run from the previous round is taken as the word found before X_1 or after X_n;
wherein the initial reference object comprises X_1 or X_n.
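The incremental lexicon lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a Python set as the lexicon, and the fallback of treating a single unmatched character as a word are all assumptions made for the example.

```python
from typing import AbstractSet


def find_word_after(text: str, start: int, lexicon: AbstractSet[str]) -> str:
    """Grow a run of characters rightwards from `start`, one character per
    round, and return the last run that was found in the lexicon."""
    best = ""
    length = 1
    while start + length <= len(text):
        candidate = text[start:start + length]
        if candidate not in lexicon:
            break  # first miss: keep the run from the previous round
        best = candidate
        length += 1
    # Assumption: if not even one character matches, use that character alone.
    return best or text[start:start + 1]


def find_word_before(text: str, end: int, lexicon: AbstractSet[str]) -> str:
    """Grow a run of characters leftwards, ending just before index `end`."""
    best = ""
    length = 1
    while end - length >= 0:
        candidate = text[end - length:end]
        if candidate not in lexicon:
            break
        best = candidate
        length += 1
    return best or text[max(end - 1, 0):end]


# Hypothetical lexicon and sentence, chosen only for illustration.
lexicon = {"我", "爱", "北", "京", "北京", "天", "天安", "天安门", "门"}
text = "我爱北京天安门"
word_right = find_word_after(text, 2, lexicon)
word_left = find_word_before(text, 4, lexicon)
```

Because the loop stops at the first miss, a word whose shorter prefixes (or suffixes, when searching leftwards) are absent from the lexicon will not be found; that follows the rule as worded in the text.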
Preferably, screening the initial matching result according to the establishment probability of each group of entries to obtain the screened matching result specifically includes:
if the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4 is less than the probability of W_1 W_2 W_3 W_4 and/or C_1 C_2 C_3 C_4, removing the corresponding target object from the initial matching result;
if the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4 is greater than or equal to the probability of W_1 W_2 W_3 W_4 and/or C_1 C_2 C_3 C_4, retaining the target object in the screened matching result.
Preferably, if the process of obtaining the initial matching result and the process of obtaining the screened matching result are executed in parallel, the method further includes:
analyzing the attribute information of each target object contained in the screened matching result to obtain a distribution map of the target objects, wherein the regions of the map are delimited by the attribute information;
for first attribute information whose region contains a proportion of target objects exceeding a preset threshold, adding a weighting value when subsequently calculating the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4, so that target objects belonging to the first attribute information have a higher probability of passing the screening.
Preferably, when the target object to be searched is a web page, the attribute information of the target object to be searched specifically includes one or more of the website topic type, the web page title content and the web page text classification.
Preferably, the website topic type comprises one or more of news, finance, sports, entertainment and general;
the web page text classification comprises one or more of prose, narrative text and comprehensive text.
Preferably, the word jump probability table is specifically obtained as follows:
analyzing the traffic data of potential target objects through big data analysis, and obtaining the part of speech of each entry in the corresponding traffic data by lexicon matching; wherein the parts of speech include one or more of noun, verb, adjective, adverb, preposition, sentence start, sentence end and punctuation mark;
wherein the jump probability table records the probability of a jump from an entry of one part of speech to the following entry of another part of speech.
Preferably, matching the keyword to be searched against the traffic data to obtain an initial matching result specifically includes:
converting the keyword into search codes in UTF-8, GB2312 and/or BIG5, and matching the traffic data of each target object to be searched against the search codes one by one to obtain the initial matching result.
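The byte-level initial matching can be illustrated with a short sketch. It assumes the traffic data is available as a `bytes` object; the function names and the choice to skip an encoding that cannot represent the keyword are assumptions of this example, not details taken from the patent.

```python
from typing import List, Tuple

# Encodings named in the text; these are standard Python codec names.
ENCODINGS = ("utf-8", "gb2312", "big5")


def encode_keyword(keyword: str) -> List[Tuple[str, bytes]]:
    """Convert the keyword into one byte pattern per encoding."""
    patterns = []
    for encoding in ENCODINGS:
        try:
            patterns.append((encoding, keyword.encode(encoding)))
        except UnicodeEncodeError:
            # Assumption: an encoding that cannot represent the keyword is skipped.
            continue
    return patterns


def initial_match(keyword: str, traffic: bytes) -> List[Tuple[str, int]]:
    """Return (encoding, byte offset) for every occurrence of the keyword."""
    hits = []
    for encoding, pattern in encode_keyword(keyword):
        offset = traffic.find(pattern)
        while offset != -1:
            hits.append((encoding, offset))
            offset = traffic.find(pattern, offset + 1)
    return hits


# Hypothetical traffic payload: a GB2312-encoded sentence containing the keyword.
traffic = "学习中文".encode("gb2312")
hits = initial_match("中文", traffic)
```

A hit at this stage only says that the bytes are present; whether the hit is semantically valid is decided by the later splitting and probability steps.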
Preferably, when the number of characters of the keyword exceeds a preset value, before splitting the context information content in the initial matching result according to the preset splitting rule to obtain at least two groups of entry objects, the method includes:
obtaining the part-of-speech combination of the keyword by lexicon matching;
and obtaining, according to the part-of-speech combination, the weighting value used in the probability calculation of each group of entry objects for each piece of attribute information.
In a second aspect, the present invention further provides an apparatus for semantic searching keywords, which is used to implement the method for semantic searching keywords in the first aspect, and the apparatus includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor and programmed to perform the method of semantically searching for keywords according to the first aspect.
In a third aspect, the present invention also provides a non-transitory computer storage medium storing computer-executable instructions for execution by one or more processors for performing the method for semantic searching for keywords according to the first aspect.
Compared with the prior art, the method for judging the semantics has the advantages that the logic is simple and clear, and the accuracy is high after long-time verification.
The traditional method segments the whole article or whole sentence into words and then searches the resulting word set. The invention instead analyzes the keyword in advance, finds the keyword by binary matching, and then judges whether the found content is semantically consistent; this overall flow is more efficient, and its performance cost depends mainly on the keyword hit rate.
In the preferred scheme of the invention, in the searching process, the attribute information of the searched target object is also dynamically collected and sorted, so that a weighted value with more referential meaning is provided for the subsequent calculation process, and the searching accuracy is further improved.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for searching keywords semantically according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating the presentation of context in the initial matching result according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a splitting mode according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another splitting mode according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another splitting mode according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another splitting mode according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the probability calculation for a splitting mode according to an embodiment of the present invention;
Fig. 8 is a flowchart of a method for using part-of-speech weighting values for a long keyword according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an apparatus for searching keywords semantically according to an embodiment of the present invention.
[ detailed description of embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Before implementing embodiments of the present invention, it is usually required to perform some conventional operations in retrieving keywords, such as: the method comprises the steps of obtaining keywords to be searched and flow data of target objects to be searched, and obtaining an initial matching result by matching the keywords to be searched and the flow data, wherein the initial matching result comprises context information content corresponding to the keywords to be searched in the flow data.
In the embodiment of the present invention, the traffic data of the target object to be searched may be represented by various web portals and web page contents that can be acquired in the internet, and various media contents that are acquired through internet channels and contain a text expression form.
Matching the keyword to be searched against the traffic data can be done with any existing search matching algorithm; an implementation preferred in the invention is described in Embodiment 1. The core of the invention lies in what happens after the initial matching is complete and the context information content corresponding to the keyword in each piece of traffic data has been obtained: using semantic analysis to distinguish the matching results that fit the search intention from those that do not, so as to obtain the screened matching results and reduce the time that a person browsing the results wastes on meaningless matches.
Example 1:
Embodiment 1 of the present invention provides a method for searching keywords semantically. Based on the initial matching result obtained above, and as shown in Fig. 1, the method includes:
In step 201, the context information content in the initial matching result is split according to a preset splitting rule to obtain at least two groups of entry objects.
In the embodiment of the invention, the preset splitting rule is first defined at a basic level: any keyword composed of one or more characters is preliminarily split in the following three ways: 1. split into "head character" + "remaining characters"; 2. split into "remaining characters" + "tail character"; 3. not split, with the "complete characters" retained. In the second splitting step, the context information content of the keyword is combined with the preliminary splitting result to generate entry objects in a uniform format. An entry object combination may contain 2 entries, 3 entries, 4 entries, and so on. Experimental verification shows, however, that combinations of 2 or 3 entries cannot effectively capture the word jump characteristics of the context in which the keyword sits, while in combinations of more than 4 entries some entries are too far from the keyword and only weakly related to it, so they contribute little to judging the keyword's semantics. The embodiment of the invention therefore preferably forms each entry object combination from 4 entries. The implementation of the entry object combinations is explained in detail later in this embodiment.
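The three preliminary splits can be written down in a few lines. The sketch below is illustrative only; the function name is hypothetical, and returning just the unsplit form for a one-character keyword is an assumption, since the text does not say how that case is handled.

```python
from typing import List, Tuple


def preliminary_splits(keyword: str) -> List[Tuple[str, ...]]:
    """Return the three preliminary splits of a keyword:
    head + remainder, remainder + tail, and the unsplit keyword."""
    if len(keyword) < 2:
        # Assumption: a single character cannot be split, so only the
        # complete form is returned.
        return [(keyword,)]
    return [
        (keyword[0], keyword[1:]),    # 1. head character + remaining characters
        (keyword[:-1], keyword[-1]),  # 2. remaining characters + tail character
        (keyword,),                   # 3. complete characters, no split
    ]


# A three-character string chosen purely for illustration.
splits = preliminary_splits("农产品")
```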
In step 202, the corresponding word jump probability table is obtained according to the attribute information of the target object to be searched.
When the target object to be searched is a web page, the attribute information of the target object is specifically one or more of the website topic type, the web page title content and the web page text classification. The website topic type comprises one or more of news, finance, sports, entertainment and general; the web page text classification comprises one or more of prose, narrative text and comprehensive text. For example, the information type (prose, narrative, comprehensive, etc.) is determined from the URL, and the word jump probability table for that classification is selected.
As the example shows, the attribute information of the target object to be searched is also one of the bases on which the word jump probability tables are generated in the embodiment of the present invention; several typical word jump probability tables are shown later in this embodiment.
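Selecting a table by attribute information amounts to a keyed lookup with a general fallback. In the sketch below the dictionary layout, the `text_classification` key and the function name are hypothetical; the two probability values are the "punctuation mark -> preposition" figures quoted elsewhere in this description (0.30 general, 0.80 prose), and the tables are otherwise left incomplete.

```python
from typing import Dict, Tuple

JumpTable = Dict[Tuple[str, str], float]

# Only the one entry that the description quotes for both tables is filled in.
GENERAL_TABLE: JumpTable = {("punctuation mark", "preposition"): 0.30}
PROSE_TABLE: JumpTable = {("punctuation mark", "preposition"): 0.80}

# Hypothetical registry keyed by web page text classification.
TABLES_BY_CLASSIFICATION: Dict[str, JumpTable] = {"prose": PROSE_TABLE}


def select_jump_table(attributes: Dict[str, str]) -> JumpTable:
    """Pick the table for the page's text classification, falling back to
    the general table when the classification is missing or unknown."""
    classification = attributes.get("text_classification", "")
    return TABLES_BY_CLASSIFICATION.get(classification, GENERAL_TABLE)


prose_table = select_jump_table({"text_classification": "prose"})
fallback_table = select_jump_table({})
```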
In step 203, the word jump probability table is looked up in the order of the entries contained in each group of entry objects to obtain the establishment probability of each group of entry objects.
The purpose of this step is to use the word jump probability table, obtained by analyzing historical big data, to calculate the probability that each of the current splitting modes holds. Among the different splitting modes, only when the entry object combination that keeps the keyword whole wins in the probability calculation is the traffic data of the corresponding target object in the initial matching result considered semantically consistent with the keyword entered by the user.
In step 204, the initial matching result is filtered according to the establishment probability of each group of entries, so as to obtain the filtered matching result.
Wherein, each group of entries is the entry object combination obtained by different splitting modes.
Compared with the prior art, the method for judging the semantics has the advantages that the logic is simple and clear, and the accuracy is high after long-time verification.
The following describes how the entry object combinations are formed, based on the 4-entry form analyzed above. Take the keyword to be searched as X_1, X_2, …, X_{n-1}, X_n, where X_i represents a character and i ∈ [1, n]. The preset splitting rule involved in step 201 of Embodiment 1 (into which the preliminary splitting described above is merged) specifically includes:
the context information content in the matching result, which contains the keyword as shown in Fig. 2, is split according to at least two of the following splitting modes to obtain at least two groups of entry objects; wherein the splitting modes include:
Splitting mode one: in the context information content, match against the lexicon the entry formed by X_1 together with the character(s) preceding it; if it matches, the entry is denoted W_2; if not, X_1 alone is treated as a word and denoted W_2. Continue to look for a word before W_2 in the context information content, denoted W_1. X_2, …, X_{n-1}, X_n is denoted W_3. In the context information content, look for a word after X_2, …, X_{n-1}, X_n, denoted W_4. A group of entry objects, denoted W_1 W_2 W_3 W_4 as shown in Fig. 3, is thus obtained.
Splitting mode two: in the context information content, look for a word before X_1, X_2, …, X_{n-1}, denoted C_1. X_1, X_2, …, X_{n-1} is denoted C_2. Match X_n combined with the characters that follow it, and take the longest matched word, denoted C_3. Continue to look for a word after C_3, denoted C_4. A group of entry objects, denoted C_1 C_2 C_3 C_4 as shown in Fig. 4, is thus obtained.
Splitting mode three: take X_1, X_2, …, X_{n-1}, X_n as one word, denoted M_2. In the context information content, look for a word before X_1, denoted M_1. In the context information content, look for two words after X_n, denoted M_3 and M_4. A group of entry objects, denoted M_1 M_2 M_3 M_4 as shown in Fig. 5, is thus obtained.
Splitting mode four: take X_1, X_2, …, X_{n-1}, X_n as one word, denoted N_3. In the context information content, look for two words before X_1, denoted N_1 and N_2. In the context information content, look for a word after X_n, denoted N_4. A group of entry objects, denoted N_1 N_2 N_3 N_4 as shown in Fig. 6, is thus obtained.
Looking for a word before X_1 or looking for a word after X_n is implemented as follows:
in the context information content, starting from the initial reference object of the search, the length of the run of consecutive characters is increased one character at a time and each run is matched against the lexicon; when a run fails to match, the run from the previous round is taken as the word found before X_1 or after X_n;
wherein the initial reference object comprises X_1 or X_n.
It should be emphasized that "looking for a word before X_1" and "looking for a word after X_n" are only one formulation used in the splitting modes above. For example, looking for a word before X_1 also appears in the splitting modes as "continue to look for a word before W_2 in the context information content", and in a specific splitting mode the word search may also include X_1 itself, as in "the entry formed by X_1 together with the character(s) preceding it; if it matches, the entry is denoted W_2". In every form, however, the basic principle can follow the implementation given above: in the context information content, starting from the initial reference object of the search, the length of the run of consecutive characters is increased one character at a time and each run is matched against the lexicon; when a run fails to match, the run from the previous round is taken as the word found.
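Putting the four splitting modes together with the incremental word search gives the sketch below. It is one possible reading of the text rather than the patented implementation: all names are hypothetical, the keyword is assumed to have at least two characters, an empty string stands for "no word available" at the edge of the context, and "longest matched word" in mode two is approximated by the same grow-until-miss routine.

```python
from typing import AbstractSet, Dict, Tuple


def grow_right(text: str, start: int, lexicon: AbstractSet[str], min_len: int = 1) -> str:
    """Extend a run rightwards from `start` until the lexicon lookup misses."""
    best, length = "", min_len
    while start + length <= len(text) and text[start:start + length] in lexicon:
        best = text[start:start + length]
        length += 1
    return best


def grow_left(text: str, end: int, lexicon: AbstractSet[str], min_len: int = 1) -> str:
    """Extend a run leftwards, ending just before `end`, until the lookup misses."""
    best, length = "", min_len
    while end - length >= 0 and text[end - length:end] in lexicon:
        best = text[end - length:end]
        length += 1
    return best


def word_after(text: str, start: int, lexicon: AbstractSet[str]) -> str:
    # Assumption: an unmatched single character is still taken as a word.
    return grow_right(text, start, lexicon) or text[start:start + 1]


def word_before(text: str, end: int, lexicon: AbstractSet[str]) -> str:
    return grow_left(text, end, lexicon) or text[max(end - 1, 0):end]


def split_modes(text: str, s: int, n: int, lexicon: AbstractSet[str]) -> Dict[str, Tuple[str, ...]]:
    """Build the W, C, M and N entry groups for the keyword text[s:s+n]."""
    e = s + n
    keyword = text[s:e]
    # Mode one: X_1 joins the preceding characters if the lexicon allows it.
    w2 = grow_left(text, s + 1, lexicon, min_len=2) or text[s]
    w1 = word_before(text, s + 1 - len(w2), lexicon)
    w3 = text[s + 1:e]
    w4 = word_after(text, e, lexicon)
    # Mode two: X_n joins the following characters if the lexicon allows it.
    c1 = word_before(text, s, lexicon)
    c2 = text[s:e - 1]
    c3 = grow_right(text, e - 1, lexicon, min_len=2) or text[e - 1]
    c4 = word_after(text, e - 1 + len(c3), lexicon)
    # Mode three: keyword kept whole, one word before, two words after.
    m1 = word_before(text, s, lexicon)
    m3 = word_after(text, e, lexicon)
    m4 = word_after(text, e + len(m3), lexicon)
    # Mode four: keyword kept whole, two words before, one word after.
    n2 = word_before(text, s, lexicon)
    n1 = word_before(text, s - len(n2), lexicon)
    n4 = word_after(text, e, lexicon)
    return {
        "W": (w1, w2, w3, w4),
        "C": (c1, c2, c3, c4),
        "M": (m1, keyword, m3, m4),
        "N": (n1, n2, keyword, n4),
    }


# Made-up context and lexicon for illustration; the keyword "中国" starts at index 2.
lexicon = {"文", "明", "文明", "中", "国", "中国", "国家", "家", "的", "概", "念", "概念"}
groups = split_modes("文明中国家的概念", 2, 2, lexicon)
```

In this toy context, group C yields the natural segmentation, while groups M and N treat the keyword as a word; the probability step that follows is what decides between them.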
Further, with reference to the entry object combinations above, screening the initial matching result according to the establishment probability of each group of entries (step 204 of Embodiment 1) to obtain the screened matching result is implemented as follows:
If the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4 is less than the probability of W_1 W_2 W_3 W_4 and/or C_1 C_2 C_3 C_4, the corresponding target object is removed from the initial matching result. Fig. 7 is a schematic diagram of the probability calculation for M_1 M_2 M_3 M_4, whose probability value is P_1 * P_2 * P_3, where P_1 is the probability of a jump from the part of speech of M_1 to the part of speech of M_2, P_2 is the probability of a jump from the part of speech of M_2 to the part of speech of M_3, and P_3 is the probability of a jump from the part of speech of M_3 to the part of speech of M_4; the values of P_1, P_2 and P_3 are obtained by looking up the jump probability table.
If the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4 is greater than or equal to the probability of W_1 W_2 W_3 W_4 and/or C_1 C_2 C_3 C_4, the target object is retained in the screened matching result.
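The product P_1 * P_2 * P_3 and the retain-or-remove decision can be sketched as below. The function names are hypothetical, and because the "and/or" in the rule leaves the exact comparison open, the sketch adopts one reading as an assumption: the best whole-keyword group (M or N) is compared with the best split-keyword group (W or C). The sample probabilities are the illustrative prose values listed in this description.

```python
from typing import Dict, Iterable, Sequence, Tuple

JumpTable = Dict[Tuple[str, str], float]


def establishment_probability(pos_tags: Sequence[str], table: JumpTable,
                              default: float = 0.0) -> float:
    """Multiply the jump probabilities between consecutive parts of speech.
    Assumption: a jump missing from the table contributes `default`."""
    probability = 1.0
    for source, target in zip(pos_tags, pos_tags[1:]):
        probability *= table.get((source, target), default)
    return probability


def keep_target(whole_probs: Iterable[float], split_probs: Iterable[float]) -> bool:
    """Retain the target object when the best whole-keyword group (M or N)
    is at least as probable as the best split-keyword group (W or C).
    Both iterables must be non-empty."""
    return max(whole_probs) >= max(split_probs)


# Illustrative values from the prose example table in this description.
table = {
    ("adjective", "noun"): 0.81,
    ("noun", "punctuation mark"): 0.66,
    ("punctuation mark", "preposition"): 0.80,
}
p_whole = establishment_probability(
    ["adjective", "noun", "punctuation mark", "preposition"], table)
```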
In one implementation scenario, when the amount of target traffic data to be retrieved is large, the process of obtaining the initial matching result and the process of obtaining the screened matching result are preferably executed in parallel, and the method then further includes:
analyzing the attribute information of each target object contained in the screened matching result to obtain a distribution map of the target objects, wherein the regions of the map are delimited by the attribute information;
for first attribute information whose region contains a proportion of target objects exceeding a preset threshold, adding a weighting value when subsequently calculating the probability of M_1 M_2 M_3 M_4 or N_1 N_2 N_3 N_4, so that target objects belonging to the first attribute information have a higher probability of passing the screening. To make the weighting value more effective, the target objects in a region can be verified by an operator; the criterion "the proportion of target objects exceeds a preset threshold" can then be replaced by "the number of times target objects are judged incorrect is less than a preset threshold". The preset threshold can be set empirically, taking into account the total amount of traffic data analyzed for the target objects to be searched.
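The attribute-based weighting can be sketched as a running tally over the screened results. Everything specific here is an assumption for illustration: the function names, representing each target object by a single attribute string, and applying the weighting value additively (the text only says the probability is "increased by" it).

```python
from collections import Counter
from typing import Dict, Sequence


def attribute_weights(screened_attributes: Sequence[str], threshold: float,
                      bonus: float) -> Dict[str, float]:
    """Give a bonus to every attribute whose share of the screened target
    objects exceeds `threshold`."""
    counts = Counter(screened_attributes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {attr: bonus for attr, count in counts.items() if count / total > threshold}


def weighted_probability(probability: float, attribute: str,
                         weights: Dict[str, float]) -> float:
    """Assumption: the weighting value is added to the M or N probability."""
    return probability + weights.get(attribute, 0.0)


# Hypothetical screened results: three finance pages and one sports page.
weights = attribute_weights(["finance", "finance", "finance", "sports"],
                            threshold=0.5, bonus=0.05)
boosted = weighted_probability(0.20, "finance", weights)
```

When the two processes run in parallel, the weights would be recomputed as more screened results arrive, so that later probability calculations benefit from earlier screening outcomes.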
In the preferred scheme of the invention, in the searching process, the attribute information of the searched target object is also dynamically collected and sorted, so that a weighted value with more referential meaning is provided for the subsequent calculation process, and the searching accuracy is further improved.
In this embodiment of the present invention, the word jump probability table specifically includes:
analyzing the flow data of the potential target object through the big data, and obtaining the part of speech of each entry in the corresponding flow data according to a word bank matching mode; wherein, the part of speech includes one or more items of nouns, verbs, adjectives, adverbs, prepositions, sentence heads, sentence tails and punctuation marks;
wherein, the jump probability table records the probability of completing the corresponding forward and backward sequence jump among the vocabulary entries corresponding to each part of speech.
For example, the general jump probability table used when the attribute information cannot be determined is as follows:
[Table images in the original publication: general jump probability table; not reproduced in this text]
For prose, by contrast, sentences are on average shorter and punctuation marks more frequent, for example "day, blue, heart, gray."; the jump probabilities are then schematically as follows:
P(adjective -> noun) = 0.81
P(period -> noun) = 0.88
P(period -> adjective) = 0.21
P(verb -> noun) = 0.72
P(verb -> adjective) = 0.19
P(preposition -> adjective) = 0.55
P(preposition -> punctuation mark) = 0.10
P(noun -> punctuation mark) = 0.66
P(punctuation mark -> noun) = 0.91
P(punctuation mark -> preposition) = 0.80
Comparing the two, it can be clearly seen that the probability of "punctuation mark -> preposition" is raised to 0.80 in prose, whereas it is only 0.30 in the general jump probability table. The other probability values are given only as examples and should not be taken as actual values. The probability of each jump type can be computed statistically from the semantic analysis of existing traffic data, namely as the ratio of the number of occurrences of that jump type to the total number of jumps in all the traffic data.
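The statistic described above, the number of occurrences of a jump type divided by the total number of jumps, can be sketched as follows. The function name and the input format (one list of part-of-speech tags per sentence, already produced by word-stock matching) are assumptions made for illustration.

```python
from collections import Counter


def jump_probabilities(tagged_sentences):
    """Estimate jump probabilities from part-of-speech sequences.

    Each probability is the count of one (from, to) jump divided by the total
    number of jumps observed, as described in the text.
    """
    counts = Counter()
    for tags in tagged_sentences:
        # Every adjacent pair of tags is one jump.
        counts.update(zip(tags, tags[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {jump: n / total for jump, n in counts.items()}
```

Building one such table per attribute value (for example one for prose and one general table) would give the attribute-specific tables the text refers to. Normalizing per source part of speech instead of over all jumps would be a different estimator, one the text does not describe.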
Because the embodiment of the present invention matches first and analyzes semantics afterwards, in contrast to the prior-art approach of first splitting the traffic data semantically and then matching, the present invention also provides a dedicated method for completing the preliminary matching. Matching the keyword to be searched against the traffic data to obtain the initial matching result specifically includes:
converting the keyword into to-be-searched codes in UTF-8, GB2312 and/or BIG5 encoding, and matching the traffic data of the target object to be searched against the to-be-searched codes one by one to obtain the initial matching result.
The traditional method is to perform word segmentation on the whole article or the whole sentence and then search in the full set of words. The present invention instead analyzes the keyword in advance, locates it by binary matching, and only then judges whether the matched content is semantically valid, which makes the overall flow more efficient; the performance cost depends mainly on the keyword hit rate.
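The preliminary binary matching can be sketched as follows, assuming the traffic data is available as raw bytes. The function names, the size of the context window and the shape of the returned tuples are assumptions; only the three encodings come from the text.

```python
ENCODINGS = ("utf-8", "gb2312", "big5")


def encode_keyword(keyword, encodings=ENCODINGS):
    """Encode the keyword once per encoding, skipping encodings that cannot
    represent it (for example simplified characters absent from BIG5)."""
    patterns = {}
    for name in encodings:
        try:
            patterns[name] = keyword.encode(name)
        except UnicodeEncodeError:
            continue
    return patterns


def initial_matches(keyword, traffic, context=16):
    """Return (encoding, offset, context_bytes) for every hit in `traffic`.

    `context` is the assumed number of bytes kept on each side of a hit for
    the later semantic screening.
    """
    hits = []
    for name, pattern in encode_keyword(keyword).items():
        start = traffic.find(pattern)
        while start != -1:
            lo = max(0, start - context)
            hi = start + len(pattern) + context
            hits.append((name, start, traffic[lo:hi]))
            start = traffic.find(pattern, start + 1)
    return hits
```

Because only the keyword is encoded and the traffic is never segmented up front, the cost of the later semantic check is paid only for the hits, which matches the remark that the performance cost depends mainly on the hit rate. A pure-ASCII keyword would encode identically in all three encodings and be reported once per encoding, so a real implementation would probably deduplicate.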
The keywords discussed in the foregoing content of the embodiment of the present invention are generally single entries. In practice, a keyword may also take the form of a combination of entries, or even a sentence, in which case the keyword has a part-of-speech combination characteristic. Practice shows that the proportions of different part-of-speech combinations differ greatly between traffic data with different attribute information. A possible improvement in combination with the embodiment of the present invention is therefore as follows: as shown in FIG. 8, when the number of words of the keyword exceeds a preset value (that is, the keyword is by default not formed by a single entry), before the splitting of the context information content in the initial matching result according to the preset splitting rule to obtain at least two groups of entry objects, the method includes:
in step 301, a part-of-speech combination for the keyword is obtained by matching according to a lexicon.
In the embodiment of the present invention, the functions of the word stock include at least: determining a part of speech by matching, determining by matching that a string is a complete entry, and determining by matching the existence probability of each entry when two or more entries are satisfied at the same time. Determining the existence probability of each entry when two or more entries are satisfied at the same time is particularly relevant to the situations that may arise in the embodiment when "looking for a word before X1" or "looking for a word after Xn". In particular, this arises when the search is set to complete only when matching fails, rather than as soon as a first match is found, so that two or more entries may be satisfied at the same time.
In step 302, according to the part of speech combination, a weighted value in the process of calculating the probability of each group of vocabulary item objects corresponding to each attribute information is obtained.
Simply splitting the keyword in this way resolves most semantically ambiguous scenarios. In general, the keyword set by the user is a whole (a word or a sentence) with an independent and complete meaning, but in a small number of cases the first character of the keyword is part of another word, or the last character is part of another word. Take the keyword "China": the person who set the search strategy certainly intends "China" as the concept of a country, but in the sentence "the concept of the country in this stellar culture does not exist at all" the characters of "China" are adjacent only because they belong to two neighboring words (this is visible in the original Chinese rather than in translation); there "China" is not a word, and still less the concept of a country. Cases in which the first two characters or the last two characters of the keyword belong to other words are rare, so only the common scenario needs to be considered; the logic is therefore simple and easy to implement, and the performance cost is low.
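The word lookups used when splitting ("looking for a word after Xn", and the backward combination matching on the last character of the keyword) can be sketched as a longest-match search that grows the candidate one character at a time and stops at the first length that is not in the word stock. The function name, the use of a Python set as the word stock and the single-character fallback are assumptions made for illustration.

```python
def find_word_after(text, start, lexicon):
    """Return the word beginning at `start`, found by growing the candidate
    one character at a time until it is no longer in `lexicon`.

    The candidate from the previous round is returned. When even the first
    character is not in the lexicon, that single character is returned
    (an assumed fallback).
    """
    word = ""
    for end in range(start + 1, len(text) + 1):
        candidate = text[start:end]
        if candidate not in lexicon:
            break
        word = candidate
    return word or text[start:start + 1]
```

As a hypothetical illustration, with the word stock `{"国", "国家"}` and the context `"文化中国家的"`, starting from the last character of the keyword "中国" (index 3) yields `"国家"`, which shows how the last character of a keyword can be absorbed into a following word. Searching before X1 would mirror this by growing the candidate leftwards.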
In addition, the present invention does not need to perform semantic analysis on the whole article or the whole sentence; it only determines the most probable combination according to the word attributes marked in the word stock. The probability values are additionally weighted by the length of the keyword and by the number of times the keyword occurs in the whole of the information, so that very high semantic accuracy can be obtained.
Different information classifications have different word jump probability tables. Because semantic analysis is not performed on the whole article or sentence, the accuracy of the word jump probability tables needs to be made as high as possible.
Example 2:
fig. 9 is a schematic structural diagram of a semantic keyword searching apparatus based on human body status according to an embodiment of the present invention. The semantic search keyword device based on the human body state of the present embodiment includes one or more processors 21 and a memory 22. In fig. 9, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 9 illustrates the connection by a bus as an example.
The memory 22, as a non-volatile computer-readable storage medium for a method and apparatus for semantic searching for keywords, may be used to store non-volatile software programs and non-volatile computer-executable programs, such as the method for semantic searching for keywords in example 1. The processor 21 performs a method of semantically searching for keywords by executing a non-volatile software program and instructions stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the method of semantically searching for keywords in embodiment 1 above, e.g., perform the various steps shown in fig. 1 and/or fig. 7 described above.
It should be noted that, because the information interaction and execution processes between the modules and units of the apparatus and system are based on the same concept as the method embodiment of the present invention, their details may be found in the description of the method embodiment and are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A method for searching keywords according to meanings, in which a keyword to be searched and traffic data of a target object to be searched are obtained, and an initial matching result is obtained by matching the keyword to be searched against the traffic data, the initial matching result comprising the context information content corresponding to the keyword to be searched in the traffic data, characterized in that the method comprises the following steps:
splitting the context information content in the initial matching result according to a preset splitting rule to obtain at least two groups of entry objects;
acquiring a corresponding word jump probability table according to the attribute information of the target object to be searched;
searching the word jump probability table according to the order of the entries contained in each group of entry objects to obtain the establishment probability of each group of entry objects;
screening the initial matching result according to the establishment probability of each group of entry objects to obtain a screened matching result;
the keyword to be searched is X1, X2, …, Xn-1, Xn, wherein Xi represents a character, i ∈ [1, n]; the preset splitting rule specifically comprises:
splitting the context information content in the matching result according to at least two splitting modes to obtain at least two groups of entry objects; wherein the splitting modes include:
splitting mode one: in the context information content, matching in the word stock an entry formed by X1 together with the characters preceding it; if a match is found, the entry is denoted W2; if not, X1 alone is recognized as a word and denoted W2; in the context information content, continuing to look for a word before W2, denoted W1; X2, …, Xn-1, Xn is denoted W3; in the context information content, looking for a word after X2, …, Xn-1, Xn, denoted W4; a group of entry objects, denoted W1W2W3W4, is thereby obtained;
splitting mode two: in the context information content, looking for a word before X1, X2, …, Xn-1, denoted C1; said X1, X2, …, Xn-1 is denoted C2; performing backward combination matching on Xn to find the longest matching word, denoted C3; continuing to look for a word after C3, denoted C4; a group of entry objects, denoted C1C2C3C4, is thereby obtained;
splitting mode three: treating X1, X2, …, Xn-1, Xn as one word, denoted M2; in the context information content, looking for a word before X1, denoted M1; in the context information content, looking for two words after Xn, denoted M3 and M4; a group of entry objects, denoted M1M2M3M4, is thereby obtained;
splitting mode four: treating X1, X2, …, Xn-1, Xn as one word, denoted N3; in the context information content, looking for two words before X1, denoted N1 and N2; in the context information content, looking for a word after Xn, denoted N4; a group of entry objects, denoted N1N2N3N4, is thereby obtained.
2. The method for semantic search of keywords according to claim 1, wherein looking for a word before X1 or looking for a word after Xn is specifically implemented as follows:
in the context information content, starting from the starting reference object of the search, increasing the length of the run of consecutive characters one character at a time and matching each run against the word stock; when no matching result is obtained, the run of consecutive characters with the length of the previous round is determined to be the word found before X1 or the word found after Xn;
wherein the starting reference object comprises said X1 or said Xn.
3. The method of claim 1, wherein the step of screening the initial matching result according to the probability of occurrence of each set of entries to obtain a screened matching result comprises:
if the probability of M1M2M3M4 or N1N2N3N4 is less than the probability of W1W2W3W4 and/or C1C2C3C4, removing the corresponding target object from the initial matching result;
if the probability of M1M2M3M4 or N1N2N3N4 is greater than or equal to the probability of W1W2W3W4 and/or C1C2C3C4, retaining the target object in the screened matching result.
4. The method for semantic search of keywords according to claim 1, wherein the process of obtaining the initial matching results and the process of obtaining the filtered matching results are executed in parallel, and the method further comprises:
analyzing and obtaining a distribution map of each target object according to the attribute information of each target object contained in the screened matching result; wherein the area of the map is calibrated by the attribute information;
for first attribute information whose area contains a proportion of target objects exceeding a preset threshold, adding a weighted value when subsequently calculating the probability of M1M2M3M4 or N1N2N3N4, so that target objects belonging to the first attribute information have a higher probability of passing the screening.
5. The method for searching keywords semantically according to claim 1, wherein when the target object to be searched is a web page, the attribute information of the target object to be searched is specifically one or more of a website topic type, a web page title content, and a web page text classification.
6. The method for semantic search of keywords according to claim 5, wherein the topic type of the website comprises one or more of news, finance, sports, entertainment, and comprehensive;
the webpage text classification comprises one or more of prose, narrative text and comprehensive text.
7. The method for semantic search of a keyword according to claim 1, wherein the word jump probability table is specifically:
analyzing the flow data of the potential target object through the big data, and obtaining the part of speech of each entry in the corresponding flow data according to a word bank matching mode; wherein, the part of speech includes one or more items of nouns, verbs, adjectives, adverbs, prepositions, sentence heads, sentence tails and punctuation marks;
wherein, the jump probability table records the probability of completing the corresponding forward and backward sequence jump among the vocabulary entries corresponding to each part of speech.
8. The method for semantic search of keywords according to claim 1, wherein when the number of words of the keyword exceeds a preset value, before the splitting of the context information content in the initial matching result according to a preset splitting rule is performed to obtain at least two sets of entry objects, the method comprises:
matching to obtain part-of-speech combinations of the keywords according to a word bank;
and obtaining the weighted value in the probability calculation process of each group of vocabulary item objects corresponding to each attribute information according to the part of speech combination.
9. An apparatus for semantically searching for a keyword, the apparatus comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor and programmed to perform the method of semantically searching for keywords according to any of claims 1-8.
CN201910433774.6A 2019-05-23 2019-05-23 Method and device for searching keywords according to meanings Active CN110209765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910433774.6A CN110209765B (en) 2019-05-23 2019-05-23 Method and device for searching keywords according to meanings


Publications (2)

Publication Number Publication Date
CN110209765A CN110209765A (en) 2019-09-06
CN110209765B true CN110209765B (en) 2021-03-30

Family

ID=67788362


Country Status (1)

Country Link
CN (1) CN110209765B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant