CN110008310A - Content search method and device - Google Patents

Content search method and device

Info

Publication number
CN110008310A
Authority
CN
China
Prior art keywords
operator
corpus
search
disambiguation
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910270196.9A
Other languages
Chinese (zh)
Inventor
任宁
卢彦博
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201910270196.9A
Publication of CN110008310A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a content search method and device. The method includes: adding labels to a corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus; and defining a search expression according to the search requirement and using the search expression to search for target strings in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators and relational operators, and each operator forms one content search condition. In the technical solution provided by the embodiments, the corpus is first analyzed and labeled, a search expression is then customized according to the search requirement, and logic rules are introduced through the combination of keywords and operators in the search expression, so that material matching a specified sentence pattern can be searched for accurately in a massive corpus, improving the efficiency of corpus collection.

Description

Content search method and device
Technical field
This application relates to the technical field of natural language processing, and in particular to a content search method and device.
Background
In language modeling and other language research work, it is often necessary to collect a corpus and to use the collected corpus to discover, summarize and generalize language rules and to verify them. In current language research, when a researcher wishes to study a certain type of sentence pattern, material for that pattern is obtained through everyday accumulation, collection, enumeration, imagination and similar means. For example, when a researcher wishes to study the "A-not-AB" pattern, such as "高不高兴" ("happy or not"), "愿不愿意" ("willing or not") and "好不好玩" ("fun or not"), the collected corpus contains no ready-made material of this kind, so the researcher usually obtains it by enumeration and imagination. Material obtained this way is not comprehensive and is slow to gather, which reduces the efficiency and depth of language research.
To improve the efficiency and depth of language research, a more efficient method of obtaining corpus material is needed. This application therefore provides a content search method that can accurately search a massive corpus for material matching a specified sentence pattern, thereby improving the efficiency of corpus collection.
Summary of the invention
The embodiments of the present application provide a content search method and device that can accurately search a massive corpus for material matching a specified sentence pattern, thereby improving the efficiency of corpus collection.
In a first aspect, the embodiments of the present application provide a content search method. The method includes: adding labels to a corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus; and defining a search expression according to the search requirement and using the search expression to search for target strings in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators and relational operators, and each operator forms one content search condition.
In a second aspect, the embodiments of the present application provide a content search device. The device includes: a corpus processing module, configured to add labels to a corpus using a preset analysis model, where the analysis model includes a vocabulary model and adding labels to the corpus includes adding labels to content of specified categories in the corpus; and a search module, configured to define a search expression according to the search requirement and to use the search expression to search for target strings in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators and relational operators, and each operator forms one content search condition.
As the above technical solution shows, the embodiments of the present application provide a content search method and device. The method includes adding labels to a corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus; and defining a search expression according to the search requirement and using the search expression to search for target strings in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators and relational operators, and each operator forms one content search condition. The technical solution thus first analyzes and labels the corpus, then customizes a search expression according to the search requirement, and introduces logic rules through the combination of keywords and operators in the search expression, so that material matching a specified sentence pattern can be searched for accurately in a massive corpus, improving the efficiency of corpus collection.
Brief description of the drawings
To explain the technical solution of the application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a content search method provided by the embodiments of the present application;
Fig. 2 is a flow chart of searching for a target string provided by the embodiments of the present application;
Fig. 3 is a flow chart of step S201 of searching for a target string provided by the embodiments of the present application;
Fig. 4 is a flow chart of screening a search expression provided by the embodiments of the present application;
Fig. 5 is a flow chart of a content search device provided by the embodiments of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the scope of protection of this application.
In language modeling and other language research work, it is often necessary to collect a corpus and to use the collected corpus to discover, summarize and generalize language rules and to verify them. In current language research, when a researcher wishes to study a certain type of sentence pattern, material for that pattern is obtained through everyday accumulation, collection, enumeration, imagination and similar means. For example, when a researcher wishes to study the "A-not-AB" pattern, such as "高不高兴" ("happy or not"), "愿不愿意" ("willing or not") and "好不好玩" ("fun or not"), the collected corpus contains no ready-made material of this kind, so the researcher usually obtains it by enumeration and imagination. Material obtained this way is not comprehensive and is slow to gather, which reduces the efficiency and depth of language research.
To improve the efficiency and depth of language research, a more efficient method of obtaining corpus material is needed. This application therefore provides a content search method and device that can accurately search a massive corpus for material matching a specified sentence pattern, thereby improving the efficiency of corpus collection.
The following is a method embodiment of the present application, which provides a content search method. The method can be applied to a variety of devices such as servers, personal computers (PCs), tablet computers, mobile phones, smart TVs, smart speakers, virtual reality devices and smart wearable devices.
Fig. 1 is a flow chart of a content search method provided by the embodiments of the present application. As shown in Fig. 1, the method includes the following steps:
Step S101: add labels to the corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus.
The analysis model may include one or more of a vocabulary model, a rule model and an algorithm model. A vocabulary model is formed by classifying words according to some classification method and collecting the words of a given class in a vocabulary list. A rule model contains specific analysis rules, according to which specific content can be found in the corpus. An algorithm model contains specific algorithms, according to which specific content can be matched in the corpus. In this application, these analysis models are used to add labels to specific content in the corpus.
Depending on the actual content search requirement, the analysis model can be selected from existing models in a related field, or a new analysis model can be created. For example, when a user needs to search the corpus for content related to animals, an animal name vocabulary such as the one shown in Table 1 can be created and used as the vocabulary model in this application.
Dog                  Cat         Dinosaur    Chicken     Elephant     Mosquito      Snake       Ant
Spider               Lion        Whale       Honeybee    Giant panda  Wild animal   Chimpanzee  Bat
Crocodile            Mouse       Tiger       Cockroach   Horse        Shark         Monkey      Parrot
Dolphin              Pig         Tortoise    Ox          Rabbit       Octopus       Fly         Cicada
Frog                 Giraffe     Wolf        Polar bear  Butterfly    Lizard        Pigeon      Snail
Hamster              Boa         Sheep       Rhinoceros  Sparrow      Cat owl       Goldfish    Crow
Chameleon            Jellyfish   Lobster     Dragonfly   Firefly      Imperial cat  Mammoth     Penguin
Orangutan            Drosophila  Bear        Squirrel    Crab         Earthworm     Piranha     Cobra
Poephila castanotis  Cicada      Kingfisher  Fox         Centipede    Killer whale  Mantis      Silkworm
Table 1: Animal name vocabulary
The corpus is analyzed with the selected analysis model, and labels are added to content of the specified categories in the corpus according to the analysis result. For example, an article contains the sentence:
Cats eat fish, dogs eat meat, and Ultraman fights little monsters
If the article is analyzed with the vocabulary model above, the content shown in Table 2 is obtained for this sentence:
Article ID   Sentence ID   Content   Position   Category
1            10            Cat       0,1        Animal
1            10            Fish      2,3        Animal
1            10            Dog       4,5        Animal
Table 2: Corpus analysis result
Here, "article ID" is the number of the article; if the corpus contains multiple articles, each article has a unique, non-repeating article ID. "Sentence ID" is the number of the sentence within its article; within one article, each sentence has a unique sentence ID. For example, if an article contains 10 sentences, their sentence IDs are 1 to 10 in the order in which the sentences appear, so an article ID together with a sentence ID identifies one sentence. "Content" is the content that the analysis model (for example, the vocabulary model) matched in the sentence identified by the article ID and sentence ID. "Position" is the position of the matched content in the sentence and consists of a start position and an end position separated by a comma: the first character of the sentence is at position 0, the second at position 1, and so on; the start position is the position of the first character of the matched content, and the end position is the position of its last character plus 1. "Category" is the category of the matched content; for example, cat and dog belong to the "animal" category.
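The labeling step and the position convention can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `VOCAB` dictionary, the `tag_sentence` helper and the Chinese sentence are assumptions. The sentence is a presumed original-language form of the example, chosen because it reproduces the character positions listed in Table 2, and "鱼" (fish) is included in the vocabulary only because Table 2 labels it as an animal.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabulary model: maps a word to its category.
VOCAB: Dict[str, str] = {"猫": "animal", "鱼": "animal", "狗": "animal"}

Row = Tuple[int, int, str, Tuple[int, int], str]

def tag_sentence(article_id: int, sentence_id: int, sentence: str,
                 vocab: Dict[str, str]) -> List[Row]:
    """Return (article ID, sentence ID, content, (start, end), category) rows."""
    rows: List[Row] = []
    for word, category in vocab.items():
        start = sentence.find(word)
        while start != -1:
            # Positions are 0-based; end = position of the last character + 1.
            rows.append((article_id, sentence_id, word,
                         (start, start + len(word)), category))
            start = sentence.find(word, start + 1)
    # Order the rows by where the content appears in the sentence.
    return sorted(rows, key=lambda row: row[3])

# Assumed original-language form of the example sentence.
rows = tag_sentence(1, 10, "猫吃鱼，狗吃肉，奥特曼打小怪兽", VOCAB)
for row in rows:
    print(row)
```

Keeping the rows separate from the sentence text corresponds to the association-based labeling variant, in which the corpus itself is left unchanged.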
According to the corpus analysis result, labels are added to the content that the analysis model matched in the corpus. For example, adding labels to the "animal" content according to the analysis result in Table 2 may give:
Cat<animal> eats fish<animal>, dog<animal> eats meat, and Ultraman fights little monsters
Alternatively, labels can be added by associating the corpus with the corpus analysis result. In that case the corpus itself is not changed at all, which preserves the continuity of the corpus and the accuracy of subsequent content searches.
Step S102: define a search expression according to the search requirement, and use the search expression to search for target strings in the labeled corpus.
The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators and relational operators, and each operator forms one content search condition.
The search expression in the embodiments performs in-sentence search on the corpus. In-sentence search means that, when the search expression is executed on the corpus, each sentence is taken in turn as the search target and the search never spans sentences. For example, if the corpus contains sentence 1, sentence 2 and sentence 3, the search expression searches sentence 1, sentence 2 and sentence 3 separately and does not search across sentence 1 and sentence 2.
In-sentence search matches the way language researchers mostly search within sentences, narrows the search scope and improves search speed.
The operators of a search expression include aggregation operators; an aggregation operator can also be regarded as a set concept. A set corresponds to a category of content: for example, the names of all animals can form a set, as can all plant names, all nouns, all verbs, all Chinese herbal medicine names, and so on.
In one embodiment, an aggregation operator is written with the symbols "<>". The symbols may enclose one or more Chinese or English characters, which are either the name of a set or an element of a set.
Depending on whether "<>" encloses a set name or an element of a set, aggregation operators are divided into first-class and second-class aggregation operators. An element can be, for example, a character, a word or a phrase.
A first-class aggregation operator encloses a set name; for example, <n> is the noun aggregation operator and <animal> is the aggregation operator for animal elements. A first-class aggregation operator searches the corpus for elements that the set contains: <n> finds noun elements, <Chinese herbal medicine> finds elements of the Chinese herbal medicine class, and <animal> finds elements of the animal class such as cat, dog and fish. Before the corpus is searched with operators, it can be word-segmented to improve search efficiency.
A second-class aggregation operator encloses an example element of a set. To distinguish it from a first-class aggregation operator, the element is preceded by the distinguishing symbol "~", as in <~Rehmannia glutinosa> or <~cat>. A second-class aggregation operator searches the corpus for elements of the set to which the example element belongs. For example, "Rehmannia glutinosa" belongs to the Chinese herbal medicine category, so <~Rehmannia glutinosa> searches the corpus for elements of that category.
Further, if an example element belongs to several sets at once, the second-class aggregation operator searches the corpus for elements of all sets that contain the example element. For example, "cat" may belong to the animal set, the pet set and the feline set, so <~cat> can find animal elements, pet elements and feline elements in the corpus.
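The difference between the two classes of aggregation operator can be sketched as a lookup over named sets. The `SETS` contents and the `resolve` helper below are hypothetical and only illustrate the described behaviour: a set name returns that set, while an example element prefixed with "~" returns the union of every set containing it.

```python
import re
from typing import Dict, Set

# Hypothetical sets of elements, keyed by set name.
SETS: Dict[str, Set[str]] = {
    "animal": {"cat", "dog", "fish"},
    "pet": {"cat", "dog", "hamster"},
    "feline": {"cat", "lion", "tiger"},
}

def resolve(operator: str, sets: Dict[str, Set[str]]) -> Set[str]:
    """Return the elements an aggregation operator such as <animal> or <~cat> stands for."""
    match = re.fullmatch(r"<(~?)([^<>]+)>", operator)
    if match is None:
        raise ValueError(f"not an aggregation operator: {operator!r}")
    is_example, name = match.group(1) == "~", match.group(2)
    if not is_example:
        # First class: the operator names a set directly.
        return set(sets.get(name, set()))
    # Second class: collect every set that contains the example element.
    result: Set[str] = set()
    for elements in sets.values():
        if name in elements:
            result |= elements
    return result

print(sorted(resolve("<animal>", SETS)))
print(sorted(resolve("<~cat>", SETS)))
```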
In one embodiment, the disambiguation operators include a distance operator, which can be written "{m,n}". One character is one unit of distance; m is the minimum distance and n is the maximum distance. A distance operator can be combined with two disambiguation objects to search the corpus for target strings that contain both disambiguation objects, with the distance between them lying between the minimum and the maximum distance. A disambiguation object is an aggregation operator and/or a keyword. In the embodiments, a keyword is matched against the corpus as a whole word.
A typical combination of a distance operator and disambiguation objects is:
eat{0,5}<animal>
Here the distance operator is "{0,5}" and the disambiguation objects are the keyword "eat" and the aggregation operator "<animal>". The combination searches the corpus for content matching "eat" and "<animal>" where the distance in the corpus between the matched contents is between the minimum distance 0 and the maximum distance 5 (inclusive). For example, "eat" and "lobster" in "eat a lobster" can be found with this combination.
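A distance check of this kind can be sketched in a few lines. The `distance_matches` helper is hypothetical, and it assumes that the distance is the number of characters between the end of the keyword and the start of the matched element; the text does not spell out the exact measurement, and in this English rendering spaces count as characters.

```python
from typing import List, Set, Tuple

def distance_matches(sentence: str, keyword: str, elements: Set[str],
                     m: int, n: int) -> List[Tuple[str, str, int]]:
    """Find (keyword, element, gap) triples where the element starts m..n characters after the keyword ends."""
    hits: List[Tuple[str, str, int]] = []
    k = sentence.find(keyword)
    while k != -1:
        k_end = k + len(keyword)
        for element in sorted(elements):
            e = sentence.find(element, k_end)
            while e != -1:
                gap = e - k_end  # characters between the two matches
                if m <= gap <= n:
                    hits.append((keyword, element, gap))
                e = sentence.find(element, e + 1)
        k = sentence.find(keyword, k + 1)
    return hits

# eat{0,5}<animal>, with a hypothetical animal set
print(distance_matches("eat a lobster", "eat", {"lobster", "cat"}, 0, 5))
```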
In one embodiment, the relational operators include an "or" operator, which can be written "(|)". The "or" operator is used with two "or" objects, placed before and after "|" respectively; an "or" object may be an aggregation operator and/or a keyword. Combined with its "or" objects, the "or" operator searches the corpus for target strings containing any one of the "or" objects.
For example, (<nt>|<n>) searches the corpus for content containing either a group or institution name (nt) or a noun; (carry forward|develop) searches for content containing either "carry forward" or "develop". "Carry forward the spirit of dedication" can be found with (carry forward|develop), and so can "develop advanced productive forces".
In some embodiments, an "or" operator may contain more than two "or" objects. For example, (<n>|develop|revitalize) searches the corpus for content containing a noun, "develop" or "revitalize".
In some embodiments, one or more "or" operators can be combined with other operators to form more complex expressions, such as:
(<nt>|<n>){0,5}<v>no
(carry forward|develop){0,9}(culture|spirit)
culture{0,3}(<n>|develop|revitalize)
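For keyword-only cases, an "or" operator combined with a distance operator maps naturally onto a regular expression. The sketch below is one possible translation, not the patent's search engine; `or_distance_regex` is a hypothetical helper, and aggregation operators such as <n> are left out because they would first have to be expanded into their elements.

```python
import re
from typing import List

def or_distance_regex(left: List[str], m: int, n: int, right: List[str]) -> re.Pattern:
    """Build a regex for (l1|l2|...){m,n}(r1|r2|...) over plain keywords."""
    left_alt = "|".join(re.escape(word) for word in left)
    right_alt = "|".join(re.escape(word) for word in right)
    # "." never matches a newline, so a match stays inside one line of text.
    return re.compile(rf"(?:{left_alt}).{{{m},{n}}}(?:{right_alt})")

# (carry forward|develop){0,9}(culture|spirit)
pattern = or_distance_regex(["carry forward", "develop"], 0, 9, ["culture", "spirit"])
print(bool(pattern.search("carry forward the spirit of dedication")))
print(bool(pattern.search("develop advanced culture")))
```

In the second call the two keywords are 10 characters apart, one more than the distance operator allows.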
In some embodiments, a sentence pattern can be summarized by combining operators of each class, for example:
<v(1)>not<v(2)=v(1)> summarizes patterns in which the same verb appears before and after "not", such as "eat-not-eat" and "good-not-good";
<v(1)><n>not<v(2)=v(1)> summarizes patterns such as "eat-meal-not-eat" and "hit-person-not-hit", in which the verb before and after "not" is the same;
<v(1)><n(1)>not<v(2)=v(1)><n(2)=n(1)> summarizes patterns such as "hit-person-not-hit-person" and "eat-meal-not-eat-meal", in which both the verb and the noun before and after "not" are the same.
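The equality constraints such as v(2)=v(1) behave like regular-expression backreferences. The sketch below illustrates that idea with hypothetical English stand-ins; the `VERBS` and `NOUNS` lists and the spaces between words are assumptions of this rendering and are not part of the source expressions.

```python
import re

# Hypothetical members of the <v> and <n> sets.
VERBS = ["eat", "hit"]
NOUNS = ["rice", "people"]

verb_alt = "|".join(map(re.escape, VERBS))
noun_alt = "|".join(map(re.escape, NOUNS))

# <v(1)>not<v(2)=v(1)>: the second verb must equal the first (\1).
V_NOT_V = re.compile(rf"\b({verb_alt}) not \1\b")

# <v(1)><n(1)>not<v(2)=v(1)><n(2)=n(1)>: verb and noun must both repeat.
VN_NOT_VN = re.compile(rf"\b({verb_alt}) ({noun_alt}) not \1 \2\b")

print(bool(V_NOT_V.search("eat not eat")))
print(bool(VN_NOT_VN.search("hit people not hit people")))
```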
In one embodiment, the disambiguation operators further include a positive disambiguation operator and a negative disambiguation operator.
A positive disambiguation operator combined with a disambiguation object searches the corpus for target strings that contain the disambiguation object. A positive disambiguation operator combined with a distance operator and a disambiguation object searches for target strings that contain the disambiguation object within the range given by the distance operator. A negative disambiguation operator combined with a distance operator and a disambiguation object searches for target strings in which the disambiguation object does not appear within the range given by the distance operator.
In one embodiment, the positive disambiguation operator can be written "+" and the negative disambiguation operator "-". Combining them with other operators mainly serves to require certain content to appear in the matched target string and to forbid other content from appearing in it. The basic form of such a combination is, for example:
+X1-{m,n}X2
Here X1 and X2 are disambiguation objects, which can be aggregation operators, keywords and so on. X1 is content that the matched target string must contain, X2 is content that must not appear in it, and {m,n} is a distance operator. The combination therefore means: when X1 appears, X2 is not allowed within m to n characters after X1. The following examples illustrate the positive and negative disambiguation operators:
Example 1: -{0,5}credit+card-{0,7}report loss
This means that "credit" must not appear within the 5 characters before "card", and "report loss" must not appear within the 7 characters after "card".
Example 2: +card-{0,7}report loss+{0,5}find
This means that "find" must appear within the 5 characters after "card", and "report loss" must not appear within the 7 characters after "card".
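The two examples can be checked against plain sentences with a small helper. This is a sketch under assumptions: `satisfies` and its parameters are hypothetical, the English phrases "reported lost" and "found" stand in for the keywords, and distance is taken as the number of characters between the anchor keyword and the other content.

```python
from typing import Sequence, Tuple

Constraint = Tuple[str, int, int]  # (content, minimum gap, maximum gap)

def occurs_after(sentence: str, anchor_end: int, constraint: Constraint) -> bool:
    """True if the content starts low..high characters after the anchor ends."""
    word, low, high = constraint
    pos = sentence.find(word, anchor_end)
    while pos != -1:
        if low <= pos - anchor_end <= high:
            return True
        pos = sentence.find(word, pos + 1)
    return False

def occurs_before(sentence: str, anchor_start: int, constraint: Constraint) -> bool:
    """True if the content ends low..high characters before the anchor starts."""
    word, low, high = constraint
    pos = sentence.find(word)
    while pos != -1 and pos + len(word) <= anchor_start:
        if low <= anchor_start - (pos + len(word)) <= high:
            return True
        pos = sentence.find(word, pos + 1)
    return False

def satisfies(sentence: str, anchor: str,
              required_after: Sequence[Constraint] = (),
              forbidden_after: Sequence[Constraint] = (),
              forbidden_before: Sequence[Constraint] = ()) -> bool:
    """Check positive (required) and negative (forbidden) constraints around an anchor keyword."""
    start = sentence.find(anchor)
    while start != -1:
        end = start + len(anchor)
        if (all(occurs_after(sentence, end, c) for c in required_after)
                and not any(occurs_after(sentence, end, c) for c in forbidden_after)
                and not any(occurs_before(sentence, start, c) for c in forbidden_before)):
            return True
        start = sentence.find(anchor, start + 1)
    return False

# Example 1: -{0,5}credit +card -{0,7}report loss
print(satisfies("credit card stolen", "card",
                forbidden_before=[("credit", 0, 5)],
                forbidden_after=[("reported lost", 0, 7)]))
```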
Thus, by combining positive disambiguation operators, negative disambiguation operators, distance operators and disambiguation objects into a search expression designed for the need at hand, a user can accurately locate target strings in the corpus. For example, when a user wants to search the corpus for addresses other than household-register addresses, the following search expression can be used:
household register+{0,2}address
For a search expression containing positive and negative disambiguation operators, the expression should also meet the following requirements to keep the search logic sound and improve search performance:
1. The positive disambiguation operator "+" cannot be omitted;
2. The negative disambiguation operator "-" cannot be used consecutively;
3. A negative disambiguation operator must be immediately followed by a distance operator.
In addition, to improve search performance, all search expressions should also meet the following requirements:
1. "Or" operators cannot be nested in multiple layers;
2. A search expression must contain a keyword;
3. Keywords cannot appear only inside "or" operators;
4. Keywords cannot appear only after a negative disambiguation operator "-".
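Some of these requirements are purely syntactic and can be checked before a search is run. The sketch below covers only three of them (a negative operator without a distance operator, nested "or" operators, and a missing keyword); `check_expression` is a hypothetical helper and assumes that "+", "-", "(", ")" and "|" never occur inside keywords.

```python
import re
from typing import List

def check_expression(expr: str) -> List[str]:
    """Return a list of rule violations found in a search expression."""
    problems: List[str] = []
    compact = expr.replace(" ", "")
    # A "-" must be immediately followed by a distance operator {m,n}.
    if re.search(r"-(?!\{\d+,\d+\})", compact):
        problems.append("negative operator not followed by a distance operator")
    # "Or" operators (parentheses) must not be nested.
    depth = 0
    for ch in compact:
        if ch == "(":
            depth += 1
            if depth > 1:
                problems.append('nested "or" operator')
                break
        elif ch == ")":
            depth -= 1
    # Whatever remains after removing all operators must be a keyword.
    leftover = re.sub(r"<[^<>]*>|\{\d+,\d+\}|[+\-()|]", "", compact)
    if not leftover:
        problems.append("no keyword")
    return problems

print(check_expression("+(<card>|bank card)-{0,7}report loss+{0,5}find"))
print(check_expression("-credit+card"))
```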
Fig. 2 is a flow chart of searching for a target string provided by the embodiments of the present application.
As shown in Fig. 2, after the search expression is defined in step S102, searching for target strings in the labeled corpus using the search expression includes the following steps:
Step S201: split the search expression into multiple slots according to preset slot separators and obtain the first location information of the slots, where each slot contains one expression fragment composed of aggregation operators and/or keywords.
In some embodiments, the slot separators include disambiguation operators and relational operators. With these as slot separators, step S201 can include the following steps, as shown in Fig. 3:
Step S301: split the search expression into multiple slots using the disambiguation operators and obtain the first location information.
First, the disambiguation operators and their combinations, such as "+", "+{m,n}" and "-{m,n}", are located in the search expression. Then, with the located "+", "+{m,n}", "-{m,n}" and so on as slot separators, the search expression is split a first time at the separator positions, giving multiple expression fragments composed of aggregation operators and/or keywords; each expression fragment is one slot. Finally, with the first character of the search expression at position 1 and each character occupying one position unit, the first location information of each expression fragment is determined.
For example, take the search expression: +(<card>|bank card)-{0,7}report loss+{0,5}find
The slot separators are "+", "-{0,7}" and "+{0,5}", so the first split of this search expression yields the slots and first location information shown in Table 3:
        Expression fragment   Start position   End position
Slot 1  (<card>|bank card)    2                11
Slot 2  report loss           17               19
Slot 3  find                  25               27
Table 3: Result of the first split
As Table 3 shows, the first location information of an expression fragment consists of a start position and an end position. With the first character of the search expression at position 1 and each character occupying one position, the start position is the position in the search expression of the fragment's first character, and the end position is the position in the search expression of the fragment's last character plus 1.
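The first split can be sketched with a regular expression that recognizes the separators "+", "+{m,n}" and "-{m,n}". Two things here are assumptions: the `split_slots` helper is hypothetical, and the Chinese expression is a presumed original-language form of the example, used because the positions in Table 3 count characters of that form rather than of the English rendering.

```python
import re
from typing import List, Tuple

# A slot separator: "+" or "-", optionally followed by a distance operator {m,n}.
SEPARATOR = re.compile(r"[+-](?:\{\d+,\d+\})?")

def split_slots(expr: str) -> List[Tuple[str, int, int]]:
    """Return (fragment, start, end) with 1-based start and end = last position + 1."""
    slots: List[Tuple[str, int, int]] = []
    cursor = 0
    for sep in SEPARATOR.finditer(expr):
        if sep.start() > cursor:
            # 0-based [cursor, sep.start()) becomes 1-based start and end.
            slots.append((expr[cursor:sep.start()], cursor + 1, sep.start() + 1))
        cursor = sep.end()
    if cursor < len(expr):
        slots.append((expr[cursor:], cursor + 1, len(expr) + 1))
    return slots

# Assumed original-language form of +(<card>|bank card)-{0,7}report loss+{0,5}find
for slot in split_slots("+(<卡>|银行卡)-{0,7}挂失+{0,5}找到"):
    print(slot)
```

A keyword that itself contains "+" or "-" would be split incorrectly by this sketch; a real implementation would need an escaping rule.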
Step S302: analyze whether the expression fragment in each slot contains a relational operator. If it does, perform a second split with the relational operator as the slot separator, splitting the expression fragment into multiple slots, and update the first location information.
For example, in the result shown in Table 3, the fragment "(<card>|bank card)" in slot 1 contains an "or" operator. With the "or" operator as the slot separator, the fragment in slot 1 is split at "|" into "<card>" and "bank card", and the first location information is updated as shown in Table 4:
        Expression fragment   Start position   End position
Slot 1  <card>                2                6
Slot 2  bank card             7                10
Slot 3  report loss           17               19
Slot 4  find                  25               27
Table 4: Result of the second split
In addition, if an expression fragment contains a combination of aggregation operators and keywords, the keywords in the combination are split out individually during the second split.
For example, the expression fragment <v>not<n> is split into "<v>", "not" and "<n>".
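The second split can be sketched as a tokenizer that separates aggregation operators from keywords and drops the "or" punctuation. The `split_fragment` helper is hypothetical, and this minimal version leaves out the position update that step S302 also performs.

```python
import re
from typing import List

# Either a whole aggregation operator <...> or a run of keyword characters.
TOKEN = re.compile(r"<[^<>]*>|[^<>|()]+")

def split_fragment(fragment: str) -> List[str]:
    """Split an expression fragment into aggregation operators and keywords."""
    return [token.strip() for token in TOKEN.findall(fragment) if token.strip()]

print(split_fragment("(<card>|bank card)"))
print(split_fragment("<v>not<n>"))
```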
Step S202: obtain the core keyword from the expression fragments that contain keywords, according to preset screening conditions.
The screening conditions for obtaining the core keyword may include:
1. A core keyword does not form a relational operator together with an aggregation operator;
2. A core keyword is not the first expression fragment after a negative disambiguation operator.
For example, in Table 4 the keyword "bank card" forms an "or" operator with the aggregation operator "<card>", so "bank card" is not a core keyword. The keyword "report loss" is the first expression fragment after the negative disambiguation operator "-", so "report loss" is not a core keyword either. "Find" is therefore determined to be the core keyword.
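The two screening conditions can be sketched as a filter over the slots of Table 4. The `Slot` record and its two boolean fields are hypothetical; they stand for information that the splitting steps would have to record for each slot.

```python
from typing import List, NamedTuple

class Slot(NamedTuple):
    text: str
    in_or_with_set: bool   # shares an "or" operator with an aggregation operator
    after_negative: bool   # first fragment after a negative disambiguation operator

def core_keywords(slots: List[Slot]) -> List[str]:
    """Keep keywords that pass both screening conditions."""
    return [slot.text for slot in slots
            if not slot.text.startswith("<")   # aggregation operators are not keywords
            and not slot.in_or_with_set
            and not slot.after_negative]

TABLE_4 = [
    Slot("<card>", True, False),
    Slot("bank card", True, False),
    Slot("report loss", False, True),
    Slot("find", False, False),
]
print(core_keywords(TABLE_4))
```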
Step S203: search the corpus for all sentences that contain the core keyword.
Because the corpus was analyzed with the analysis model in step S101, which produced the corpus analysis result shown in Table 2 and determined information such as article ID and sentence ID, the article ID and sentence ID of each sentence containing the core keyword can be determined once those sentences have been found.
Step S204: use the aggregation operators and keywords of the search expression to find at least one matching content in each sentence, and obtain the second location information of each matching content in the sentence.
Illustratively, when the kernel keyword is "found", step S203 may obtain the following sentence from the corpus:
After the peony card was reported lost, it was found
Then, searching the above sentence with the search expression "+(<card>|bank card)-{0,7}report loss+{0,5}found", combined with the slot analysis results of Table 4, yields the results shown in Table 5:
Table 5: Matching content lookup results
As shown in Table 5, the second location information consists of a start position and an end position, both of which can be obtained from the analysis results shown in Table 2 after the corpus is analyzed in step S101.
Step S205: according to the first location information, the second location information, and the disambiguation operators of the search expression, analyze whether the matching content satisfies the distance and disambiguation requirements defined by the search expression; if it does, output the corresponding sentence as a target character string.
Illustratively:
Search expression: +(<card>|bank card)-{0,7}report loss+{0,5}found
Matching content: After the peony card was reported lost, it was found
According to the second location information in Table 5, "peony card" in slot 1 and "report loss" in slot 3 are separated by 0 characters (the end position of "peony card" is the same as the start position of "report loss"), so "peony card" and "report loss" satisfy the distance requirement of the distance operator "{0,7}" between slot 1 and slot 3. However, because of the negative disambiguation operator "-", "report loss" must not appear within 0 to 7 characters after "peony card", so "peony card" and "report loss" do not satisfy the disambiguation requirement. Therefore, "After the peony card was reported lost, it was found" is not output as a target character string.
Illustratively:
Search expression: +(<card>|bank card)-{0,7}report loss+{0,5}found
Matching content: The peony card was finally found
According to the second location information, "peony card" in slot 1 and "found" in slot 4 are separated by 2 characters (the difference between the end position of "peony card" and the start position of "found" is 2 characters), so "peony card" and "found" satisfy the distance requirement of the distance operator "{0,5}" between slot 1 and slot 4. At the same time, because of the positive disambiguation operator "+", "peony card" and "found" satisfy the disambiguation requirement. Moreover, "report loss" does not appear within 0 to 7 characters after "peony card", so the distance and disambiguation requirements of the negative disambiguation operator are also satisfied. Therefore, "The peony card was finally found" can be output as a target character string.
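The check in step S205 can be sketched as follows. The contents of Table 5 are not reproduced in this text, so the character spans used here are hypothetical; they are chosen only so that the gaps agree with the two examples (0 characters between "peony card" and "report loss", 2 characters between "peony card" and "found"). The `Constraint` structure, and the choice to measure both distances from the card match, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    anchor: str      # disambiguation object the distance is measured from
    target: str      # disambiguation object the distance is measured to
    min_gap: int     # minimum distance of the distance operator
    max_gap: int     # maximum distance of the distance operator
    positive: bool   # True for "+", False for "-"


def gap(spans, anchor, target):
    """Characters between the end of anchor and the start of target, or None."""
    if anchor not in spans or target not in spans:
        return None
    return spans[target][0] - spans[anchor][1]


def satisfies(spans, constraints):
    """True if every positive constraint holds and no negative one is triggered."""
    for c in constraints:
        g = gap(spans, c.anchor, c.target)
        in_range = g is not None and c.min_gap <= g <= c.max_gap
        if in_range != c.positive:
            return False
    return True


# Constraints read off "+(<card>|bank card)-{0,7}report loss+{0,5}found".
CONSTRAINTS = [
    Constraint("card", "report loss", 0, 7, positive=False),
    Constraint("card", "found", 0, 5, positive=True),
]

if __name__ == "__main__":
    # Hypothetical (start, end) spans of each matching content in the sentence.
    lost_then_found = {"card": (0, 3), "report loss": (3, 5), "found": (6, 9)}
    finally_found = {"card": (0, 3), "found": (5, 8)}
    print(satisfies(lost_then_found, CONSTRAINTS))
    print(satisfies(finally_found, CONSTRAINTS))
```

Treating a negative constraint as "the target must not fall in the range" and a positive one as "the target must fall in the range" lets a single comparison cover both kinds of disambiguation operator.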
In some embodiments, to further improve corpus search precision and to narrow down the sentence patterns to be searched, after step S201 is performed and before step S202 is performed, the expression fragments obtained by segmentation may also be given a preliminary screening according to the search expression segmentation result of step S201, so that search expressions that do not meet the search requirement are removed.
Fig. 4 is a flowchart of screening a search expression provided by an embodiment of the present application. As shown in Fig. 4, the screening process includes the following steps:
Step S401: match multiple regular expressions against the expression fragments contained in the slots.
The regular expressions are set according to the search requirement. For example, when the requirement is to search the corpus for sentence patterns of the form "A not AB" (A-not-AB questions, roughly "happy or not", "willing or not", "fun or not"), the following two regular expressions can be defined:
<.*\(\d+\)>
<.*\(\d+\)=\(\d+\)>
According to the description of search expressions above, the search expression corresponding to the "A not AB" sentence pattern is: <v(1)>not<v(2)=v(1)>. Of the two regular expressions defined above, "<.*\(\d+\)>" is used to match "<v(1)>", and "<.*\(\d+\)=\(\d+\)>" is used to match "<v(2)=v(1)>".
It should be added that the syntax rules of regular expressions belong to the prior art and are not described in detail in this application.
Step S402: if at least one of the regular expressions fails to match any expression fragment, determine that the search expression does not meet the search requirement.
Specifically, when both regular expressions match expression fragments obtained in step S201, the search expression contains "<v(1)>not<v(2)=v(1)>", so it is determined that the search expression meets the search requirement. If at least one regular expression fails to match a corresponding expression fragment, it is determined that the search expression does not meet the search requirement. In addition, to ensure an accurate judgment, when both regular expressions match expression fragments, it is also necessary to check whether the contents they matched are consistent, for example whether the "v(1)" matched by the first regular expression is identical to the "v(1)" matched by the second regular expression. If they are identical, it is determined that the search expression meets the search requirement.
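The screening of steps S401 and S402, including the consistency check, might look like the sketch below. The two patterns are adapted from the ones in the text rather than copied: capture groups are added so that the matched "v(1)" parts can be compared, "=" is excluded from the first pattern so that it cannot also match the second fragment, and the second pattern is written on the assumption that the fragment has the form "<v(2)=v(1)>". The names `DEFINITION`, `REFERENCE`, and `meets_requirement` are hypothetical.

```python
import re

# Adapted patterns (assumptions, see the lead-in above).
DEFINITION = re.compile(r"<([^=<>]*\(\d+\))>")                   # e.g. <v(1)>
REFERENCE = re.compile(r"<[^=<>]*\(\d+\)=([^=<>]*\(\d+\))>")     # e.g. <v(2)=v(1)>


def meets_requirement(fragments):
    """True if both patterns match some fragment and the matches are consistent."""
    defined = [m.group(1) for f in fragments if (m := DEFINITION.fullmatch(f))]
    referenced = [m.group(1) for f in fragments if (m := REFERENCE.fullmatch(f))]
    # Step S402: at least one pattern matched nothing.
    if not defined or not referenced:
        return False
    # Consistency check: every referenced item must have been defined.
    return all(r in defined for r in referenced)


if __name__ == "__main__":
    print(meets_requirement(["<v(1)>", "not", "<v(2)=v(1)>"]))
    print(meets_requirement(["<v>", "not", "<n>"]))
```

Matching whole fragments with `fullmatch` relies on the expression already having been split into fragments in step S201, which keeps each pattern from spanning several fragments at once.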
As can be seen from the above technical solution, the embodiments of the present application provide a content search method, comprising: adding labels to a corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus; and defining a search expression according to a search need, and using the search expression to search for a target character string in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators, and relational operators, and each operator forms a content search condition. Thus, the method provided by the embodiments of the present application first analyzes the corpus and adds labels to it, then customizes a search expression according to the search need, and introduces logic rules into the search expression through the combination of keywords and operators. In this way, corpus text with a specified sentence pattern can be searched accurately from a massive corpus, which improves the efficiency of corpus collection.
The following is an apparatus embodiment of the present application, which provides a content search device. The device can be applied to a variety of equipment, such as servers, personal computers (PCs), tablet computers, mobile phones, smart TVs, smart speakers, virtual reality devices, and smart wearable devices. For details not disclosed in the apparatus embodiment, refer to the method embodiments of the present application.
Fig. 5 is a schematic diagram of a content search device provided by an embodiment of the present application. As shown in Fig. 5, the device includes:
a corpus processing module 501, configured to add labels to a corpus using a preset analysis model, where the analysis model includes a vocabulary model, and adding labels to the corpus includes adding labels to content of specified categories in the corpus;
a search module 502, configured to define a search expression according to a search need, and to use the search expression to search for a target character string in the labeled corpus;
where the search expression includes at least one operator and a keyword, the operators include aggregation operators, disambiguation operators, and relational operators, and each operator forms a content search condition.
As can be seen from the above technical solution, the embodiments of the present application provide a content search device, configured to: add labels to a corpus using a preset analysis model, where adding labels to the corpus includes adding labels to content of specified categories in the corpus; and define a search expression according to a search need, and use the search expression to search for a target character string in the labeled corpus. The search expression includes at least one operator and a keyword; the operators include aggregation operators, disambiguation operators, and relational operators, and each operator forms a content search condition. Thus, the device provided by the embodiments of the present application can analyze the corpus and add labels to it, then customize a search expression according to the search need, and introduce logic rules into the search expression through the combination of keywords and operators. In this way, corpus text with a specified sentence pattern can be searched accurately from a massive corpus, which improves the efficiency of corpus collection.
Those skilled in the art will readily conceive of other embodiments of the application after considering the specification and practicing the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or conventional technical means in the art that are not disclosed in this application. The specification and examples are to be regarded as illustrative only, and the true scope and spirit of the application are indicated by the following claims.
It should be understood that the application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (10)

1. A content search method, characterized by comprising:
adding labels to a corpus using a preset analysis model, wherein adding labels to the corpus includes adding labels to content of specified categories in the corpus;
defining a search expression according to a search need, and using the search expression to search for a target character string in the labeled corpus;
wherein the search expression includes at least one operator and a keyword, the operators include an aggregation operator, a disambiguation operator, and a relational operator, and each operator forms a content search condition.
2. The method according to claim 1, characterized in that:
the aggregation operator includes a first-class aggregation operator and a second-class aggregation operator;
wherein the first-class aggregation operator contains a set name and is used to search the corpus for the elements contained in the set, the elements including at least characters, words, and phrases; and the second-class aggregation operator contains an example element and is used to search the corpus for the elements of the set to which the example element belongs.
3. The method according to claim 1, characterized in that:
the disambiguation operator includes a distance operator, the distance operator indicating a minimum distance and a maximum distance; the distance operator is combined with two disambiguation objects and is used to search the corpus for a target character string that contains the two disambiguation objects and in which the distance between the two disambiguation objects lies between the minimum distance and the maximum distance;
wherein the disambiguation objects include the aggregation operator and/or the keyword, and the keyword is used to perform a full-word search on the corpus.
4. The method according to claim 1, characterized in that:
the relational operator includes an "or" operator; the "or" operator is combined with two "or" objects and is used to search the corpus for a target character string containing either of the "or" objects;
wherein the "or" objects include the aggregation operator and/or the keyword, and the keyword is used to perform a full-word search on the corpus.
5. The method according to any one of claims 1-4, characterized in that using the search expression to search for a target character string in the labeled corpus comprises:
dividing the search expression into a plurality of slots according to a preset slot separator, and obtaining first location information of the slots, wherein each slot contains an expression fragment composed of the aggregation operator and/or the keyword;
obtaining a kernel keyword from the expression fragments that contain keywords according to preset screening conditions;
searching the corpus for all sentences containing the kernel keyword;
using the aggregation operator and the keyword of the search expression to find at least one matching content in the sentences, and obtaining second location information of each matching content in the sentence;
analyzing, according to the first location information, the second location information, and the disambiguation operator of the search expression, whether the matching content satisfies the distance and disambiguation requirements defined by the search expression, and if so, outputting the corresponding sentence as a target character string.
6. The method according to claim 5, characterized in that the slot separator includes the disambiguation operator and the relational operator, and dividing the search expression into a plurality of slots according to a preset slot separator and obtaining first location information of the slots comprises:
dividing the search expression into a plurality of slots using the disambiguation operator, and obtaining the first location information;
analyzing whether the expression fragment contained in each slot contains a relational operator, and if so, using the relational operator as the slot separator to divide the expression fragment into a plurality of slots, and updating the first location information.
7. The method according to claim 5, characterized in that:
the disambiguation operator further includes a positive disambiguation operator and a negative disambiguation operator;
the positive disambiguation operator is combined with a disambiguation object and is used to search the corpus for a target character string containing the disambiguation object;
the positive disambiguation operator is combined with the distance operator and a disambiguation object and is used to search the corpus for a target character string that satisfies the position constraint of the distance operator and contains the disambiguation object;
the negative disambiguation operator is combined with the distance operator and a disambiguation object and is used to search the corpus for a target character string that satisfies the position constraint of the distance operator and does not contain the disambiguation object;
wherein the disambiguation object includes the aggregation operator and/or the keyword, and the keyword is used to perform a full-word search on the corpus.
8. The method according to claim 7, characterized in that the screening conditions comprise:
the kernel keyword does not form a relational operator together with the aggregation operator;
the kernel keyword is not the first expression fragment after the negative disambiguation operator.
9. The method according to claim 5, characterized by further comprising:
matching multiple regular expressions against the expression fragments contained in the slots;
if at least one of the regular expressions fails to match any expression fragment, determining that the search expression does not meet the search requirement;
wherein the regular expressions are set according to the search requirement.
10. A content search device, characterized by comprising:
a corpus processing module, configured to add labels to a corpus using a preset analysis model, wherein the analysis model includes a vocabulary model, and adding labels to the corpus includes adding labels to content of specified categories in the corpus;
a search module, configured to define a search expression according to a search need, and to use the search expression to search for a target character string in the labeled corpus;
wherein the search expression includes at least one operator and a keyword, the operators include an aggregation operator, a disambiguation operator, and a relational operator, and each operator forms a content search condition.
CN201910270196.9A 2019-04-04 2019-04-04 A kind of content search method and device Pending CN110008310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910270196.9A CN110008310A (en) 2019-04-04 2019-04-04 A kind of content search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910270196.9A CN110008310A (en) 2019-04-04 2019-04-04 A kind of content search method and device

Publications (1)

Publication Number Publication Date
CN110008310A true CN110008310A (en) 2019-07-12

Family

ID=67169945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910270196.9A Pending CN110008310A (en) 2019-04-04 2019-04-04 A kind of content search method and device

Country Status (1)

Country Link
CN (1) CN110008310A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190712

Assignee: Zhongke Dingfu (Beijing) Science and Technology Development Co., Ltd.

Assignor: Beijing Shenzhou Taiyue Software Co., Ltd.

Contract record no.: X2019990000214

Denomination of invention: Content searching method and apparatus

License type: Exclusive License

Record date: 20191127

CB02 Change of applicant information

Address after: Room 818, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Applicant after: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190712