CN110263127A - Method and device for text search based on a user query term - Google Patents

Method and device for text search based on a user query term

Info

Publication number
CN110263127A
Authority
CN
China
Prior art keywords
segment
participle
core
speech
participle segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910544979.1A
Other languages
Chinese (zh)
Inventor
王晓珂
潘希阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuangxin Journey Network Technology Co Ltd
Original Assignee
Beijing Chuangxin Journey Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuangxin Journey Network Technology Co Ltd
Priority to CN201910544979.1A
Publication of CN110263127A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention relate to a method for text search based on a user query term. The method includes: segmenting the user query term to obtain word segments; calling a preset natural language rule model, where the natural language rule model is predetermined based on at least one of part of speech, syntactic structure, and named entity among natural-language constituent attributes, and its output is either a core word segment or a non-core word segment; using the word segments as input parameters of the natural language rule model and screening the word segments according to the output of the natural language rule model to obtain first core word segments; and performing a text search using the first core word segments. Embodiments of the present invention can improve text search precision.

Description

Method and device for text search based on a user query term
Technical field
Embodiments of the present invention relate to the field of information processing, and more particularly to a method and device for text search based on a user query term.
Background
Currently, search systems mainly segment the user query term entered by the user, randomly match the resulting word segments against an inverted index to retrieve documents, and return the matched documents to the user in ranked order. Because the word segments are matched at random, less relevant documents may be retrieved and displayed, or word segments may be dropped during matching, so that a large number of irrelevant documents are retrieved and drown out the useful ones.
Summary of the invention
To solve the above problems in the prior art, embodiments of the present invention provide a method and device for text search based on a user query term.
In a first aspect, an embodiment of the present invention provides a method for text search based on a user query term. The method includes: segmenting the user query term to obtain word segments; calling a preset natural language rule model, where the natural language rule model is predetermined based on at least one of part of speech, syntactic structure, and named entity among natural-language constituent attributes, and its output is either a core word segment or a non-core word segment; using the word segments as input parameters of the natural language rule model and screening the word segments according to the output of the natural language rule model to obtain first core word segments; and performing a text search using the first core word segments.
In one embodiment, the method further includes: calling a pre-trained model, where the trained model is predetermined based on at least one of part of speech, word length, syntactic structure, and named entity among the natural-language constituent attributes, and its output includes a weight value used to determine whether a word segment becomes a core word segment; using the word segments as input parameters of the trained model and determining, from the output of the trained model, the weight value of each word segment for becoming a core word segment; determining second core word segments according to those weight values, where the second core word segments include the first core word segments; and performing a text search using the second core word segments.
In one embodiment, the method further includes: confirming that the number of first core word segments has not reached a preset quantity threshold; the number of second core word segments is equal to the preset quantity threshold.
In one embodiment, the method further includes predetermining the named entity as follows: performing named entity recognition on each word segment using both a pre-trained named entity model and preset named entity matching rules; when only one of the named entity model and the named entity matching rules recognizes a named entity, taking the recognized named entity as the named entity of the word segment; and when both the named entity model and the named entity matching rules recognize a named entity, taking the named entity recognized by the named entity matching rules as the named entity of the word segment.
In one embodiment, the method further includes predetermining the part of speech as follows: performing part-of-speech recognition on each word segment using both a pre-trained part-of-speech tagging model and preset part-of-speech matching rules; when only one of the part-of-speech tagging model and the part-of-speech matching rules recognizes a part of speech, taking the recognized part of speech as the part of speech of the word segment; and when both the part-of-speech tagging model and the part-of-speech matching rules recognize a part of speech, taking the part of speech recognized by the part-of-speech matching rules as the part of speech of the word segment.
In one embodiment, the method further includes predetermining the syntactic structure as follows: performing syntactic structure recognition on each word segment using both a pre-trained syntactic structure model and preset syntactic structure matching rules; when only one of the syntactic structure model and the syntactic structure matching rules recognizes a syntactic structure, taking the recognized syntactic structure as the syntactic structure of the word segment; and when both the syntactic structure model and the syntactic structure matching rules recognize a syntactic structure, taking the result recognized by the syntactic structure matching rules as the syntactic structure of the word segment.
In a second aspect, an embodiment of the present invention provides a device for text search based on a user query term. The device includes: a segmentation unit, configured to segment the user query term to obtain word segments; a calling unit, configured to call a preset natural language rule model, where the natural language rule model is predetermined based on at least one of part of speech, syntactic structure, and named entity among natural-language constituent attributes, and its output is either a core word segment or a non-core word segment; a processing unit, configured to use the word segments as input parameters of the natural language rule model and to screen the word segments according to the output of the natural language rule model to obtain first core word segments; and a search unit, configured to perform a text search using the first core word segments.
In one embodiment, the calling unit is further configured to: call a pre-trained model, where the trained model is predetermined based on at least one of part of speech, word length, syntactic structure, and named entity among the natural-language constituent attributes, and its output includes a weight value used to determine whether a word segment becomes a core word segment; use the word segments as input parameters of the trained model and determine, from the output of the trained model, the weight value of each word segment for becoming a core word segment; determine second core word segments according to those weight values, where the second core word segments include the first core word segments; and perform a text search using the second core word segments.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory for storing instructions; and a processor for calling the instructions stored in the memory to execute any of the methods above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions that, when run on a computer, execute any of the methods above.
With the method and device for text search based on a user query term provided by embodiments of the present invention, the user query term is segmented to obtain word segments, and a preset natural language rule model is called. Rules for core and non-core word segments are predetermined in the natural language rule model, so the model can determine which word segments are core word segments. Performing the text search with the determined core word segments can improve text search precision.
Brief description of the drawings
The above and other objects, features, and advantages of embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, in which:
Fig. 1 is a flowchart of an implementation of the method for text search based on a user query term provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another implementation of the method for text search based on a user query term provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the device for text search based on a user query term provided by an embodiment of the present invention.
Detailed description
The principle and spirit of the present invention are described below with reference to several illustrative embodiments. It should be understood that these embodiments are provided only to help those skilled in the art better understand and implement the present invention, and not to limit the scope of the invention in any way.
Note that although expressions such as "first" and "second" are used herein to describe different modules, steps, data, and so on in embodiments of the present invention, these expressions serve only to distinguish between different modules, steps, and data, and do not indicate a particular order or degree of importance. In fact, expressions such as "first" and "second" are fully interchangeable.
The method and device for text search based on a user query term provided by embodiments of the present invention can be applied to scenarios in which a website performs a text search according to a user query term entered by a user. In this scenario, the method may be executed by a device for text search based on a user query term, which may be a terminal such as a server, a computer, or a mobile terminal; embodiments of the present invention place no limitation on this.
Fig. 1 is a flowchart of an implementation of the method for text search based on a user query term provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
S101: Segment the user query term to obtain word segments.
In embodiments of the present invention, the user query term is the text a user enters according to a search need; for example, "winter, Russia was joyful", entered by a user according to a search need, is called the user query term. The user query term is segmented to obtain word segments; for example, segmenting the user query term "winter, Russia was joyful" yields "winter", "Russia", "joyful", and " ", which are the word segments after segmentation.
S102: Call a preset natural language rule model, where the natural language rule model is predetermined based on at least one of part of speech, syntactic structure, and named entity among natural-language constituent attributes, and its output is either a core word segment or a non-core word segment.
In embodiments of the present invention, to identify which of the word segments are core word segments, a natural language rule model can be preset using natural language processing techniques. The natural language rule model may be determined based on the natural-language constituent attributes used in natural language description. The constituent attribute may be, for example, part of speech, syntactic structure, or named entity, or a combination of several of part of speech, syntactic structure, and named entity. It will be understood that, in the present disclosure, the constituent attributes used to determine the natural language rule model are not limited to part of speech, syntactic structure, and named entity; other constituent attributes, such as word length, may also be used.
The natural language rule model may use a preset part-of-speech rule base, a preset syntactic structure rule base, or a preset named entity rule base to recognize and label the part of speech, syntactic structure, or named entity of each input word segment by dictionary matching. Alternatively, the natural language rule model may use a combination of several of these rule bases to recognize and label a combination of part of speech, syntactic structure, and named entity for each input word segment by dictionary matching. In the present disclosure, the input of the natural language rule model may be a word segment, and its output is either a core word segment or a non-core word segment. In a possible embodiment, the rules in the natural language rule model can be preset. After a word segment is input to the natural language rule model, if it satisfies the preset core word segment rule, the model outputs it as a core word segment; if it does not, the model outputs it as a non-core word segment.
In one example, the rules in the natural language rule model may be predetermined as follows. In one option, a word segment whose syntactic role is attributive, subject, or adverbial and whose part of speech is noun, proper noun, or time word is determined to be a core word segment; any other word segment is determined to be a non-core word segment. In another option, a word segment whose syntactic role is attributive or subject and whose part of speech is noun or time word is determined to be a core word segment; any other word segment is determined to be a non-core word segment. The specific rules set in the natural language rule model can be tuned for the actual application, and embodiments of the present invention place no limitation on them.
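The first rule option described above can be sketched as a simple predicate over labeled word segments. This is a minimal illustration, not the patent's implementation: the `Segment` class, the label strings, and the labels assigned to each example segment are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    pos: str   # part of speech (hypothetical label set)
    role: str  # syntactic role (hypothetical label set)


# First rule option from the text: role in {attributive, subject, adverbial}
# and part of speech in {noun, proper noun, time word}.
CORE_ROLES = frozenset({"attributive", "subject", "adverbial"})
CORE_POS = frozenset({"noun", "proper_noun", "time_word"})


def is_core(seg, roles=CORE_ROLES, pos_tags=CORE_POS):
    """Return True when the segment satisfies the core word segment rule."""
    return seg.role in roles and seg.pos in pos_tags


# Labels below are assumed purely for illustration.
segments = [
    Segment("winter", "time_word", "attributive"),
    Segment("Russia", "proper_noun", "object"),
    Segment("joyful", "adjective", "predicate"),
]

first_core = [s.text for s in segments if is_core(s)]
non_core = [s.text for s in segments if not is_core(s)]
```

Passing different `roles` and `pos_tags` sets to `is_core` gives the second rule option, which is one way to keep the rules tunable per application.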
S103: Use the word segments as input parameters of the natural language rule model, and screen the word segments according to the output of the natural language rule model to obtain first core word segments.
In embodiments of the present invention, the word segments are input to the natural language rule model as input parameters. The natural language rule model recognizes and labels the part of speech, syntactic structure, and named entity of each input word segment by dictionary matching, and then screens the word segments into core and non-core word segments according to the preset rules, yielding the core word segments among the word segments.
In embodiments of the present invention, suppose for example that the rule in the natural language rule model is that a word segment whose sentence role is attributive, subject, or adverbial and whose part of speech is noun, proper noun, or time word is a core word segment. When an input word segment satisfies the rule, the model outputs the word segment and labels it as a core word segment; when an input word segment does not satisfy the rule, the model outputs the word segment and labels it as a non-core word segment.
For convenience of description, embodiments of the present invention refer to the core word segments screened out by the natural language rule model as first core word segments.
S104: Perform a text search using the first core word segments.
In embodiments of the present invention, the first core word segments can be used to perform a text search against an inverted index, and the matched documents are presented to the user in a certain order.
When the index is searched using the first core word segments, every retrieved text must contain the first core word segments. If a retrieved text contains non-core word segments in addition to the first core word segments, it is ranked ahead of texts that contain only the first core word segments.
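The retrieval and ordering behavior described above can be sketched with a toy in-memory inverted index. The function names, the sample documents, and the choice to rank by the count of matched non-core segments are assumptions made for illustration; the text only states that documents that also contain non-core segments rank ahead of those that do not.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index


def search(index, core, non_core):
    """Return ids of documents containing every core segment,
    with documents that also match non-core segments ranked first."""
    if not core:
        return []
    # Every result must contain all core segments.
    candidates = set.intersection(*(index.get(t, set()) for t in core))

    def bonus(doc_id):
        # Number of non-core segments the document also contains (assumed ranking signal).
        return sum(1 for t in non_core if doc_id in index.get(t, set()))

    return sorted(candidates, key=lambda d: (-bonus(d), d))


# Hypothetical pre-tokenized documents.
docs = {
    1: ["winter", "travel"],
    2: ["winter", "Russia", "travel"],
    3: ["Russia", "joyful"],
}
index = build_inverted_index(docs)
results = search(index, core=["winter"], non_core=["Russia", "joyful"])
```

Document 3 is excluded because it lacks the core segment, and document 2 precedes document 1 because it also matches a non-core segment.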
Through embodiments of the present invention, the core word segments that the user's search needs can be identified accurately, which noticeably improves search quality.
Fig. 2 is a flowchart of another implementation of the method for text search based on a user query term provided by an embodiment of the present invention. As shown in Fig. 2, the method includes steps S201 to S207. Steps S201 to S203 are the same as steps S101 to S103 in Fig. 1 and are not repeated here. Steps S204 to S207 are described in detail below.
S204: Call a pre-trained model, where the trained model is determined based on the natural-language constituent attributes used in natural language description, and its output includes a weight value used to determine whether a word segment becomes a core word segment.
In embodiments of the present invention, besides using the preset natural language rule model to identify which word segments are core word segments, a pre-trained model can also be used to identify core word segments as a supplement. The trained model may be a probabilistic model; for example, it may be a CRF model trained on a general corpus. The trained model may be determined based on, for example, part of speech, syntactic structure, or named entity among the natural-language constituent attributes, or on a combination of several of part of speech, syntactic structure, and named entity. It will be understood that, in the present disclosure, the constituent attributes of the pre-trained model are not limited to part of speech, syntactic structure, and named entity; other constituent attributes, such as word length or position within the user query term, may also be used.
The trained model obtained from general training can recognize and label the part of speech, syntactic structure, or named entity of a word segment, and can also recognize and label a combination of several of part of speech, syntactic structure, and named entity.
In embodiments of the present invention, the trained model can be determined as follows:
A weight coefficient is preset for each of the attribute categories included in the natural-language constituent attributes, namely part of speech, word length, syntactic structure, and named entity, and a score is preset for each value within each category. The trained model is determined from these weight coefficients and scores.
In one example, the trained model may be predetermined as follows: following the way natural language is processed, weight coefficients are assigned to part of speech, word length, syntactic structure, and named entity among the natural-language constituent attributes; scores are preset for parts of speech such as verb, noun, and adjective; scores are preset separately for word lengths such as 2 and 3; scores are preset separately for syntactic roles such as attributive, adverbial, and subject; and scores are preset separately for named entities such as person name, place name, and organization name. The weight value of each word segment is then obtained with the formula S = ∑(a*S1 + b*S2 + c*S3 + d*S4).
Here, S is the weight value of the word segment; a, b, c, and d are the weight coefficients; and S1, S2, S3, and S4 are the scores of the natural-language constituent attributes corresponding to those weight coefficients.
It will be understood that, in another example, the trained model may be predetermined as follows: following the way natural language is processed, weight coefficients are assigned to part of speech and word length among the natural-language constituent attributes; scores are preset for parts of speech such as verb, noun, and adjective; scores are preset for word lengths such as 2 and 3; and the weight value of each word segment is determined with the formula S = ∑(a*S1 + b*S2).
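The weighted-sum formula above can be sketched as a lookup over preset coefficients and scores. All numeric coefficients and scores here are invented for illustration, and reading the summation as a sum over whichever attributes are supplied is an assumption; the text does not give concrete values.

```python
# Hypothetical weight coefficients a, b, c, d for each attribute category.
COEFF = {"pos": 4, "length": 1, "role": 2, "entity": 3}

# Hypothetical preset scores S1..S4 for values within each category.
SCORES = {
    "pos": {"noun": 3, "verb": 2, "adjective": 1},
    "length": {2: 1, 3: 2},
    "role": {"attributive": 2, "adverbial": 2, "subject": 3},
    "entity": {"person": 1, "place": 2, "organization": 2},
}


def segment_weight(features, coeff=COEFF, scores=SCORES):
    """Compute S as the sum of coefficient * score over the supplied attributes.

    `features` maps an attribute category to the value observed for the
    segment, e.g. {"pos": "noun", "length": 3}. Unknown values score 0.
    """
    return sum(
        coeff[attr] * scores[attr].get(value, 0)
        for attr, value in features.items()
    )


# Four-attribute form: S = a*S1 + b*S2 + c*S3 + d*S4
s_full = segment_weight(
    {"pos": "noun", "length": 3, "role": "attributive", "entity": "place"}
)

# Two-attribute form: S = a*S1 + b*S2
s_short = segment_weight({"pos": "noun", "length": 3})
```

Keeping coefficients and scores in plain tables is one way to make the weights easy to adjust without retraining, in line with the tuning the text describes.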
The trained model can be set, trained, and tuned for the actual application, and embodiments of the present invention place no limitation on this.
S205: Use the word segments as input parameters of the trained model, and determine, from the output of the trained model, the weight value of each word segment for becoming a core word segment.
In embodiments of the present invention, the pre-trained model is called to determine and output a weight value for each word segment, and the core word segments are determined according to the output weight values.
S206: Determine second core word segments according to the weight value of each word segment for becoming a core word segment, where the second core word segments include the first core word segments.
In embodiments of the present invention, the number of core word segments can be preset, for example: the three word segments with the highest weight values output by the trained model are chosen as core word segments, and these core word segments include the first core word segments. Alternatively, word segments whose output weight value exceeds a certain value can be preset to be core word segments, and these core word segments also include the first core word segments.
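The two selection modes above (a fixed count of highest-weight segments, or a weight cut-off) can be sketched as follows. The function name, the weight values, and the way the first core word segments are guaranteed to be included (seeding the result with them before filling by weight) are assumptions; the text only requires that the result include the first core word segments.

```python
def select_second_core(weights, first_core, top_k=None, min_weight=None):
    """Pick second core segments by weight, always keeping the first core segments.

    weights:    dict mapping segment text to its weight value
    first_core: segments already chosen by the rule model
    top_k:      keep this many segments in total (count-based mode)
    min_weight: keep segments whose weight exceeds this value (cut-off mode)
    """
    ranked = sorted(weights, key=lambda seg: -weights[seg])
    chosen = list(first_core)  # the first core segments are always included
    for seg in ranked:
        if seg in chosen:
            continue
        if top_k is not None and len(chosen) < top_k:
            chosen.append(seg)
        elif top_k is None and min_weight is not None and weights[seg] > min_weight:
            chosen.append(seg)
    return chosen


# Hypothetical weight values output by the trained model.
weights = {"winter": 9, "Russia": 7, "joyful": 4}

by_count = select_second_core(weights, ["winter"], top_k=2)
by_cutoff = select_second_core(weights, ["winter"], min_weight=5)
```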
For convenience of description, embodiments of the present invention refer to the core word segments determined according to the weight values of the word segments, that is, the core word segments that include the first core word segments, as second core word segments.
S207: Perform a text search using the second core word segments.
In embodiments of the present invention, the second core word segments are used to perform a text search against an inverted index, and the matched documents are presented to the user in a certain order.
Through embodiments of the present invention, more core word segments can be obtained, and performing the text search with the second core word segments, which include the first core word segments, further improves search accuracy. When core word segments are determined from the weight values of the word segments, no dedicated training data needs to be constructed, which reduces the amount of training data. The weight rules preset in the trained model can be set and tuned for the actual application, which makes adjustment and optimization easier and improves the efficiency of manual intervention.
As an embodiment of the present invention, the method for text search based on a user query term further includes step S208 in addition to steps S201 to S207.
S208: Confirm that the number of first core word segments has not reached a preset quantity threshold; the number of second core word segments is equal to the preset quantity threshold.
In embodiments of the present invention, a threshold for the number of core word segments can be preset. If the number of first core word segments is less than the preset quantity threshold, the pre-trained model is used to determine additional core word segments among the word segments, yielding the second core word segments, and the number of second core word segments is set to the preset quantity threshold.
For example, the word segments are "winter", "Russia", "joyful", and " ", and the preset quantity threshold is 2. The natural language rule model determines "winter" to be a core word segment and "Russia" to be a non-core word segment, so the number of core word segments is less than 2. The weight values of the word segments "winter", "Russia", "joyful", and " " can then be determined according to the rules set in the pre-trained model. If, for example, the weight values rank from high to low as "winter", "Russia", "joyful", " ", then the first core word segment "winter" and the word segment "Russia" are together determined to be the second core word segments.
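The threshold check in S208 can be sketched as a conditional top-up: the weight ranking is consulted only when the rule model produced too few core segments. The function name and the list-based representation are hypothetical; the inputs mirror the example above, with the threshold set to 2.

```python
def top_up_core(first_core, ranked_by_weight, threshold):
    """Extend the first core segments to `threshold` items using the weight ranking.

    first_core:       segments chosen by the rule model
    ranked_by_weight: all segments ordered from highest to lowest weight
    threshold:        preset number of core segments required
    """
    result = list(first_core)
    if len(result) >= threshold:
        # Enough core segments already; the trained model is not needed.
        return result
    for seg in ranked_by_weight:
        if len(result) == threshold:
            break
        if seg not in result:
            result.append(seg)
    return result


# Values taken from the example in the text: the rule model keeps only "winter",
# and the weight ranking is "winter", "Russia", "joyful", " ".
second_core = top_up_core(["winter"], ["winter", "Russia", "joyful", " "], threshold=2)
```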
Through embodiments of the present invention, enough core word segments can be obtained, so that the documents being sought can be located accurately during the search.
For the natural language rule model in embodiments of the present invention, the named entities, parts of speech, and syntactic structures in the model can be preset as follows, through steps S209 to S211. Steps S209 to S211 are described in detail below.
S209: Determine the named entities in the natural language rule model.
In embodiments of the present invention, a named entity can be regarded as an entity with a specific meaning in the user query term entered by the user, and mainly includes person names, place names, organization names, proper nouns, and the like.
In embodiments of the present invention, the named entity may be predetermined as follows: named entity recognition is performed on each word segment using both a pre-trained named entity model and preset named entity matching rules; when only one of the named entity model and the named entity matching rules recognizes a named entity, the recognized named entity is taken as the named entity of the word segment; when both the named entity model and the named entity matching rules recognize a named entity, the named entity recognized by the named entity matching rules is taken as the named entity of the word segment.
Name physical model in the embodiment of the present invention can be based on name Entity recognition (Named Entity Recognition, NER) model and/or condition random field (ConditionalRandom Fields, CRF) model be using general Training obtains corpus in advance.Trained name physical model has versatility, popularity in advance.
Name entity rule library can be collected in the embodiment of the present invention according to actual demands such as website industry or business, Entity is named to participle segment with name Entities Matching rule, that is, dictionary pattern matching rule based on name entity rule Cooley Identification, label.Such as the name entity rule library collected according to tour field place name mechanism name has the special of tourism industry Property, it is fewer but better.
The pre-trained named entity model and the preset named entity matching rules can complement each other in determining the named entity. Because the named entity rule base is collected according to actual needs such as industry or business, recognizing and labeling the named entities of word segments with the named entity matching rules on the basis of the rule base is highly accurate. Therefore, if both the named entity model and the named entity matching rules produce a recognition result, the named entity recognized by the named entity matching rules is determined as the named entity of the word segment. If only one of the two recognizes a named entity, the recognized named entity is determined as the named entity of the word segment.
For example, the word segment "Russia" is recognized as a place name by the pre-trained named entity model and as a country name by the preset named entity matching rules. The country name recognized by the preset named entity matching rules is therefore finally determined as the named entity of the word segment "Russia".
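The precedence described in step S209 might be sketched as follows. This is a minimal sketch under stated assumptions: `stub_ne_model` stands in for a pre-trained NER/CRF model, `NE_RULE_BASE` stands in for the named entity rule base, and `None` is used to mean "nothing recognized"; none of these names come from the embodiment.

```python
from typing import Optional

# Hypothetical rule base (dictionary matching): word segment -> named entity label.
NE_RULE_BASE = {"Russia": "country name"}


def stub_ne_model(segment: str) -> Optional[str]:
    """Stand-in for a pre-trained named entity model; returns None if nothing is recognized."""
    return {"Russia": "place name"}.get(segment)


def resolve_label(model_label: Optional[str], rule_label: Optional[str]) -> Optional[str]:
    """If both sources produce a label, the rule label wins; otherwise use whichever exists."""
    if rule_label is not None:
        return rule_label
    return model_label


def recognize_named_entity(segment: str) -> Optional[str]:
    return resolve_label(stub_ne_model(segment), NE_RULE_BASE.get(segment))


print(recognize_named_entity("Russia"))  # country name
```

Giving the rule base priority reflects the stated reasoning that a small, domain-specific dictionary is more accurate than a general-purpose model on the entries it covers.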
S210: Determine the parts of speech in the natural language rule model.
Parts of speech in the embodiments of the present invention may include, for example, nouns, adjectives, prepositions, time words, verbs and adverbs.
In the embodiments of the present invention, the part of speech can be predetermined as follows: part-of-speech recognition is performed on each word segment based on a pre-trained part-of-speech tagging model and preset part-of-speech matching rules; when only one of the part-of-speech tagging model and the part-of-speech matching rules recognizes a part of speech, the recognized part of speech is determined as the part of speech of the word segment; when both the part-of-speech tagging model and the part-of-speech matching rules recognize a part of speech, the part of speech recognized by the part-of-speech matching rules is determined as the part of speech of the word segment.
In the embodiments of the present invention, a CRF part-of-speech tagging model can be trained in advance on a general corpus based on the CRF model. The CRF part-of-speech tagging model is general-purpose and has broad coverage.
In the embodiments of the present invention, a part-of-speech rule base is established to correct cases where the part of speech output by the trained model is inaccurate compared with human common sense. Word segments are recognized and labeled with parts of speech by applying the part-of-speech matching rules, that is, dictionary matching rules, on the basis of the part-of-speech rule base. The part-of-speech rule base is collected according to common knowledge of natural language; it is small but precise.
The pre-trained part-of-speech tagging model and the preset part-of-speech matching rules can complement each other in determining the part of speech of a word segment. Because the part-of-speech rule base is collected according to common knowledge of natural language, recognizing and labeling the parts of speech of word segments with the preset part-of-speech matching rules on the basis of the rule base is highly accurate. Therefore, if both the part-of-speech tagging model and the part-of-speech matching rules produce a recognition result, the part of speech recognized by the part-of-speech matching rules is determined as the part of speech of the word segment. If only one of the two recognizes a part of speech, the recognized part of speech is determined as the part of speech of the word segment.
For example, the part-of-speech tagging model recognizes the word segment "winter" as a time word, the word segment "Russia" as a noun, the word segment "joyful" as a noun, and the word segment "" as a modal particle. "winter" is in fact a more specific, season-related time word and is recognized as a season word by the part-of-speech matching rules. The season word recognized by the preset part-of-speech matching rules is therefore finally determined as the part of speech of the word segment "winter".
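Applied to the whole example query, step S210 might look like the following sketch. `MODEL_POS` is a hypothetical stand-in for the output of a CRF part-of-speech tagging model and `POS_RULE_BASE` for the part-of-speech rule base; both simply reproduce the labels given in the example.

```python
# Hypothetical model output for the example query (labels taken from the example).
MODEL_POS = {"winter": "time word", "Russia": "noun", "joyful": "noun", "": "modal particle"}

# Hypothetical rule base: it only covers the cases the model is known to get wrong.
POS_RULE_BASE = {"winter": "season word"}


def tag_parts_of_speech(segments):
    """Return {segment: part of speech}, letting the rule base override the model."""
    tags = {}
    for seg in segments:
        model_tag = MODEL_POS.get(seg)     # None if the model recognizes nothing
        rule_tag = POS_RULE_BASE.get(seg)  # None if no rule matches
        tags[seg] = rule_tag if rule_tag is not None else model_tag
    return tags


tags = tag_parts_of_speech(["winter", "Russia", "joyful", ""])
print(tags["winter"])  # season word
```

Only "winter" changes; the other word segments keep the model's labels because no rule matches them.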
S211: Determine the syntactic structures in the natural language rule model.
Syntactic structures in the embodiments of the present invention may include sentence constituents such as subject, predicate, object and attributive.
In the embodiments of the present invention, the syntactic structure can be predetermined as follows: syntactic structure recognition is performed on each word segment based on a pre-trained syntactic structure model and preset syntactic structure matching rules; when only one of the syntactic structure model and the syntactic structure matching rules recognizes a syntactic structure, the recognized syntactic structure is determined as the syntactic structure of the word segment; when both the syntactic structure model and the syntactic structure matching rules recognize a syntactic structure, the recognition result of the syntactic structure matching rules is determined as the syntactic structure of the word segment.
In the embodiments of the present invention, a CRF syntactic structure model can be trained in advance on a general corpus based on the CRF model. The CRF syntactic structure model is general-purpose and has broad coverage.
In the embodiments of the present invention, a syntactic structure rule base is established to correct cases where the syntactic structure output by the trained model is inaccurate compared with human common sense. Word segments are recognized and labeled with syntactic structures by applying the syntactic structure matching rules, that is, dictionary matching rules, on the basis of the syntactic structure rule base. The syntactic structure rule base is collected according to common knowledge of natural language; it is small but precise.
The pre-trained syntactic structure model and the preset syntactic structure matching rules can complement each other in determining the syntactic structure of a word segment. Because the syntactic structure rule base is collected according to common knowledge of natural language, recognizing and labeling the syntactic structures of word segments with the preset syntactic structure matching rules on the basis of the rule base is highly accurate. Therefore, if both the pre-trained syntactic structure model and the preset syntactic structure matching rules produce a recognition result, the syntactic structure recognized by the syntactic structure matching rules is determined as the syntactic structure of the word segment. If only one of the two recognizes a syntactic structure, the recognized syntactic structure is determined as the syntactic structure of the word segment.
For example, a time word must be a time adverbial, and a place name must be a subject or a place adverbial. If such a rule is satisfied, the syntactic structure matching rules forcibly override the model output. Thus, among the word segments "winter", "Russia", "joyful" and "", "winter" is recognized as a time adverbial and "Russia" is recognized as the subject.
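The forced override in step S211 might be sketched as follows. The table `FORCED_ROLES` encodes only the two rules given in the example. How to choose between "subject" and "place adverbial" for a place name is not specified in the text; this sketch assumes the model's answer is kept when it is one of the allowed roles and that the first allowed role is used otherwise.

```python
# Rules from the example: category of the word segment -> syntactic roles it may take.
FORCED_ROLES = {
    "time word": ("time adverbial",),
    "place name": ("subject", "place adverbial"),
}


def resolve_syntax(category, model_role):
    """Return the syntactic role of a segment, overriding the model when a rule applies."""
    allowed = FORCED_ROLES.get(category)
    if allowed is None:
        return model_role  # no rule for this category: keep the model output
    if model_role in allowed:
        return model_role  # the model agrees with the rule
    return allowed[0]      # assumption: fall back to the first allowed role


print(resolve_syntax("time word", "subject"))   # time adverbial
print(resolve_syntax("place name", "subject"))  # subject
```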
Through the embodiments of the present invention, the named entity, part of speech and syntactic structure of a word segment can be accurately recognized by means of natural language processing, which helps determine core word segments from the named entities, parts of speech and syntactic structures of the word segments, and thus improves text search precision.
The method for text search based on a user query word provided by the embodiments of the present invention determines core word segments among the word segments by means of the preset natural language rule model and the pre-trained training model, and performs the text search based on the determined core word segments, which can improve text search precision.
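The overall flow (segmentation, rule-based screening, weight-based top-up, search) might be sketched end to end as follows. Everything here is a simplified stand-in chosen for illustration: whitespace splitting replaces a real word segmenter, `rule_model` and `weight_model` are hypothetical lookups, and the search simply keeps documents that contain every core word segment, which is one possible retrieval strategy and is not prescribed by the embodiment.

```python
def segment_query(query):
    """Stand-in for a real word segmenter."""
    return query.split()


def rule_model(segment):
    """Hypothetical natural language rule model: True means core word segment."""
    return segment in {"winter"}


def weight_model(segment):
    """Hypothetical training model: weight of a segment becoming a core word segment."""
    return {"winter": 0.9, "Russia": 0.7, "joyful": 0.3}.get(segment, 0.0)


def core_segments(query, threshold=2):
    segments = segment_query(query)
    core = [s for s in segments if rule_model(s)]  # first core word segments
    if len(core) < threshold:
        # Top up with the highest-weighted remaining segments (second core word segments).
        rest = sorted((s for s in segments if s not in core), key=weight_model, reverse=True)
        core += rest[: threshold - len(core)]
    return core


def search(query, documents, threshold=2):
    core = core_segments(query, threshold)
    return [doc for doc in documents if all(c in doc for c in core)]


docs = ["Russia travel guide for winter", "winter sports in Norway", "Russia in summer"]
hits = search("winter Russia joyful", docs)
print(hits)  # ['Russia travel guide for winter']
```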
Based on the same inventive concept, an embodiment of the present invention further provides a device for text search based on a user query word. As shown in Fig. 3, the device 300 for text search based on a user query word provided by the embodiment of the present invention includes: a word segmentation unit 301, configured to segment the user query word to obtain word segments; a calling unit 302, configured to call a preset natural language rule model, where the natural language rule model is predetermined based on at least one of part of speech, syntactic structure and named entity among the natural language constituent attributes, and its output includes core word segment or non-core word segment; a processing unit 303, configured to use the word segments as input parameters of the natural language rule model and to screen the word segments according to the output of the natural language rule model to obtain a first core word segment; and a search unit 304, configured to perform a text search using the first core word segment.
In one embodiment, the calling unit 302 is further configured to: call a pre-trained training model, where the training model is predetermined based on at least one of part of speech, word length, syntactic structure and named entity among the natural language constituent attributes, and its output includes a weight value for determining whether a word segment becomes a core word segment; use the word segments as input parameters of the training model, and determine, according to the output of the training model, the weight value with which each word segment becomes a core word segment; determine a second core word segment according to the weight values with which the word segments become core word segments, where the second core word segment includes the first core word segment; and perform a text search using the second core word segment.
In one embodiment, the word segmentation unit 301 is further configured to: confirm that the number of first core word segments does not reach a preset quantity threshold; the number of second core word segments is the preset quantity threshold.
In one embodiment, the processing unit 303 is further configured to predetermine the named entity as follows: perform named entity recognition on each word segment based on a pre-trained named entity model and preset named entity matching rules; when only one of the named entity model and the named entity matching rules recognizes a named entity, determine the recognized named entity as the named entity of the word segment; when both the named entity model and the named entity matching rules recognize a named entity, determine the named entity recognized by the named entity matching rules as the named entity of the word segment.
In one embodiment, the processing unit 303 is further configured to predetermine the part of speech as follows: perform part-of-speech recognition on each word segment based on a pre-trained part-of-speech tagging model and preset part-of-speech matching rules; when only one of the part-of-speech tagging model and the part-of-speech matching rules recognizes a part of speech, determine the recognized part of speech as the part of speech of the word segment; when both the part-of-speech tagging model and the part-of-speech matching rules recognize a part of speech, determine the part of speech recognized by the part-of-speech matching rules as the part of speech of the word segment.
In one embodiment, the processing unit 303 is further configured to predetermine the syntactic structure as follows: perform syntactic structure recognition on each word segment based on a pre-trained syntactic structure model and preset syntactic structure matching rules; when only one of the syntactic structure model and the syntactic structure matching rules recognizes a syntactic structure, determine the recognized syntactic structure as the syntactic structure of the word segment; when both the syntactic structure model and the syntactic structure matching rules recognize a syntactic structure, determine the recognition result of the syntactic structure matching rules as the syntactic structure of the word segment.
An embodiment of the present invention further provides an electronic device, including: a memory, configured to store instructions; and a processor, configured to call the instructions stored in the memory to execute any method in the above possible embodiments.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, execute any method in the above possible embodiments.
It can be understood that although expressions such as "first" and "second" are used herein to describe different modules, steps, data and so on of the embodiments of the present invention, these expressions merely serve to distinguish between different modules, steps, data and so on, and do not indicate a specific order or degree of importance. In fact, expressions such as "first" and "second" can be used completely interchangeably.
It will be further understood that although operations are described in a particular order in the accompanying drawings of the embodiments of the present invention, this should not be construed as requiring that these operations be performed in the particular order shown or in serial order, or that all of the operations shown be performed, in order to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous.
The methods and devices involved in the embodiments of the present invention can be implemented using standard programming techniques, with the various method steps realized by rule-based logic or other logic. It should also be noted that the words "device" and "module" as used herein and in the claims are intended to include implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving input.
Any step, operation or procedure described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other equipment. In one embodiment, a software module is implemented using a computer program product that includes a computer-readable medium containing computer program code, which can be executed by a computer processor to perform any or all of the described steps, operations or procedures.
The foregoing description of implementations of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed; various modifications and variations are possible in light of the above teachings, or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application, so that those skilled in the art can utilize the invention in various embodiments and with various modifications suited to the particular use contemplated.

Claims (10)

1. A method for text search based on a user query word, wherein the method comprises:
segmenting the user query word to obtain word segments;
calling a preset natural language rule model, wherein the natural language rule model is predetermined based on at least one of part of speech, syntactic structure and named entity among natural language constituent attributes, and its output includes core word segment or non-core word segment;
using the word segments as input parameters of the natural language rule model, and screening the word segments according to the output of the natural language rule model to obtain a first core word segment;
performing a text search using the first core word segment.
2. The method according to claim 1, wherein the method further comprises:
calling a pre-trained training model, wherein the training model is predetermined based on at least one of part of speech, word length, syntactic structure and named entity among the natural language constituent attributes, and its output includes a weight value for determining whether a word segment becomes a core word segment;
using the word segments as input parameters of the training model, and determining, according to the output of the training model, the weight value with which each word segment becomes a core word segment;
determining a second core word segment according to the weight values with which the word segments become core word segments, wherein the second core word segment includes the first core word segment;
performing a text search using the second core word segment.
3. The method according to claim 2, wherein the method further comprises:
confirming that the number of first core word segments does not reach a preset quantity threshold;
wherein the number of second core word segments is the preset quantity threshold.
4. The method according to claim 1 or 2, wherein the method further comprises:
predetermining the named entity as follows:
performing named entity recognition on each word segment based on a pre-trained named entity model and preset named entity matching rules;
when only one of the named entity model and the named entity matching rules recognizes a named entity, determining the recognized named entity as the named entity of the word segment;
when both the named entity model and the named entity matching rules recognize a named entity, determining the named entity recognized by the named entity matching rules as the named entity of the word segment.
5. The method according to claim 1 or 2, wherein the method further comprises:
predetermining the part of speech as follows:
performing part-of-speech recognition on each word segment based on a pre-trained part-of-speech tagging model and preset part-of-speech matching rules;
when only one of the part-of-speech tagging model and the part-of-speech matching rules recognizes a part of speech, determining the recognized part of speech as the part of speech of the word segment;
when both the part-of-speech tagging model and the part-of-speech matching rules recognize a part of speech, determining the part of speech recognized by the part-of-speech matching rules as the part of speech of the word segment.
6. The method according to claim 1 or 2, wherein the method further comprises:
predetermining the syntactic structure as follows:
performing syntactic structure recognition on each word segment based on a pre-trained syntactic structure model and preset syntactic structure matching rules;
when only one of the syntactic structure model and the syntactic structure matching rules recognizes a syntactic structure, determining the recognized syntactic structure as the syntactic structure of the word segment;
when both the syntactic structure model and the syntactic structure matching rules recognize a syntactic structure, determining the recognition result of the syntactic structure matching rules as the syntactic structure of the word segment.
7. A device for text search based on a user query word, wherein the device comprises:
a word segmentation unit, configured to segment the user query word to obtain word segments;
a calling unit, configured to call a preset natural language rule model, wherein the natural language rule model is predetermined based on at least one of part of speech, syntactic structure and named entity among natural language constituent attributes, and its output includes core word segment or non-core word segment;
a processing unit, configured to use the word segments as input parameters of the natural language rule model, and to screen the word segments according to the output of the natural language rule model to obtain a first core word segment;
a search unit, configured to perform a text search using the first core word segment.
8. The device according to claim 7, wherein the calling unit is further configured to:
call a pre-trained training model, wherein the training model is predetermined based on at least one of part of speech, word length, syntactic structure and named entity among the natural language constituent attributes, and its output includes a weight value for determining whether a word segment becomes a core word segment;
use the word segments as input parameters of the training model, and determine, according to the output of the training model, the weight value with which each word segment becomes a core word segment;
determine a second core word segment according to the weight values with which the word segments become core word segments, wherein the second core word segment includes the first core word segment;
perform a text search using the second core word segment.
9. An electronic device, wherein the electronic device comprises:
a memory, configured to store instructions; and
a processor, configured to call the instructions stored in the memory to execute the method for text search based on a user query word according to any one of claims 1 to 6.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when run on a computer, execute the method for text search based on a user query word according to any one of claims 1 to 6.
CN201910544979.1A 2019-06-21 2019-06-21 Text search method and device is carried out based on user query word Pending CN110263127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910544979.1A CN110263127A (en) 2019-06-21 2019-06-21 Text search method and device is carried out based on user query word

Publications (1)

Publication Number Publication Date
CN110263127A true CN110263127A (en) 2019-09-20

Family

ID=67920439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910544979.1A Pending CN110263127A (en) 2019-06-21 2019-06-21 Text search method and device is carried out based on user query word

Country Status (1)

Country Link
CN (1) CN110263127A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval
CN102591932A (en) * 2011-12-23 2012-07-18 优视科技有限公司 Voice search method, voice search system, mobile terminal and transfer server
CN102929925A (en) * 2012-09-20 2013-02-13 百度在线网络技术(北京)有限公司 Search method and device based on browsing content
CN103123624A (en) * 2011-11-18 2013-05-29 阿里巴巴集团控股有限公司 Method of confirming head word, device of confirming head word, searching method and device
CN103425691A (en) * 2012-05-22 2013-12-04 阿里巴巴集团控股有限公司 Search method and search system
CN103886063A (en) * 2014-03-18 2014-06-25 国家电网公司 Text retrieval method and device
CN104598445A (en) * 2013-11-01 2015-05-06 腾讯科技(深圳)有限公司 Automatic question-answering system and method
US9116977B2 (en) * 2011-10-10 2015-08-25 Alibaba Group Holding Limited Searching information
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device
CN109033305A (en) * 2018-07-16 2018-12-18 深圳前海微众银行股份有限公司 Question answering method, equipment and computer readable storage medium
CN109582962A (en) * 2018-11-28 2019-04-05 北京创鑫旅程网络技术有限公司 Segmenting method and device
CN109815396A (en) * 2019-01-16 2019-05-28 北京搜狗科技发展有限公司 Search term Weight Determination and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159343A (en) * 2019-12-26 2020-05-15 上海科技发展有限公司 Text similarity searching method, device, equipment and medium based on text embedding
CN111104488A (en) * 2019-12-30 2020-05-05 广州广电运通信息科技有限公司 Method, device and storage medium for integrating retrieval and similarity analysis
CN111104488B (en) * 2019-12-30 2023-10-24 广州广电运通信息科技有限公司 Method, device and storage medium for integrating retrieval and similarity analysis
CN111931480A (en) * 2020-07-03 2020-11-13 北京新联财通咨询有限公司 Method and device for determining main content of text, storage medium and computer equipment
CN111931480B (en) * 2020-07-03 2023-07-18 北京新联财通咨询有限公司 Text main content determining method and device, storage medium and computer equipment
CN111737974A (en) * 2020-08-18 2020-10-02 北京擎盾信息科技有限公司 Semantic abstract representation method and device for statement
CN111737974B (en) * 2020-08-18 2020-12-04 北京擎盾信息科技有限公司 Semantic abstract representation method and device for statement
CN111986768A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Clinic query report generation method and device, electronic equipment and storage medium
CN111986768B (en) * 2020-09-03 2023-06-09 深圳平安智慧医健科技有限公司 Method and device for generating query report of clinic, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination