CN106874294A - Information matching method and device - Google Patents

Info

Publication number
CN106874294A
Authority
CN
China
Prior art keywords
matched
information
state
dfa
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510922330.0A
Other languages
Chinese (zh)
Inventor
黎久希
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510922330.0A
Publication of CN106874294A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This application provides an information matching method and device. The method includes: receiving the information to be matched in the order of its characters; matching the information to be matched in a pre-built finite state machine (DFA), where the DFA is built from preset keywords, the output list of each state in the DFA includes the output list of that state's descendant failure state, and the descendant failure state of a state has the same character as that state; and, during matching, if the current state has no next state that matches the current character of the information to be matched, jumping to the failure state of the current state and continuing to match. With this technical solution, matching only needs to move step by step from one DFA state to the next and never returns to a previous state to match other keywords, so the information to be matched needs to be scanned only once, which greatly improves matching efficiency.

Description

Information matching method and device
Technical field
This application relates to the field of computer technology, and in particular to an information matching method and device.
Background
With the continuous development of information technology, more and more application scenarios require information matching, for example term matching in search engines, similarity matching of papers, and spell checking.
At present, information matching is usually implemented as keyword matching with the indexOf function of the Java language. That is, a file containing a number of keywords is created in advance, and the content that a user enters on the client is matched against each keyword in that file. If the pre-built file contains 100 keywords, 100 full-text scans are needed to complete the matching of the user's input. In this era of information explosion, this matching approach is very inefficient and cannot cope with scenarios that involve a large amount of information.
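The cost of the prior-art approach can be illustrated with a minimal Python sketch, in which `str.find` stands in for Java's `indexOf`. The function name `naive_match` and the sample keywords are hypothetical and are not taken from the application.

```python
def naive_match(text, keywords):
    """Scan the whole text once per keyword, as in the prior-art approach."""
    hits = []
    for kw in keywords:                     # one full-text scan for every keyword
        start = text.find(kw)
        while start != -1:
            hits.append((start, kw))        # record where the keyword begins
            start = text.find(kw, start + 1)
    return sorted(hits)


# Hypothetical sample input: three keywords mean three passes over the text
hits = naive_match("ushers", ["he", "she", "hers"])
```

With 100 keywords the outer loop runs 100 times, which is the repeated scanning that the application sets out to avoid.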
The deficiency of the prior art is:
Existing information matching approaches are inefficient and cannot cope with scenarios that involve a large amount of information.
Summary of the invention
The embodiments of this application propose an information matching method and device to solve the technical problem that the information matching approaches in the prior art are inefficient and cannot cope with scenarios that involve a large amount of information.
An embodiment of this application provides an information matching method, comprising the following steps:
receiving the information to be matched in the order of its characters;
matching the information to be matched in a pre-built finite state machine (DFA), where the DFA is built from preset keywords, the output list of each state in the DFA includes the output list of that state's descendant failure state, and the descendant failure state of a state has the same character as that state;
during matching, if the current state has no next state that matches the current character of the information to be matched, jumping to the failure state of the current state and continuing to match.
An embodiment of this application provides an information matching device, including:
a receiving module, configured to receive the information to be matched in the order of its characters;
a matching module, configured to match the information to be matched in a pre-built finite state machine (DFA), where the DFA is built from preset keywords, the output list of each state in the DFA includes the output list of that state's descendant failure state, and the descendant failure state of a state has the same character as that state; and, during matching, if the current state has no next state that matches the current character of the information to be matched, to jump to the failure state of the current state and continue matching.
The beneficial effects are as follows:
With the information matching method and device provided by the embodiments of this application, a finite state machine (DFA) is built in advance from the preset keywords and the information to be matched is matched in the DFA, so that keyword matching becomes a sequence of state-machine jumps. The output list of each state of the DFA includes the output list of that state's descendant failure state, and the DFA jumps to the failure state of the current state whenever the current state has no next state that matches the current character. Matching therefore only needs to move step by step from one DFA state to the next and never returns to a previous state to match other keywords, so the information to be matched needs to be scanned only once, which greatly improves matching efficiency.
Brief description of the drawings
Specific embodiments of this application are described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the information matching method in an embodiment of this application;
Fig. 2 is a schematic diagram of the DFA structure in an embodiment of this application;
Fig. 3 is a schematic diagram of the DFA at initialization in an embodiment of this application;
Fig. 4 is a schematic structural diagram of the DFA after it has been built in an embodiment of this application;
Fig. 5 is a schematic diagram of the DFA for group chat restriction in an embodiment of this application;
Fig. 6 is a schematic structural diagram of the information matching device in an embodiment of this application.
Detailed description
To make the technical solutions and advantages of this application clearer, exemplary embodiments of this application are described in more detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of this application, not an exhaustive list of all embodiments. Where no conflict arises, the embodiments in this description and the features in the embodiments may be combined with one another.
In view of the deficiencies of the prior art, the embodiments of this application propose an information matching method and device, which are described below.
Fig. 1 is a schematic flowchart of the information matching method in an embodiment of this application. As shown in the figure, the information matching method may include the following steps:
Step 101: receive the information to be matched in the order of its characters;
Step 102: match the information to be matched in a pre-built finite state machine (DFA, Deterministic Finite Automaton), where the DFA is built from preset keywords, the output list of each state in the DFA includes the output list of that state's descendant failure state, and the descendant failure state of a state has the same character as that state;
Step 103: during matching, if the current state has no next state that matches the current character of the information to be matched, jump to the failure state of the current state and continue matching.
In a specific implementation, the information to be matched may be an article in a web blog or a passage of text in a Word document. Depending on the timeliness requirements of the actual scenario, the information to be matched may be existing information or information that a user enters in real time. For example, it may be the title of a published blog post, or a search term that a user enters in a search engine in real time.
In addition, the information to be matched in the embodiments of this application may also be speech. That is, the voice information entered by a user can be received and the speech (or the text it is converted into) matched in the pre-built DFA, so that the content of the user's speech can be checked.
The pre-built DFA described in the embodiments of this application may be built from a number of preset keywords, and each character contained in a keyword may be a state node in the DFA. A state in the DFA that has the same character as a given state may serve as the descendant failure state of that state. The descendant failure state of each state may be set in advance when the DFA is built and initialized.
In the matching process, matching may start from the initial state of the DFA. If the keyword set of the current state contains the current character of the information to be matched, the DFA jumps to the next state along the jump path and starts reading the next character of the information to be matched. If the current state has no next state that matches the character, the DFA immediately jumps to the failure state of the current state and continues matching. If the output list of the current state is not empty, one or more keywords have been matched, and the output list of the current state is the set of all keywords newly matched this time.
With the information matching method provided by the embodiments of this application, the information to be matched is matched in a pre-built DFA in which the output list of each state includes the output list of that state's descendant failure state. During matching, if the current state has no next state that matches the current character, the DFA jumps to the failure state of the current state and continues matching. Matching therefore only needs to follow the jump conditions of each state step by step to the next state and never returns to a previous state to match other keywords, so the information to be matched needs to be scanned only once, which greatly improves matching efficiency.
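The three matching rules above (follow a jump path, fall back to the failure state, collect a non-empty output list) can be sketched in Python over a tiny hand-built automaton. The keywords "ab" and "bc", the state numbers, and the table names `GOTO`, `FAIL` and `OUTPUT` are illustrative assumptions rather than anything defined in the application; the loop follows the familiar Aho-Corasick style of matching, which the description resembles.

```python
# Hand-built tables for the hypothetical keywords "ab" and "bc".
# State 0 is the start state; state 2 spells "ab"; state 4 spells "bc".
GOTO = {0: {"a": 1, "b": 3}, 1: {"b": 2}, 2: {}, 3: {"c": 4}, 4: {}}
FAIL = {0: 0, 1: 0, 2: 3, 3: 0, 4: 0}      # state 2 ("ab") fails to state 3 ("b")
OUTPUT = {0: [], 1: [], 2: ["ab"], 3: [], 4: ["bc"]}


def scan(text):
    """Read each character exactly once and collect the matched keywords."""
    state, found = 0, []
    for ch in text:
        # No jump path for ch: follow failure states until one accepts it
        while ch not in GOTO[state] and state != 0:
            state = FAIL[state]
        state = GOTO[state].get(ch, 0)      # take the jump path, or stay at start
        found.extend(OUTPUT[state])         # a non-empty output list means a match
    return found


found = scan("abc")
```

On the input "abc" the scan reaches state 2 after "ab", finds no jump path for "c", falls back to state 3 and continues to state 4, without re-reading any earlier character.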
In an implementation, the method may further include:
outputting the set of keywords that match the information to be matched.
For example, the preset keywords may be "beautiful women suffer unhappy fates", "dissipated and luxurious", "wine and women-sensual pursuits", and so on (the keywords are given here in English translation; in the original Chinese each state corresponds to one character). The generated DFA may be as shown in Fig. 2, where a dotted line represents the association between a state and its failure state.
When the DFA is initialized, only the leaf nodes obtain an output list by default; the output lists of the non-leaf nodes are empty, that is:
The outputlist of H is {"beautiful women suffer unhappy fates"};
The outputlist of I is {"dissipated and luxurious"};
The outputlist of J is {"wine and women-sensual pursuits"};
When DFA initialization is complete, the association between each state and its descendant failure state can be built immediately. The descendant failure state of state A is E and the descendant failure state of state G is C, and each state obtains the output list of its descendant failure state.
Although A and G have descendant failure states in this embodiment, the output lists of those descendant failure states are also empty, so the output lists of A and G remain empty.
Assume the information to be matched is "watching the color and luster of red wine is one of the ways of tasting wine". Starting from start, "red" is matched and the DFA jumps to state node A. A has no next state that matches the next character, so the DFA immediately jumps to state E and continues matching from E. The next character matches "wine", so the DFA jumps to state node G. G has no next state that matches the next character, so the DFA immediately jumps to state C and continues matching from C. C has a next state node that matches "color", so the DFA jumps to node J and obtains the output list of J. The output set of keywords that match the information to be matched is therefore {"wine and women-sensual pursuits"}.
In the embodiments of this application, the set of keywords that match the information to be matched is output after matching ends, which can provide data support for subsequent operations such as statistics and analysis.
In an implementation, the method may further include:
determining the position of each matched keyword in the information to be matched;
calculating the importance of the information to be matched according to the keyword positions, the preset keyword weights, and the number of times each keyword is matched.
In a specific implementation, the position of each matched keyword in the information to be matched can be recorded during matching. Assume the information to be matched is an article with a title and a body. Whenever a keyword is matched, it can be recorded whether the keyword is in the title or in the body of the article. After matching ends, the number of matches of each keyword is counted (for example, "wine and women-sensual pursuits" appears once in the title and 12 times in the body) and the importance of the article is calculated.
Specifically, the importance of the information to be matched can be calculated using preset keyword grades or scores. For example, if the keyword "wine and women-sensual pursuits" is preset to score 0.8 when it appears in the title and 0.2 when it appears in the body, the importance of the article can be 0.8*1 + 0.2*12 = 3.2.
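The arithmetic in this example can be written as a short Python sketch. The weights 0.8 and 0.2 and the counts 1 and 12 come from the example above; the names `WEIGHTS` and `importance`, and the representation of match positions as a list of labels, are assumptions made for illustration.

```python
from collections import Counter

# Per-position scores for one keyword, taken from the example above
WEIGHTS = {"title": 0.8, "body": 0.2}


def importance(match_positions, weights=WEIGHTS):
    """Sum weight * match count over the positions where the keyword matched."""
    counts = Counter(match_positions)       # e.g. {"title": 1, "body": 12}
    return sum(weights[pos] * n for pos, n in counts.items())


# One match in the title and twelve in the body, as in the example
score = importance(["title"] + ["body"] * 12)
```

Because the weights are floating-point values, the result should be compared with 3.2 using a tolerance rather than exact equality.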
During matching, the embodiments of this application can determine the position of each matched keyword in the information to be matched and the number of times each keyword is matched, and combine these with the preset keyword weights to obtain the importance of the information to be matched.
In an implementation, after the information to be matched is received in the order of its characters and before it is matched in the finite state machine (DFA), the method may further include:
preprocessing the information to be matched.
In the embodiments of this application, when the information to be matched is received, some preprocessing operations can first be applied to it to make the subsequent matching in the finite state machine easier and to improve matching efficiency or accuracy.
In an implementation, the preprocessing operation may include at least one of the following: Chinese-English conversion, Chinese character-pinyin conversion, traditional-simplified Chinese conversion, full-width/half-width conversion, conversion between uppercase and lowercase numerals, and removal of interfering characters.
Interfering characters may be punctuation marks, HTML tags, and similar characters.
In a specific implementation, the same word can be written in several forms. To improve resistance to such variants, the embodiments of this application may apply Chinese-English conversion, Chinese character-pinyin conversion, traditional-simplified Chinese conversion, full-width/half-width conversion, numeral case conversion and so on to some or all of the characters in the information to be matched. For example, if the preset keywords include "wine and women-sensual pursuits" but the information to be matched contains "wine se", the embodiment can convert "se" to "color" and then perform matching.
In the embodiments of this application, after the information to be matched is received, preprocessing operations such as Chinese-English conversion, Chinese character-pinyin conversion, traditional-simplified Chinese conversion, full-width/half-width conversion, numeral case conversion and removal of interfering characters can be applied to it. This prevents matching failures caused by keyword variants, improves the matching success rate to some extent, and strengthens the DFA's resistance to variants.
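Part of this preprocessing can be sketched with the Python standard library. The sketch below removes HTML tags, folds full-width characters to half-width with Unicode NFKC normalization, drops punctuation, and applies a small variant table. The function name `preprocess`, the crude tag pattern, and the `VARIANTS` table are assumptions for illustration; pinyin and traditional-simplified conversion would need dictionaries that are not shown here.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")             # crude HTML tag pattern (assumption)
# Hypothetical variant table: maps a variant spelling to the keyword's form
VARIANTS = {"2": "two"}


def preprocess(text, variants=VARIANTS):
    text = TAG_RE.sub("", text)                     # remove HTML tags
    text = unicodedata.normalize("NFKC", text)      # full-width -> half-width
    # remove interfering punctuation (Unicode categories starting with "P")
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    for variant, canonical in variants.items():     # undo known variants
        text = text.replace(variant, canonical)
    return text


cleaned = preprocess("<b>２ goods</b>， you！")
```

The order matters in this sketch: tags are removed before punctuation so that tag delimiters are still intact when the pattern runs, and normalization comes before the variant table so that full-width digits are also covered.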
In an implementation, before the information to be matched is received in the order of its characters, the method may further include:
building a finite state machine (DFA) from the preset keywords, where at DFA initialization every leaf node obtains its own output list;
establishing the association between each state in the DFA and its descendant failure state, so that each state obtains the output list of its descendant failure state.
In the embodiments of this application, the finite state machine may be built in advance, and the received information to be matched is then matched in it.
In the embodiments of this application, the data structure of the finite state machine may include: all states that can be jumped to when matching succeeds (gotoMap), the next state to jump to when matching fails (failureState), and the keyword list (outputlist).
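One possible shape for this data structure is sketched below in Python. The three fields mirror gotoMap, failureState and outputlist from the description; the class name `State`, the snake_case field names, and the helper `add_keyword` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class State:
    # gotoMap: character -> state to jump to when matching succeeds
    goto_map: Dict[str, "State"] = field(default_factory=dict)
    # failureState: state to jump to when matching fails (set after building)
    failure_state: Optional["State"] = None
    # outputlist: keywords recognised on reaching this state
    output_list: List[str] = field(default_factory=list)


def add_keyword(start: State, keyword: str) -> None:
    """Insert one keyword; only the state for its last character gets output."""
    node = start
    for ch in keyword:
        node = node.goto_map.setdefault(ch, State())
    node.output_list.append(keyword)


# Hypothetical keywords used only to exercise the structure
start = State()
for kw in ("he", "hers"):
    add_keyword(start, kw)
```

At this point the failure states are still unset, which corresponds to the initialized DFA before the associations between states and their failure states are established.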
In a specific implementation, failureState may be built as follows:
First, a linked list (LinkedList) may be created;
The DFA is searched from top to bottom, looking for nodes that contain information similar to each state in it.
If the search finds a node that contains information similar to a state, that node can be added to the linked list as a descendant failureState of the state, and for each state in the DFA the output list (outputlist) of its failureState is added to the state's own outputlist.
If the search finds no node that contains information similar to a state, the failureState of that state can be set to startState by default, so that matching restarts from the initial state.
During the search, the loop can be exited when the failureState is the start state, so that the search does not fall into an endless loop.
After this process, all descendant failureStates of each state have been added to the linked list, where the closer a list pointer is to the head node, the closer the information contained in its corresponding state is to that of the head node.
failureState can be set at DFA initialization based on the associations among the sensitive words themselves, and the reconstruction of the DFA can be completed by breadth-first search.
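A minimal Python sketch of such a breadth-first construction follows. It uses integer state ids, list-based tables and a `deque` as the search queue; these choices, the function name `build_dfa`, and the sample keywords are assumptions, and the failure-link rule is the standard Aho-Corasick one rather than a literal transcription of the linked-list procedure above.

```python
from collections import deque


def build_dfa(keywords):
    goto_map = [{}]         # goto_map[s][ch] -> next state; state 0 is the start
    output = [[]]           # output[s] -> keywords recognised on reaching s
    for kw in keywords:     # initialization: only keyword-final states get output
        s = 0
        for ch in kw:
            if ch not in goto_map[s]:
                goto_map.append({})
                output.append([])
                goto_map[s][ch] = len(goto_map) - 1
            s = goto_map[s][ch]
        output[s].append(kw)

    failure = [0] * len(goto_map)           # default failure state is the start
    queue = deque(goto_map[0].values())     # breadth-first, starting at depth 1
    while queue:
        s = queue.popleft()
        for ch, nxt in goto_map[s].items():
            f = failure[s]
            # stop at the start state so the search cannot loop forever
            while f != 0 and ch not in goto_map[f]:
                f = failure[f]
            failure[nxt] = goto_map[f].get(ch, 0)
            # merge the failure state's output list into the state's own
            output[nxt] = output[nxt] + output[failure[nxt]]
            queue.append(nxt)
    return goto_map, failure, output


# Hypothetical keywords; state 5 spells "she" and state 2 spells "he"
goto_map, failure, output = build_dfa(["he", "she", "hers"])
```

Processing states in breadth-first order guarantees that a state's failure state, which is always shallower, already has its merged output list when the state itself is visited.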
In the embodiments of this application, when matching cannot continue, the DFA jumps to the failureState and continues matching, which ensures that matching is completed in a single scan of the whole text: matching only moves forward and never goes back.
The matching process in the embodiments of this application may be as follows:
First, initialization is performed when matching starts: the current state curState may be set to the initial state startState, and the variable i is given the initial value 0;
Matching proceeds from i = 0. If the DFA was built with a loop, the matching operation is exited; if the DFA was built without a loop, matching can start from the initial state and continue until i equals the text length text.length.
During matching, if the keyword set of the current state contains the current character of the text, the DFA jumps to the next state along the jump path and starts reading the next character of the text.
If the output list of the current state is not empty, this round of matching has matched one or more keywords. The output list (outputlist) is exactly the set of all keywords newly matched in this round, and the keyword match information of this round is recorded, which can include the keyword itself and the matched position.
In the embodiments of this application, matching can start from the initial state startState. If the keyword set of the current state contains the current character of the information to be matched, the DFA jumps to the next state along the jump path and starts reading the next character of the information to be matched. If the initial state always points to the failure state of the initial state, a "DFA build has a loop" warning is raised.
Matching continues with jumps of this kind. If, on reaching some state (that is, on reading some character of the information to be matched), the output list of that state is not empty, then one or more keywords have been matched; the output list of that state is the set of all keywords obtained in this round of matching, and this round of matching ends. The keywords obtained in this round and their positions are recorded.
In a specific implementation, matching can start again from startState in the DFA, beginning with the character after the current character of the information to be matched.
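The recording of keyword positions and the loop warning can be sketched as follows in Python. The sketch keeps scanning after a match instead of restarting from startState, which is a simplification of the round-based procedure described above; the function name, the table layout, the hop-count guard and the sample keywords "ab" and "b" are all assumptions for illustration.

```python
def match_with_positions(text, goto_map, failure, output, start=0):
    """Return (start_index, keyword) pairs found in one left-to-right scan."""
    state, hits = start, []
    for i, ch in enumerate(text):
        hops = 0
        while ch not in goto_map[state] and state != start:
            state = failure[state]
            hops += 1
            if hops > len(goto_map):        # failure states should never cycle
                raise RuntimeError("DFA build has a loop")
        state = goto_map[state].get(ch, start)
        for kw in output[state]:            # non-empty output list: record matches
            hits.append((i - len(kw) + 1, kw))
    return hits


# Hand-built tables for the hypothetical keywords "ab" and "b":
# state 2 spells "ab" and already includes the output of its failure state 3
goto_map = [{"a": 1, "b": 3}, {"b": 2}, {}, {}]
failure = [0, 0, 3, 0]
output = [[], [], ["ab", "b"], ["b"]]
hits = match_with_positions("xabb", goto_map, failure, output)
```

The index `i` only ever increases, so each character of the text is read once, and the recorded start positions can later feed the importance calculation.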
With the information matching method provided by the embodiments of this application, a finite state machine can first be built from the preset keywords, with each state in the finite state machine obtaining the output list of its descendant failure state. The information to be matched is then received and preprocessed, and the preprocessed information is matched starting from the initial state of the finite state machine. When matching fails at a state, the machine jumps directly to the descendant failure state of that state and continues matching, without returning to any state before it, until it reaches a state that has an output list; the keyword set in that output list is output and this round of matching ends. If unmatched characters remain after the current character of the information to be matched, a new round of matching is started from the initial state of the state machine for the characters after the current character.
To facilitate implementation of this application, examples are given below.
Embodiment one
Taking the application scenario of filtering sensitive words in text as an example, the description is as follows:
Sensitive words generally refer to words with a sensitive political orientation (for example, a tendency against the ruling party), violent tendencies, unhealthy overtones, or uncivil language. They can also be special prohibited words that an enterprise defines for internal use according to its own situation.
Assume the sensitive words are: "very evil", "very strange", "evilness", "evil", and "heresy" (English renderings of the original Chinese keywords, as they appear in the output lists below).
Fig. 3 shows a schematic diagram of the DFA at initialization when these 5 sensitive words are used as keywords in an embodiment of this application. As shown in the figure, at initialization every leaf node of the DFA obtains an outputlist by default, and the outputlist of every other node is empty.
The outputlist of each leaf node is as follows:
The outputlist of state C is {"evil"};
The outputlist of E is {"evilness"};
The outputlist of G is {"very strange"};
The outputlist of H is {"very evil"};
The outputlist of I is {"heresy"}.
When DFA initialization is complete, the failureState of each state can be built immediately, and each state can obtain the outputlist of its descendant failureState.
Fig. 4 shows a schematic structural diagram of the DFA after it has been built in an embodiment of this application. In Fig. 4, dotted lines represent the relationship between a state and its failureState. As shown in the figure, the descendant failureState of state D is B, and the descendant failureStates of state F are E and C.
Because the outputlist of state C is {"evil"} and C is the descendant failureState of E, the outputlist of E becomes {"evilness", "evil"};
Because the outputlist of E has become {"evilness", "evil"} and E is the descendant failureState of F, the outputlist of F becomes {"evilness", "evil"}.
Although B is the descendant failureState of D, state B is a non-leaf node whose outputlist is empty, so the outputlist of D remains empty.
Assume the texts to be matched are the following (given in word-for-word English translation; in the traces below, each quoted gloss stands for one character of the original Chinese text):
1) "this thing is really very evil", followed by a character glossed as "exhale"
2) "this thing is really very", followed by characters glossed as "heresy", "property" and "exhale"
When the whole of text 1) is matched, the character "very" matches state A. The next level below A has a state that matches "heresy" (state D), so the DFA jumps to D. The next level below D has a state that matches "evil" (state F), so the DFA jumps to F. At F, no next-level state matches "exhale", so matching cannot continue along this path; the DFA immediately jumps to E and directly obtains the outputlist of E, {"evilness", "evil"}. The keywords output by the match are "evilness" and "evil".
When the whole of text 2) is matched, the character "very" matches state A. The next level below A has a state that matches "heresy" (state D), so the DFA jumps to D. At D, no next-level state matches "property", so matching cannot continue along this path; the DFA immediately jumps to B and continues from B. The next level below B has a state that matches "property" (state I), so the DFA jumps to I and obtains the outputlist of I, {"heresy"}. The keyword set output by the match is {"heresy"}.
In a concrete application, scores can be preset for the keyword "heresy" appearing in the title or in the body of a document, or different grades can be preset for the keyword appearing in the first paragraph, the middle, or the end of the body, so that the importance of the document can be determined.
Further, all the information to be matched can be sorted by the calculated importance and a filtering threshold can be set; if the importance of a document exceeds the preset threshold, the document is blocked.
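The ranking and threshold step can be sketched in a few lines of Python. The function name `documents_to_block`, the dictionary of scores, and the sample document ids and threshold are hypothetical values chosen for illustration.

```python
def documents_to_block(scores, threshold):
    """Rank documents by importance and keep those above the filter threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score > threshold]


# Hypothetical importance scores keyed by document id
blocked = documents_to_block({"doc-a": 3.2, "doc-b": 0.4, "doc-c": 5.0}, 1.0)
```

Sorting first means the blocked list is already ordered from most to least important, which may be convenient for later review.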
In summary, with the information matching method provided by the embodiments of this application, matching of the text to be matched always moves forward and never goes back, and a single scan of the full text completes the matching.
With the information matching method provided by the embodiments of this application, when the test samples are blog texts of several thousand words and the sensitive-word lexicon contains several thousand entries, the average matching time in testing is on the order of milliseconds, which is far more efficient than existing matching algorithms.
Embodiment two,
By taking Internet chat civilization term this application scenarios as an example, it is described as follows:
When actually used, in order to ensure the term civilization of group member, group keeper can in advance by restrictive word Language automatically generates the DFA as provided in the embodiment of the present application as keyword by system, when there is member defeated Shielded when entering the speech with preset keyword.
Fig. 5 shows the DFA schematic diagrames of group chatting limitation in the embodiment of the present application, as illustrated, false If the word that default limitation member uses is:Two goods, brainless, intention, rolling, wherein F's Failurestate is C.
Assuming that group group member A is input into, " 2 goods, whether you lack intention", system is input into according to A Order receives the information to be matched, and (i.e. " 2 goods, whether you lack intention"), then the words is entered Row pretreatment operation, the punctuation mark in the words is removed, obtain the information to be matched for " 2 goods you Whether intention is lacked ", small letter therein numeral " 2 " is further then converted into word figure " two ", Matching is proceeded by from the start of state machine.
State node A matches the first character of the information to be matched ("two"), so the machine jumps from start to A. State node E then matches the second character ("goods"), so the machine jumps from A to E. The third character is "you". E has no next state node that matches "you" and no failurestate of its own, so the outputlist of E (the keyword "two goods") is output, the failurestate of E defaults to start, and a new round of matching begins.
Starting again from start, the characters "you", "yes", "no" and "yes" match nothing. The seventh character of the information to be matched, "lack", matches state node B, so the machine jumps from start to B. The eighth character, "heart", matches state node F, so the machine jumps from B to F. After the jump to F, F has no next state that matches the ninth character, "eye", but F does have a failurestate (namely C), so the machine jumps directly from F to C and continues matching. The next state of C, I, matches the ninth character "eye", so the outputlist of I (the keyword "intention", formed by the characters "heart" and "eye") is output.
This completes the matching of the information to be matched. The final set of output keywords is {"two goods", "intention"}.
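The walkthrough above can be traced with a small hand-wired table. This is only a sketch under assumptions: the state names follow Fig. 5 as described in the text, the Chinese strings ("二货", "心眼", and the input sentence) are a reconstruction of the translated example rather than text confirmed by the source, and the states belonging to "brainless" and "rolling" are left out because the passage does not name them. The function name `match` and the three tables are hypothetical.

```python
# Hand-wired fragment of the Fig. 5 automaton (assumed reconstruction).
# GOTO maps a state and an input character to the next state.
GOTO = {
    "start": {"二": "A", "缺": "B", "心": "C"},
    "A": {"货": "E"},
    "B": {"心": "F"},
    "C": {"眼": "I"},
}
# Only F has an explicit failure state; every other state defaults to start.
FAIL = {"F": "C"}
# Output lists of the states that complete a keyword.
OUTPUT = {"E": ["二货"], "I": ["心眼"]}


def match(text):
    """Return the keywords found in text, in the order they are completed."""
    state = "start"
    found = []
    for ch in text:
        # No transition on ch: fall back through failure states.
        while state != "start" and ch not in GOTO.get(state, {}):
            state = FAIL.get(state, "start")
        state = GOTO.get(state, {}).get(ch, "start")
        # Emit the output list of the state just entered.
        found.extend(OUTPUT.get(state, []))
    return found


print(match("二货你是不是缺心眼"))
```

Unlike the narrative, which outputs E's list when the third character fails to match, this sketch emits a state's output list as soon as the state is entered; the resulting keyword set is the same.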
Because the message entered by group member A contains the two keywords above, member A's message may be blocked, or a warning may additionally be issued, in order to keep relations among group members harmonious.
With the scheme provided by the embodiments of the present application, each round of matching only needs to jump step by step from the start node to the next node, without jumping back to an earlier node to rematch partway through. This improves matching efficiency and meets the timeliness requirement of chat detection. In addition, the information to be matched can be preprocessed to raise the matching success rate and to prevent matching failures caused by keyword variants.
Based on the same inventive concept, the embodiments of the present application further provide an information matching device. Because the principle by which the device solves the problem is similar to that of the information matching method, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Fig. 6 is a schematic structural diagram of the information matching device in an embodiment of the present application. As illustrated, the information matching device may include:
a receiving module 601, configured to receive the information to be matched in the character order of the information to be matched;
a matching module 602, configured to match the information to be matched in a pre-built finite state machine (DFA), where the DFA is built from preset keywords, the output list of each state in the DFA includes the output lists of that state's descendant failure states, and a descendant failure state of a state has the same character as that state; during matching, if the current state has no next state that matches the character of the information to be matched, the machine jumps to the failure state of the current state and continues matching.
In implementation, the device may further include:
an output module 603, configured to output the set of keywords matched in the information to be matched.
In implementation, the device may further include:
a determining module 604, configured to determine the position of each matched keyword in the information to be matched;
a computing module 605, configured to calculate the importance degree of the information to be matched according to the keyword positions, the preset keyword weights and the number of times each keyword is matched.
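The passage names the three inputs of the importance calculation (keyword position, preset weight, number of matches) but gives no formula here. The sketch below is therefore purely illustrative: the position factor, the function names `keyword_positions` and `importance`, and the use of `str.find` in place of the DFA are all assumptions.

```python
def keyword_positions(text, keywords):
    """Return (keyword, start_index) pairs for every occurrence, sorted by position.

    str.find is a simple stand-in here; the device would take the
    positions from the DFA matching pass instead.
    """
    hits = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            hits.append((kw, start))
            start = text.find(kw, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


def importance(text, weights):
    """Hypothetical score: each occurrence adds weight * position factor.

    Assumption: earlier occurrences count more, with the factor falling
    linearly from 1.0 at the start of the text towards 0.5 at the end.
    Repeated matches add up, so the match count is reflected in the sum.
    """
    if not text:
        return 0.0
    score = 0.0
    for kw, pos in keyword_positions(text, weights):
        position_factor = 1.0 - 0.5 * pos / len(text)
        score += weights[kw] * position_factor
    return score
```

A caller would pass a dictionary of preset weights, for example `importance(message, {"keyword": 2.0})`, and compare the result against a threshold of its own choosing.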
In implementation, the device may further include:
a preprocessing module 606, configured to perform a preprocessing operation on the information to be matched after the information to be matched is received in its character order and before it is matched in the finite state machine (DFA).
In implementation, the preprocessing module 606 may specifically be configured to perform, after the information to be matched is received in its character order and before it is matched in the finite state machine (DFA), at least one of the following operations on the information to be matched: Chinese-English conversion, conversion between Chinese characters and pinyin, conversion between traditional and simplified characters, full-width/half-width conversion, conversion between numeral forms, and removal of interfering characters.
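Three of the listed operations can be sketched with the Python standard library alone. This is a minimal illustration, not the module's actual logic: the function name `preprocess` and the digit table are assumptions, NFKC normalization is used as a convenient way to fold full-width forms (it also folds other compatibility characters), and pinyin or traditional/simplified conversion is omitted because it needs mapping tables that the standard library does not provide.

```python
import unicodedata

# Assumed mapping from ASCII digits to Chinese numerals.
DIGIT_TO_HANZI = str.maketrans("0123456789", "零一二三四五六七八九")


def preprocess(text):
    """Normalize a message before it is fed to the matcher."""
    # Full-width to half-width: NFKC folds full-width digits, letters
    # and punctuation into their ordinary forms.
    text = unicodedata.normalize("NFKC", text)
    # Numeral conversion: write digits as Chinese numerals.
    text = text.translate(DIGIT_TO_HANZI)
    # Interfering characters: drop punctuation (P*) and separators (Z*).
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in "PZ"
    )
```

Applied to the example sentence (again in an assumed Chinese reconstruction), `preprocess("2货,你是不是缺心眼?")` yields the nine-character string that the Fig. 5 walkthrough starts from.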
In implementation, the device may further include:
a state machine modeling module 607, configured to build the finite state machine (DFA) from the preset keywords before the information to be matched is received in its character order, where all leaf nodes obtain their respective output lists when the DFA is initialized; and to establish the association between each state in the DFA and its descendant failure states, so that each state obtains the output lists of its descendant failure states.
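The construction described for module 607 resembles the classic Aho-Corasick build: insert the keywords into a trie, then set failure links breadth-first and merge output lists along them. The sketch below uses that standard technique as a stand-in; the source does not confirm that its construction is identical, and the name `build_dfa` and the list-based state layout are assumptions.

```python
from collections import deque


def build_dfa(keywords):
    """Build goto, failure and output tables; state 0 is the start state."""
    goto, fail, output = [{}], [0], [[]]

    # Step 1: insert each keyword; the state that ends a keyword
    # receives that keyword in its own output list.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(kw)

    # Step 2: breadth-first pass. Children of start keep failure state 0;
    # deeper states fall back to the longest proper suffix in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # The failure state was reached on the same character, so
            # its output list is merged into this state's list.
            output[nxt] = output[nxt] + output[fail[nxt]]
            queue.append(nxt)
    return goto, fail, output
```

With two keywords such that one is a suffix of the other, the state that ends the longer keyword gets the state that ends the shorter one as its failure state and inherits its output list, which mirrors the relationship between F and C in Fig. 5.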
For convenience of description, the parts of the device described above are described separately as modules or units divided by function. Of course, when the application is implemented, the functions of the modules or units may be realized in one or more pieces of software or hardware.
Those skilled in the art should understand that the embodiments of the application may be provided as a method, a system or a computer program product. The application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical memory) that contain computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the application have been described, those skilled in the art may make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the application.

Claims (12)

1. An information matching method, characterized by comprising the following steps:
receiving information to be matched in the character order of the information to be matched;
matching the information to be matched in a pre-built finite state machine (DFA), wherein the DFA is built from preset keywords, the output list of each state in the DFA includes the output lists of the descendant failure states of that state, and a descendant failure state of a state has the same character as that state;
during the matching, if the current state has no next state that matches the character of the information to be matched, jumping to the failure state of the current state and continuing the matching.
2. The method as claimed in claim 1, characterized by further comprising:
outputting the set of keywords matched in the information to be matched.
3. The method as claimed in claim 2, characterized by further comprising:
determining the position of each matched keyword in the information to be matched;
calculating the importance degree of the information to be matched according to the keyword positions, the preset keyword weights and the number of times each keyword is matched.
4. The method as claimed in claim 1, characterized in that, after the information to be matched is received in its character order and before the information to be matched is matched in the finite state machine (DFA), the method further comprises:
performing a preprocessing operation on the information to be matched.
5. The method as claimed in claim 4, characterized in that the preprocessing operation comprises at least one of the following:
Chinese-English conversion, conversion between Chinese characters and pinyin, conversion between traditional and simplified characters, full-width/half-width conversion, conversion between numeral forms, and removal of interfering characters.
6. The method as claimed in claim 1, characterized in that, before the information to be matched is received in its character order, the method further comprises:
building the finite state machine (DFA) from the preset keywords, wherein all leaf nodes obtain their respective output lists when the DFA is initialized;
establishing the association between each state in the DFA and its descendant failure states, so that each state obtains the output lists of its descendant failure states.
7. An information matching device, characterized by comprising:
a receiving module, configured to receive information to be matched in the character order of the information to be matched;
a matching module, configured to match the information to be matched in a pre-built finite state machine (DFA), wherein the DFA is built from preset keywords, the output list of each state in the DFA includes the output lists of the descendant failure states of that state, and a descendant failure state of a state has the same character as that state; during the matching, if the current state has no next state that matches the character of the information to be matched, the matching module jumps to the failure state of the current state and continues the matching.
8. The device as claimed in claim 7, characterized by further comprising:
an output module, configured to output the set of keywords matched in the information to be matched.
9. The device as claimed in claim 8, characterized by further comprising:
a determining module, configured to determine the position of each matched keyword in the information to be matched;
a computing module, configured to calculate the importance degree of the information to be matched according to the keyword positions, the preset keyword weights and the number of times each keyword is matched.
10. The device as claimed in claim 7, characterized by further comprising:
a preprocessing module, configured to perform a preprocessing operation on the information to be matched after the information to be matched is received in its character order and before it is matched in the finite state machine (DFA).
11. The device as claimed in claim 10, characterized in that the preprocessing module is specifically configured to perform, after the information to be matched is received in its character order and before it is matched in the finite state machine (DFA), at least one of the following operations on the information to be matched:
Chinese-English conversion, conversion between Chinese characters and pinyin, conversion between traditional and simplified characters, full-width/half-width conversion, conversion between numeral forms, and removal of interfering characters.
12. The device as claimed in claim 7, characterized by further comprising:
a state machine modeling module, configured to build the finite state machine (DFA) from the preset keywords before the information to be matched is received in its character order, wherein all leaf nodes obtain their respective output lists when the DFA is initialized; and to establish the association between each state in the DFA and its descendant failure states, so that each state obtains the output lists of its descendant failure states.
CN201510922330.0A 2015-12-14 2015-12-14 A kind of information matching method and device Pending CN106874294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510922330.0A CN106874294A (en) 2015-12-14 2015-12-14 A kind of information matching method and device

Publications (1)

Publication Number Publication Date
CN106874294A true CN106874294A (en) 2017-06-20

Family

ID=59178068

Country Status (1)

Country Link
CN (1) CN106874294A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944121A (en) * 2010-09-20 2011-01-12 北京星网锐捷网络技术有限公司 Generating method of finite state automaton, keyword matching method and devices and equipment
CN102207979A (en) * 2011-06-30 2011-10-05 北京新媒传信科技有限公司 Sensitive word matching method and system
CN102682090A (en) * 2012-04-26 2012-09-19 焦点科技股份有限公司 System and method for matching and processing sensitive words on basis of polymerized word tree
CN102799600A (en) * 2012-04-10 2012-11-28 成都网安科技发展有限公司 Multi-mode matching algorithm and system based on encoding association

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133052A (en) * 2018-01-18 2018-06-08 广州汇智通信技术有限公司 A kind of searching method of multiple key, system, medium and equipment
CN108564966A (en) * 2018-02-02 2018-09-21 安克创新科技股份有限公司 The method and its equipment of tone testing, the device with store function
CN109492218A (en) * 2018-10-22 2019-03-19 浪潮软件股份有限公司 A kind of synonym quick replacement method based on determining finite-state machine
CN109492218B (en) * 2018-10-22 2023-03-21 浪潮软件股份有限公司 Synonym quick replacement method based on finite state machine determination
CN109670972A (en) * 2018-12-11 2019-04-23 泰康保险集团股份有限公司 Core protects method, apparatus, electronic equipment, storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20200922
Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands
Applicant after: Innovative advanced technology Co.,Ltd.
Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands
Applicant before: Advanced innovation technology Co.,Ltd.
Effective date of registration: 20200922
Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands
Applicant after: Advanced innovation technology Co.,Ltd.
Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands
Applicant before: Alibaba Group Holding Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20170620