Detailed Description of Embodiments
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
As noted above, existing approaches that use machine learning or deep learning to perform semantic recognition on text usually require training on a large amount of sample data, and that training generally takes a long time. In addition, machine learning and deep learning models are relatively slow to run in real business scenarios, so when the processing volume is large it is difficult to respond in real time or near real time.
Therefore, the embodiments of the present application provide a text recognition method that does not depend on machine learning or deep learning and that recognizes the text to be recognized quickly and accurately.
Specifically, the method may use the architecture shown in Fig. 1a, which includes a user with a business need and a server with a text recognition function.
The user may interact with the server through a client (for example, an application client or a browser). The server may be a back-end server with a text recognition function operated by a business provider (for example, a website, a telecom operator, a bank, or a data center). In practice, the server may run business systems such as a search engine, an intelligent question-answering system, an intelligent customer service system, or an intelligent chat system, which provide the user with text-recognition-based services by executing the text recognition method of the present application. The architecture shown in Fig. 1a is a simplified one, given only to make the method easier to understand. In practice, the server may also be a cluster that provides text-recognition-based services to a large number of users. This does not limit the present application.
It should be noted that the text recognition method in the embodiments of the present application is applicable to recognition scenarios in different languages. The following description mainly uses Chinese text as an example; for text in other languages, refer to the description of the processing of Chinese text.
In addition, in the embodiments of the present application, the characters, words, phrases, short sentences, long sentences, and/or combinations thereof that a user inputs are referred to as the "text to be recognized". In one mode, the user types the text to be recognized into the corresponding client; in another mode, the text to be recognized is obtained by converting the user's speech input to text. This does not limit the present application.
Based on the architecture shown in Fig. 1a, the embodiments of the present application provide a text processing procedure, shown in Fig. 1b, that includes the following steps:
Step S101: Segment the text to be recognized to generate segmented texts.
In the embodiments of the present application, the text to be recognized may be segmented by a word segmentation tool (or algorithm). It can be understood that segmenting the text to be recognized yields several segmented texts (the segmented texts in this step can be understood as at least two segmented texts). A segmented text may be a word, a phrase, and/or a single character obtained by the segmentation.
For example, for the text to be recognized "last year's sales volume was very low", the segmented texts may include "last year", "sales volume", "very", and "low".
It should be noted here that, in practice, the text to be recognized may contain non-standard or colloquial terms. Moreover, existing segmentation methods usually segment the text based on their own dictionary, which may not include certain specific proprietary words or phrases. The segmented texts obtained may therefore include incorrectly segmented characters, words, or phrases. The following steps are executed to correct and optimize the segmented texts and thus complete the recognition of the text to be recognized.
Step S102: Splice the segmented texts according to the standard texts stored in a pre-established standard text library to generate spliced texts.
In the embodiments of the present application, the texts stored in the standard text library include at least standardized texts such as standard words and standard phrases that conform to natural-language grammar, and proprietary words and proprietary phrases related to the business.
In one implementation of the embodiments of the present application, the standard text library may be further divided into an inverted index library and a standard word-segmentation library. The inverted index library stores the standard text sets that correspond to all or some of the segmented texts. Each standard text set contains at least one standard text, and each standard text in the set contains the corresponding segmented text. The standard word-segmentation library stores the possible spliced combinations of different segmented texts. In practice, standard texts may be determined and obtained using techniques such as data mining and text analysis, optionally combined with manual entry by operators. This does not limit the present application.
The inverted index library, the standard word-segmentation library, and the splicing process are described in detail later and are not elaborated here.
It should be noted that, in the embodiments of the present application, every segmented text that has gone through the splicing process is referred to as a spliced text, whether or not it was actually spliced into a new text combination.
After splicing, some spliced texts match the standard texts closely, while others match them less well. To determine the suitable spliced texts, the following steps S103 and S104 are executed.
Step S103: Determine the matching characterization value between the spliced text and each of its corresponding standard texts.
As mentioned above, a segmented text may correspond to multiple standard texts, and after splicing a spliced text may still correspond to multiple standard texts. However, the degree to which the spliced text matches each of those standard texts differs. To quantify the difference, the embodiments of the present application determine the matching characterization value between the spliced text and each of its corresponding standard texts.
It should be noted that, in one feasible implementation of the embodiments of the present application, the matching characterization value may be calculated based on the edit distance. Specifically, the matching characterization value is:
matching characterization value = 1 - edit distance / max(length(spliced text), length(standard text))
where length denotes the character length of a text, and
edit distance / max(length(spliced text), length(standard text))
is the difference characterization value (that is, the degree of difference) between the spliced text and the standard text.
Of course, this does not limit the present application.
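The formula above can be sketched in Python as follows. This is a minimal illustration rather than the application's implementation: the function names `edit_distance`, `difference_value`, and `matching_value` are hypothetical, and the edit distance is assumed to be the ordinary Levenshtein distance counted in characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed definition of edit distance)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def difference_value(spliced: str, standard: str) -> float:
    """Edit distance divided by the length of the longer text."""
    longest = max(len(spliced), len(standard))
    return edit_distance(spliced, standard) / longest if longest else 0.0


def matching_value(spliced: str, standard: str) -> float:
    """Matching characterization value = 1 - difference characterization value."""
    return 1.0 - difference_value(spliced, standard)
```

Identical texts give a matching value of 1.0, and the value falls toward 0 as more character edits are needed to turn one text into the other.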
Step S104: According to the matching characterization values, select the standard text that matches the spliced text and generate a recognition result.
In one feasible implementation of the embodiments of the present application, the matching characterization values may be compared against a corresponding matching threshold, and a standard text whose matching characterization value is not less than the matching threshold is selected as the standard text that matches the spliced text. In other words, the semantics of the selected standard text are essentially the same as those of the corresponding spliced text. The spliced text is then converted into that standard text (in the embodiments of the present application, the conversion of a spliced text into a standard text is referred to as normalization), which completes the recognition of the text to be recognized and generates the recognition result.
It can be understood that the generated recognition result may be called by other business systems, which provide the corresponding business services based on it. This is not elaborated here.
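The threshold selection and normalization of step S104 might look like the following sketch. The function names are hypothetical, the scores and the threshold 0.6 in the tests are illustrative values, and choosing the single best-scoring standard text when several pass the threshold is an assumption that the description does not spell out.

```python
from typing import Dict, List


def select_matching_standard_texts(scores: Dict[str, float],
                                   threshold: float) -> List[str]:
    """Standard texts whose matching value is not less than the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [text for text, score in ranked if score >= threshold]


def normalize(spliced_text: str, scores: Dict[str, float],
              threshold: float) -> str:
    """Replace the spliced text by its best matching standard text, if any.

    Assumption: when no standard text reaches the threshold, the spliced
    text is returned unchanged.
    """
    matches = select_matching_standard_texts(scores, threshold)
    return matches[0] if matches else spliced_text
```

With the scores 0.67 and 0.33 for two candidate standard texts and a threshold of 0.6, only the first candidate is selected and the spliced text is normalized to it.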
Through the above steps, after a text to be recognized is received, it is segmented to generate several segmented texts. These segmented texts may include incorrectly segmented characters, words, or phrases, in which case they can be spliced according to the pre-established standard text library. The standard text library of the present application stores at least the standard segmented texts and the text sets composed of the standard texts that contain those standard segmented texts. During splicing, the matching characterization value between each spliced text and the different standard texts in the text set can then be determined, so that a suitable standard text can be selected on the basis of the matching characterization values and matched to the text to be recognized, thereby recognizing it.
Compared with prior-art recognition methods that rely on machine learning or deep learning, the text recognition method in the embodiments of the present application does not need to collect sample data for training, which eliminates the training and optimization process. In addition, splicing and matching the segmented texts against the standard text library improves the recognition accuracy for keywords and phrases in the text to be recognized and allows phrases to be normalized.
To explain the above text recognition method clearly, a detailed example follows. Specifically:
1. Inverted index library
The inverted index library in the embodiments of the present application may be a database, such as MySQL or HBase, or a file, such as a txt file or an Excel file. This does not limit the present application.
In practice, the inverted index library may be generated from a base text library. The base text library can be regarded as a database storing a large amount of natural-language text and business text; as mentioned above, business text may be the business provider's own business-specific words, phrases, and so on. The data in the inverted index library may come from the base text library. In the embodiments of the present application, it is not strictly necessary to create a separate database as the base text library: in practice, the business provider's existing back-end databases that store baseline business text and natural-language text may be used directly as the base text library. Such an implementation does not limit the present application.
In one feasible implementation, the inverted index library may take the form of a data table. Table 1 below shows the inverted index library in this form:
Table 1
As can be seen from Table 1, the standard segmented texts in the inverted index library may include characters, words, phrases, and/or sentences (the texts shown in Table 1 are words only, by way of example). Each standard segmented text stored in the inverted index library has a unique text label. It should be noted that the standard segmented texts shown in Table 1 may be obtained in advance by running a word segmentation tool on the standard texts (the same tool used above to segment the text to be recognized).
In the inverted index library, each item in a standard text set is expressed in the format "text, code, type". This format is only an example; a different format may be used in practice.
Specifically, "text" in this structure is the standard text described in the embodiments of the present application, which wholly or partly contains the corresponding standard segmented text. As shown in Table 1, the text "e-commerce-category refund amount" contains the standard segmented text "e-commerce", and the text "per-capita consumption" contains the standard segmented text "consumption".
"code" is the code of the text in the base text library; that is, the content of the text or related information can be found in the base text library by this code. For example, the text "e-commerce refund amount" can be found in the base text library by the code "KPI0001".
"type" is the text type of the standard text. In the embodiments of the present application, the text type may be a business object name, an attribute name, or an attribute value.
A business object can be regarded as an object involved in the business provider's various business services, including but not limited to service products, users, and business indicators of all kinds. Correspondingly, a business object name is the specific name of a business object, such as a service product name, a user name, or a business indicator name.
An attribute name can be regarded as the name of a business attribute of a business object. For example, the attribute names of the business object "store" may include store address and per-capita consumption, and the attribute names of the business object "user" may include gender and age.
An attribute value can be regarded as a value of such an attribute. For example, the values of "gender" include male and female.
To sum up with an example, in the text "What is the per-capita consumption of the KFC Hangzhou Wensan store", the text type of "KFC Hangzhou Wensan store" is business object name (that is, a store name), and the text type of "per-capita consumption" is attribute name.
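One way to picture the "text, code, type" entries and the index that points to them is the sketch below. It is an assumption-laden illustration: the class `StandardText`, the helper `lookup`, and the codes `KPI0002` and `ATTR0001` are hypothetical, the text type given for the refund entry is assumed, and only the code `KPI0001` and the example texts come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StandardText:
    text: str       # the standard text itself
    code: str       # code of the text in the base text library
    text_type: str  # business object name, attribute name, or attribute value


# Index: standard segmented text -> standard text set containing it.
inverted_index: Dict[str, List[StandardText]] = {
    "e-commerce": [
        StandardText("e-commerce refund amount", "KPI0001", "business object name"),
        # "KPI0002" is a made-up code used only for illustration.
        StandardText("e-commerce consumption amount", "KPI0002", "business object name"),
    ],
    # "ATTR0001" is likewise a made-up code.
    "female": [StandardText("female", "ATTR0001", "attribute value")],
}


def lookup(index: Dict[str, List[StandardText]], segment: str) -> List[StandardText]:
    """Return the standard text set for a segmented text, or an empty list on a miss."""
    return index.get(segment, [])
```

A miss simply returns an empty list, which corresponds to the case in which no index is found for a starting segmented text.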
2. Standard word-segmentation library
The standard word-segmentation library may likewise take the form of a relational database or table. Its content can be regarded as word combinations obtained by N-gram segmentation. In practice, the value of N may be set as needed and is not specifically limited here.
It should be noted that the content stored in the standard word-segmentation library is usually the segmentation result of the standard texts.
In one feasible implementation of the embodiments of the present application, the standard word-segmentation library may be as shown in Table 2 below (Table 2 is a bigram library):
Text label | Text | Segmentation mode
1 | e-commerce consumption | 2-gram
2 | refund amount | 2-gram
3 | consumption amount | 2-gram
4 | per-capita consumption | 2-gram
Table 2
The texts in Table 2, such as "refund amount" and "per-capita consumption", usually come from the standard texts described above and represent a standardized segmentation of the standard texts. In practice, the text to be recognized may contain a standard text, but segmentation by the word segmentation tool produces segmented texts such as "refund", "amount", "per capita", and "consumption". During splicing, the standard word-segmentation library shown in Table 2 serves as a reference for splicing the segmented texts and also as one of the decision conditions in the splicing process (explained later).
The segmentation mode in Table 2 is bigram (2-gram) segmentation. In practice, other modes may be used, such as 3-gram (trigram) or 4-gram segmentation.
Of course, Table 2 is only an example of the standard word-segmentation library and does not limit the present application.
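As a rough illustration of how N-gram combinations such as those in Table 2 can be derived from the ordered segments of a standard text, consider the sketch below. The function name `ngram_combinations` is hypothetical, and representing each combination as a tuple of segments (rather than a joined string) is a choice made here so that the example does not depend on the language's word separator.

```python
from typing import List, Sequence, Tuple


def ngram_combinations(segments: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """All runs of n adjacent segments, in their original order."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]
```

For the segments "e-commerce", "consumption", "amount" and n = 2, this yields the pairs ("e-commerce", "consumption") and ("consumption", "amount"), which correspond to rows 1 and 3 of Table 2.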
Therefore, the process of pre-establishing the standard text library may include:
obtaining standard texts and segmenting them to generate standard segmented texts;
using each standard segmented text as an index, collecting the standard texts that wholly or partly contain that index to form a standard text set, establishing the correspondence between the standard text set and the index, and building the inverted index library from that correspondence; and
according to the set N-gram mode, determining the text combinations among the standard segmented texts that conform to that mode, and building the standard word-segmentation library from those text combinations.
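The index-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_inverted_index` is a hypothetical name, the segmenter is passed in as a function, and whitespace splitting of English stand-in texts is used in place of a real word segmentation tool.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_inverted_index(standard_texts: Iterable[str],
                         segment: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Map each standard segmented text to the standard texts that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for text in standard_texts:
        # dict.fromkeys removes repeated segments while keeping their order.
        for seg in dict.fromkeys(segment(text)):
            index[seg].append(text)
    return dict(index)


# Stand-in data: English glosses segmented on whitespace (an assumption).
index = build_inverted_index(
    ["e-commerce refund amount", "e-commerce consumption amount", "per-capita consumption"],
    str.split,
)
```

Here the index "e-commerce" points to the two e-commerce standard texts, and the index "consumption" points to both "e-commerce consumption amount" and "per-capita consumption".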
In the embodiments of the present application, splicing the segmented texts specifically means splicing selected segmented texts according to the order in which the multiple segmented texts are arranged. This order is consistent with the order of the text in the text to be recognized.
Splicing the selected segmented texts specifically includes: selecting a starting segmented text and searching the inverted index library for an index corresponding to it. If such an index is found, the standard text set corresponding to the index is determined and splicing proceeds according to the standard texts in that set. If no such index is found, the next segmented text is selected as the new starting segmented text and splicing proceeds from it.
Splicing according to the standard texts in the standard text set specifically includes: selecting the adjacent segmented text that follows the starting segmented text and splicing it on cumulatively to generate a spliced text; determining the difference characterization value between the spliced text and each standard text in the corresponding standard text set; recording the spliced text according to the difference characterization value; and judging whether to stop splicing from the selected starting segmented text. If so, the spliced texts are screened and a new starting segmented text is selected; otherwise, the adjacent segmented text that follows the spliced text is selected and cumulative splicing continues.
Determining the difference characterization value between the spliced text and a standard text in the corresponding standard text set specifically means calculating it from the edit distance between the spliced text and that standard text.
Recording the spliced text specifically means recording it when it satisfies a set recording condition. The set recording condition is that the difference characterization value of the spliced text is less than both the current minimum difference characterization value and a set difference characterization threshold.
Judging whether to stop splicing from the selected starting segmented text specifically means: when the difference characterization value does not satisfy the set conditions, splicing from the selected starting segmented text stops; when it satisfies them, splicing from the selected starting segmented text continues.
The set conditions are:
the minimum difference characterization value has not increased for the set number of consecutive times; and
the minimum difference characterization value of the current spliced text is not greater than that of the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word-segmentation library.
With the inverted index library and the standard word-segmentation library described above, the recognition process is now illustrated by example:
For example, suppose the text to be recognized input by the user is "female users' e-commerce consumption volume sorted by constellation". Segmenting this text with an existing segmentation method yields the following segmented texts:
female | user | of | e-commerce | consumption | volume | by | constellation | sort
As can be seen, segmentation yields nine segmented texts, some of which were segmented incorrectly, so the nine segmented texts are spliced. Note that the segmented texts obtained are in order (that is, they form a segmented-text sequence whose order is consistent with that of the text to be recognized), so splicing also follows this sequence.
First, the first segmented text in the sequence, "female", is selected as the starting segmented text, and the inverted index library is searched for a corresponding index. Assuming Table 1 is the inverted index library, this segmented text hits the index "female" in Table 1. The standard text set corresponding to the index "female" can then be determined from Table 1; it contains only one standard text, "female". The segmented text matches the standard text exactly, so the matching characterization value is 1.0 (the difference characterization value is 0). The result of this splice may be recorded at this point, for example the starting segmented text, the standard text, the matching characterization value, and the text type. This does not limit the present application.
It should be noted that, in the embodiments of the present application, continuing to splice from the above starting segmented text requires that set splicing conditions be satisfied. The splicing conditions may include:
Condition 1: the minimum difference characterization value has not increased for the set number of consecutive times (for convenience, this set number is denoted "thresA"). thresA is typically 2. As a condition for deciding whether to continue splicing adjacent segmented texts, thresA also provides support for fuzzy matching.
Condition 2: the minimum difference characterization value of the current spliced text is not greater than that of the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word-segmentation library.
In any splice, both conditions must be satisfied for splicing to continue from the previously selected starting segmented text; otherwise, a new starting segmented text is selected and splicing restarts from it.
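The two splicing conditions can be expressed as a small predicate, sketched below. The function and parameter names are hypothetical, and Condition 1 is interpreted here as "the number of consecutive rounds in which the minimum difference value increased is still below thresA", which is one reading of the description rather than a confirmed definition.

```python
def should_continue(cur_min_diff: float,
                    prev_min_diff: float,
                    consecutive_increases: int,
                    next_pair_in_library: bool,
                    thres_a: int = 2) -> bool:
    """Return True if splicing may continue from the current starting segment."""
    # Condition 1 (assumed reading): fewer than thres_a consecutive increases so far.
    condition_one = consecutive_increases < thres_a
    # Condition 2: the minimum did not get worse, or the pair formed by the current
    # segment and the next one is in the standard word-segmentation library.
    condition_two = cur_min_diff <= prev_min_diff or next_pair_in_library
    return condition_one and condition_two
```

For instance, a round whose minimum difference value rises from 0 to 0.5 while the next pair is absent from the library fails Condition 2, so a new starting segmented text would be selected.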
Next, the following segmented text in the sequence is selected and spliced onto the starting segmented text; that is, "user" is spliced onto "female" to obtain the spliced text "female user". This spliced text still corresponds to the standard text set above but does not match the standard text "female" exactly; the difference characterization value between them is 0.5.
Since the difference characterization value 0.5 of the current spliced text "female user" is greater than the difference characterization value 0 of the previous spliced text "female", and the combination "female user" does not appear in Table 2, Condition 2 is not satisfied, so a new starting segmented text is selected.
In this example, the segmented text "user" is selected as the new starting segmented text and the inverted index library is searched for a matching index. There is no hit, so a new starting segmented text is selected again.
The segmented text "of" is then selected as the new starting segmented text. It also misses in the inverted index library, so a new starting segmented text is selected once more.
The segmented text "e-commerce" is now selected as the new starting segmented text, and it hits the index "e-commerce" in the inverted index library. The standard text set corresponding to the index "e-commerce" can then be determined from Table 1; it contains two standard texts, "e-commerce refund amount" and "e-commerce consumption amount". The starting segmented text "e-commerce" is taken as the spliced text, and its difference characterization value is calculated against each of the two standard texts. Both values are 4/6, that is, 0.67 (lengths and edit distances are counted in characters of the original Chinese text, in which each of these two standard texts is six characters long). The minimum difference characterization value of the current spliced text "e-commerce" is therefore 0.67. The result of this splice may likewise be recorded, which is not elaborated here.
The minimum difference characterization value 0.67 of this splice has not increased thresA consecutive times (Condition 1 is satisfied), and the combination of "e-commerce" and the next segmented text "consumption" is recorded in the standard word-segmentation library, that is, in Table 2 (Condition 2 is satisfied). Splicing therefore continues from "e-commerce" (the next round of splicing begins).
The segmented text "consumption", which follows "e-commerce", is selected and spliced on, giving the new spliced text "e-commerce consumption". This spliced text hits the index "e-commerce" in the inverted index library (note that when a spliced text consists of multiple segmented texts, the lookup in the inverted index library is based on the first segmented text in the spliced text), so the corresponding standard texts are still "e-commerce refund amount" and "e-commerce consumption amount". The difference characterization values are 4/6 (0.67) and 2/6 (0.33), respectively. Since 0.33 is less than the current minimum difference characterization value 0.67, the current minimum is updated to 0.33 (that is, the standard text "e-commerce consumption amount" is the better match for the current spliced text "e-commerce consumption"). The result of this splice is recorded.
The minimum difference characterization value of this splice fell from 0.67 to 0.33, so it has not increased thresA consecutive times (Condition 1 is satisfied), and the updated minimum 0.33 is less than the minimum 0.67 of the previous round (Condition 2 is satisfied). Splicing therefore continues from "e-commerce consumption" (the next round of splicing begins).
The segmented text "volume", which follows "consumption", is selected and spliced on cumulatively, giving the new spliced text "e-commerce consumption volume". This spliced text is processed as described above, and its difference characterization value against the standard text "e-commerce consumption amount" is determined to be 0.17. Accordingly, the current minimum difference characterization value is updated from 0.33 to 0.17, and this splicing result is recorded.
The subsequent process proceeds in the same way and is not described again here.
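The rounds walked through above can be reproduced with a short simulation. This sketch rests on several assumptions that should be read carefully: the Chinese strings are reconstructions inferred from the character counts in the example (for instance, six-character standard texts) and are not quoted from the description; the edit distance is assumed to be the Levenshtein distance; Condition 1 (thresA), result recording, and the choice of the next starting segment are omitted; and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed definition of edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def difference_value(a: str, b: str) -> float:
    return edit_distance(a, b) / max(len(a), len(b))


def splice_from(segments: List[str], start: int,
                index: Dict[str, List[str]],
                bigrams: Set[str]) -> List[Tuple[str, float]]:
    """Cumulatively splice from segments[start]; return (spliced text, min difference)."""
    candidates = index.get(segments[start], [])
    trace: List[Tuple[str, float]] = []
    if not candidates:          # miss in the inverted index library
        return trace
    spliced, prev_min = "", None
    for pos in range(start, len(segments)):
        spliced += segments[pos]
        cur_min = min(difference_value(spliced, s) for s in candidates)
        trace.append((spliced, round(cur_min, 2)))
        nxt = segments[pos + 1] if pos + 1 < len(segments) else None
        pair_in_library = nxt is not None and segments[pos] + nxt in bigrams
        not_worse = prev_min is None or cur_min <= prev_min
        if not (not_worse or pair_in_library):   # Condition 2 fails: stop here
            break
        prev_min = cur_min
    return trace


# Assumed reconstruction of the nine segments and of the library contents.
segments = ["女性", "用户", "的", "电商", "消费", "额", "按", "星座", "排序"]
index = {"女性": ["女性"], "电商": ["电商退款金额", "电商消费金额"]}
bigrams = {"电商消费", "退款金额", "消费金额", "人均消费"}
```

Calling `splice_from(segments, 3, index, bigrams)` yields minimum difference values of 0.67, 0.33, and 0.17 for the first three rounds, in line with the walkthrough, and stops after a fourth round in which the value rises again.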
It should be noted here that, in one feasible implementation, the splicing results in the above process may be recorded in a data table.
For example, the splicing result for the spliced text "female" may be as shown in Table 3 below.
Serial number | Spliced text | Standard text | Matching characterization value
1 | female | female | 1.0
Table 3
Splicing results generated later in the process may be recorded cumulatively in Table 3, for example as shown in Table 4 below.
Serial number | Spliced text | Standard text | Matching characterization value
1 | female | female | 1.0
2 | e-commerce | e-commerce consumption amount | 0.33
3 | e-commerce | e-commerce refund amount | 0.33
Table 4
It should be pointed out that, in the embodiments of the present application, not every splicing result obtained is recorded in the above table; the table is updated only when a splicing result satisfies a certain recording condition. Specifically, the recording condition may include:
difference characterization value of the spliced text < current minimum difference characterization value.
Considering that many splicing results may satisfy this condition in a practical scenario, the number of records would grow. To preserve processing efficiency, an additional condition may be added to the recording condition, namely: difference characterization value of the spliced text < set difference characterization threshold (for convenience, the set difference characterization threshold is denoted "thresB"). thresB is typically 0.93.
Under the action of additional additional conditions, a certain number of splicing results can be directly filtered out, are reduced meaningless
Operation.It should be understood, of course, that the value of thresB is higher, the splicing result directly filtered out is fewer (to have more splicing results may
It is recorded), recognition accuracy can increased to a certain degree, but reduce treatment effeciency;Conversely, the splicing directly filtered out
As a result more (less splicing result is recorded) are reducing recognition accuracy to a certain degree, but are able to ascend processing effect
Rate.That is, the value of thresB, can be set according to the needs of practical application (0.93 in this example, it is only a kind of
More excellent value).
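The record condition described above can be sketched as a simple predicate. This is only an illustrative sketch: the function name `should_record` and its parameter names are hypothetical, and it assumes the difference characterization value is a number on the same scale as thresB.

```python
THRES_B = 0.93  # set difference characterization threshold (value given in the text)


def should_record(diff_value, current_min_diff, thres_b=THRES_B):
    """Return True when a splicing result meets both record conditions.

    diff_value       -- difference characterization value of the splicing text
    current_min_diff -- current minimum difference characterization value
    thres_b          -- set difference characterization threshold (thresB)
    """
    below_current_min = diff_value < current_min_diff
    below_threshold = diff_value < thres_b
    return below_current_min and below_threshold


if __name__ == "__main__":
    # Smaller than the current minimum and below thresB: recorded.
    print(should_record(0.50, current_min_diff=0.67))
    # Smaller than the current minimum but not below thresB: filtered out.
    print(should_record(0.95, current_min_diff=1.00))
```

Raising `thres_b` lets more candidates pass the second test, which matches the trade-off described above between recognition accuracy and processing efficiency.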
Continuing this example, assume that after splicing, the splicing results obtained are as shown in Table 5 below.
Serial number | Splice text | Received text | Matching characterization value
1 | Women | Women | 1.0
2 | Constellation | Constellation | 1.0
3 | The electric business amount of consumption | Electric business spending amount | 0.83
4 | Electric business consumption | Electric business spending amount | 0.67
5 | Electric business consumption | Electric business refund amount | 0.33
6 | Electric business | Electric business spending amount | 0.33
Table 5
At this point, the splicing results in Table 5 are screened according to a matching characterization threshold (for ease of description, the matching characterization threshold is denoted "thresC" here). Assuming the value of thresC is 0.6, each received text whose matching characterization value is greater than 0.6 is selected for normalization.
The processed segmentation result is then obtained:
The electric business spending amount of female users, sorted by constellation
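The screening step above can be sketched as follows, using the rows of Table 5 as input. The function name `select_received_texts` is hypothetical. Keeping only the highest-scoring row for each received text is an assumption made here, so that "Electric business spending amount" appears once with 0.83; the text does not spell out how several rows for the same received text are merged.

```python
THRES_C = 0.6  # matching characterization threshold assumed in the example

# (splicing text, received text, matching characterization value), as in Table 5
TABLE_5 = [
    ("Women", "Women", 1.0),
    ("Constellation", "Constellation", 1.0),
    ("The electric business amount of consumption",
     "Electric business spending amount", 0.83),
    ("Electric business consumption", "Electric business spending amount", 0.67),
    ("Electric business consumption", "Electric business refund amount", 0.33),
    ("Electric business", "Electric business spending amount", 0.33),
]


def select_received_texts(rows, thres_c=THRES_C):
    """Keep received texts whose matching characterization value exceeds thres_c.

    Assumption: when several rows point at the same received text,
    only the highest value is kept.
    """
    best = {}
    for _splice_text, received_text, score in rows:
        if score > thres_c and score > best.get(received_text, 0.0):
            best[received_text] = score
    return best


selected = select_received_texts(TABLE_5)

if __name__ == "__main__":
    for received_text, score in selected.items():
        print(received_text, score)
```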
Of course, as one implementation in the embodiments of the present application, the text type of each corresponding key text can be marked when the recognition result is generated, for example:
{"name": "women", "type": "attribute value", "score": 1.0}
{"name": "constellation", "type": "attribute name", "score": 1.0}
{"name": "electric business spending amount", "type": "business object name", "score": 0.83}
Of course, the recognition result can take different forms in different embodiments; for example, inverse document frequency (Inverse Document Frequency, IDF) information can be added. This should not be construed as limiting the present application.
The above is the text recognition method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application also provide a text recognition device, as shown in Figure 2. The device includes:
a text segmentation module 201, which segments the text to be identified and generates segmentation texts;
a text splicing module 202, which splices the segmentation texts according to the received texts stored in a pre-established received text library and generates splicing texts; wherein the received text library stores at least the received text sets corresponding to some or all of the segmentation texts, and each received text set contains at least one received text;
a score value determining module 203, which determines the matching characterization value between the splicing text and the corresponding received text;
a result generation module 204, which selects, according to the matching characterization value, the received text that matches the splicing text, and generates a recognition result.
The received text library includes at least an inverted index library and a standard scores dictionary. The device further includes a text library creation module 205, which obtains received texts and segments them to generate standard segmentation texts; takes each standard segmentation text as an index and collects the received texts that contain, or partly contain, that index to form a received text set; establishes the correspondence between the received text set and the index, and builds the inverted index library based on that correspondence; and, according to a set participle grammar, determines the text combinations in the standard segmentation texts that conform to the participle grammar, and builds the standard scores dictionary based on those text combinations;
wherein the set participle grammar includes at least a bigram (2-gram) grammar and a trigram (3-gram) grammar.
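The library creation described for module 205 can be sketched as below. This is a minimal sketch under stated assumptions: the received texts are assumed to be already segmented (the segmentation shown is hypothetical), the function names `build_inverted_index` and `build_ngram_dictionary` are illustrative, and the "partly contains" case is not modelled.

```python
from collections import defaultdict


def build_inverted_index(segmented_received_texts):
    """Map each standard segmentation text (index) to its received text set."""
    index = defaultdict(set)
    for received_text, segments in segmented_received_texts.items():
        for segment in segments:
            index[segment].add(received_text)
    return dict(index)


def build_ngram_dictionary(segmented_received_texts, orders=(2, 3)):
    """Collect the bigram and trigram text combinations of adjacent segments."""
    combinations = set()
    for segments in segmented_received_texts.values():
        for n in orders:
            for i in range(len(segments) - n + 1):
                combinations.add(tuple(segments[i:i + n]))
    return combinations


# Hypothetical pre-segmented received texts, used only for illustration.
RECEIVED = {
    "electric business spending amount": ["electric business", "spending", "amount"],
    "electric business refund amount": ["electric business", "refund", "amount"],
}

inverted_index = build_inverted_index(RECEIVED)
ngram_dictionary = build_ngram_dictionary(RECEIVED)

if __name__ == "__main__":
    print(sorted(inverted_index["electric business"]))
    print(("electric business", "spending") in ngram_dictionary)
```

A set per index keeps the lookup of a starting segmentation text constant-time on average, which fits the stated goal of real-time or near-real-time recognition.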
The text splicing module 202 selects segmentation texts for splicing according to the order in which the multiple segmentation texts are arranged.
The order in which the multiple segmentation texts are arranged is consistent with the order of the text in the text to be identified.
The text splicing module 202 selects a starting segmentation text and searches the inverted index library for the index corresponding to the starting segmentation text;
if an index corresponding to the starting segmentation text is found, the received text set corresponding to that index is determined, and splicing is performed according to the received texts in the received text set;
if no index corresponding to the starting segmentation text is found, the next segmentation text is selected as the starting segmentation text, and splicing is performed according to the newly selected starting segmentation text.
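The selection of the starting segmentation text can be sketched as a scan over the segmentation texts in order. The function name `find_start`, the sample segments, and the dictionary-based inverted index are assumptions made for illustration only.

```python
def find_start(segments, inverted_index, begin=0):
    """Return (position, received text set) for the first segmentation text,
    at or after `begin`, that has an index in the inverted index library.

    Returns None when no remaining segmentation text has an index.
    """
    for position in range(begin, len(segments)):
        received_text_set = inverted_index.get(segments[position])
        if received_text_set:
            return position, received_text_set
    return None


# Hypothetical data, used only for illustration.
INDEX = {
    "electric business": {
        "electric business spending amount",
        "electric business refund amount",
    },
}
SEGMENTS = ["user", "of", "electric business", "spending"]

if __name__ == "__main__":
    # "user" and "of" have no index, so the scan moves on to the next segment.
    print(find_start(SEGMENTS, INDEX))
```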
The text splicing module 202 selects the adjacent segmentation text arranged after the starting segmentation text and performs cumulative splicing to generate a splicing text; determines the difference characterization value between the splicing text and the received texts in the corresponding received text set; records the splicing text according to the difference characterization value; and judges whether to stop splicing with the selected starting segmentation text;
if so, the splicing texts are screened and a starting segmentation text is selected again;
otherwise, the adjacent segmentation text arranged after the splicing text is selected, and cumulative splicing continues.
The text splicing module 202 calculates the difference characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.
The text splicing module 202 records the splicing text when the splicing text meets the set record condition.
The set record condition includes: the difference characterization value corresponding to the splicing text is less than both the current minimum difference characterization value and the set difference characterization threshold.
The text splicing module 202 determines to stop splicing with the selected starting segmentation text when the difference characterization value does not meet a set condition, and determines to continue splicing with the selected starting segmentation text when the difference characterization value meets the set condition.
The set condition includes: the minimum difference characterization value has not kept increasing for a set number of times, and the minimum difference characterization value corresponding to the current splicing text is not greater than the minimum difference characterization value corresponding to the previous splicing text; or, the text combination of the currently selected segmentation text and the next segmentation text exists in the standard scores dictionary.
The score value determining module 203 calculates the matching characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.
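An edit-distance-based matching characterization value can be sketched as below. The edit distance is the standard Levenshtein distance. The normalization, one minus the distance divided by the length of the longer text, is an assumption made for this sketch; the passage above only states that the value is calculated from the edit distance. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def matching_value(splice_text, received_text):
    """Assumed normalization: 1 - distance / length of the longer text."""
    longest = max(len(splice_text), len(received_text))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - edit_distance(splice_text, received_text) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(round(matching_value("ab", "abcdef"), 2))
```

Under this assumption, identical texts score 1.0 and the score falls toward 0 as more characters must be inserted, deleted, or substituted.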
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized with a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (such as a field programmable gate array (Field Programmable Gate Array, FPGA)) is such an integrated circuit, whose logic function is determined by the user programming the device. A designer programs a digital system to "integrate" it on a single PLD, without asking a chip manufacturer to design and make a dedicated integrated circuit chip. Moreover, nowadays, instead of making integrated circuit chips by hand, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development. The source code before compiling must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing the logical method flow can readily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices it contains for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules, or units described in the above embodiments can be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above device is described with its functions divided into various units. Of course, when implementing the present application, the functions of the units can be realized in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The above description is only an embodiment of the present application and is not intended to limit the present application. For those skilled in the art, various modifications and changes to the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.