CN108875743A

CN108875743A - Text recognition method and device

Info

Publication number: CN108875743A
Application number: CN201710337521.XA
Authority: CN
Inventors: 王凯; 毛仁歆
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2017-05-15
Filing date: 2017-05-15
Publication date: 2018-11-23
Anticipated expiration: 2037-05-15
Also published as: CN108875743B

Abstract

The embodiments of the present application disclose a text recognition method and device. The method includes: segmenting a text to be recognized to generate segmented texts; splicing the segmented texts according to standard texts stored in a pre-established standard text library to generate spliced texts, wherein the standard text library stores at least standard text sets corresponding to some or all of the segmented texts, and each standard text set contains at least one standard text; determining a matching characterization value between each spliced text and the corresponding standard texts; and, according to the matching characterization value, selecting the standard text that matches the spliced text to generate a recognition result. With this method, no sample data needs to be collected for training, so the training and optimization process is avoided. In addition, splicing and matching the segmented texts against the standard text library improves the accuracy with which keywords and phrases in the text to be recognized are recognized.

Description

Text recognition method and device

Technical field

The present application relates to the field of computer technology, and in particular to a text recognition method and device.

Background

With the development of information technology, application scenarios that require semantic recognition of text are increasing, for example intelligent question answering, intelligent customer service and search engines. Semantic recognition has become an active research topic.

Existing semantic recognition techniques mostly rely on machine learning or deep learning. However, both require repeated training and optimization on large amounts of sample data to improve the accuracy of the recognition model, which is a cumbersome process. Moreover, when machine learning or deep learning is applied in real business scenarios, its running time is usually long.

In other words, how to recognize a text to be recognized efficiently and with less complexity has become a problem to be solved urgently.

Summary of the invention

The embodiments of the present application provide a text recognition method and device to address the shortcomings of existing text recognition based on machine learning or deep learning.

A text recognition method provided by the embodiments of the present application includes:

segmenting a text to be recognized to generate segmented texts;

splicing the segmented texts according to standard texts stored in a pre-established standard text library to generate spliced texts, wherein the standard text library stores at least standard text sets corresponding to some or all of the segmented texts, and each standard text set contains at least one standard text;

determining a matching characterization value between each spliced text and the corresponding standard texts;

according to the matching characterization value, selecting the standard text that matches the spliced text to generate a recognition result.

A text recognition device provided by the embodiments of the present application includes:

a text segmentation module, which segments a text to be recognized to generate segmented texts;

a text splicing module, which splices the segmented texts according to standard texts stored in a pre-established standard text library to generate spliced texts, wherein the standard text library stores at least standard text sets corresponding to some or all of the segmented texts, and each standard text set contains at least one standard text;

a score determining module, which determines a matching characterization value between each spliced text and the corresponding standard texts;

a result generation module, which, according to the matching characterization value, selects the standard text that matches the spliced text to generate a recognition result.

At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects:

After a text to be recognized is received, it can be segmented to generate several segmented texts. These segmented texts may include characters, words or phrases that were not segmented correctly, in which case the segmented texts can be spliced according to the pre-established standard text library. The standard text library of the present application stores at least standard segmented texts and the text sets composed of the standard texts that contain those standard segmented texts. During splicing, the matching characterization value between each spliced text and the different standard texts in the text set can therefore be determined, and, based on the matching characterization value, a suitable standard text can be selected and matched to the text to be recognized, which completes the recognition of the text to be recognized.

Compared with prior-art recognition that relies on machine learning or deep learning, the text recognition method in the embodiments of the present application does not need to collect sample data for training, so the training and optimization process is avoided. In addition, splicing and matching the segmented texts against the standard text library improves the accuracy with which keywords and phrases in the text to be recognized are recognized, and allows phrases to be normalized.

Brief description of the drawings

The drawings described here are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1a is a schematic diagram of the architecture on which the text recognition method provided by the embodiments of the present application is based;

Fig. 1b shows the text processing procedure provided by the embodiments of the present application;

Fig. 2 is a schematic structural diagram of the text recognition device provided by the embodiments of the present application.

Detailed description of the embodiments

To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application, without creative effort, fall within the scope of protection of the present application.

As stated above, existing semantic recognition of text by machine learning or deep learning usually requires training on a large amount of sample data, and training on sample data generally takes a long time. Machine learning and deep learning are also time-consuming to run in real business scenarios, so when the processing volume is large it is difficult to achieve real-time or near-real-time recognition responses.

Therefore, the embodiments of the present application provide a text recognition method that does not depend on machine learning or deep learning and that recognizes the text to be recognized quickly and accurately.

Specifically, the method can use the architecture shown in Fig. 1a, which includes a user with business needs and a server with a text recognition function.

The user can interact with the server through a client (for example an application client or a browser). The server can be a back-end server with a text recognition function belonging to a business provider (for example a website, a telecom operator, a bank or a data center). In practice, the server may run a business system such as a search engine, an intelligent question answering system, an intelligent customer service system or an intelligent chat system, and, by executing the text recognition method of the present application, provide users with services based on text recognition. The architecture shown in Fig. 1a is a simplified architecture given only to make the method in the embodiments easier to understand. In practical applications the server may also be a cluster that provides text-recognition-based services to a large number of users. This does not constitute a limitation on the present application.

It should be noted that the text recognition method in the embodiments of the present application is applicable to recognition scenarios in different languages. The following description mainly uses Chinese text as an example, and the processing of text in other languages can refer to the description of the processing of Chinese text.

In addition, in the embodiments of the present application, the characters, words, phrases, short sentences, long sentences and/or combinations thereof input by the user are referred to as the "text to be recognized". In one mode, the text to be recognized is typed by the user through the corresponding client. In another mode, it is the text obtained by converting the user's voice input. This does not constitute a limitation on the present application.

Based on the architecture shown in Fig. 1a, the embodiments of the present application provide a text processing procedure, shown in Fig. 1b, which includes the following steps:

Step S101: segment the text to be recognized to generate segmented texts.

In the embodiments of the present application, the text to be recognized can be segmented by a word segmentation tool (or algorithm). After segmentation, several segmented texts are obtained (the segmented texts in this step can be understood as at least two segmented texts). The segmented texts may include the words, phrases and/or single characters obtained after segmentation.

For example, for the text to be recognized "last year's sales were very low", the segmented texts may include "last year", "sales", "very" and "low".

It should be noted that, in practical applications, the text to be recognized may contain non-standard or colloquial terms, and existing segmentation methods usually segment the text to be recognized based on their own dictionaries, which may not include certain specific proper words or phrases. The segmented texts obtained after segmentation may therefore include characters, words or phrases that were segmented incorrectly. The following steps are executed to correct and optimize the segmented texts and finally complete the recognition of the text to be recognized.

Step S102: splice the segmented texts according to the standard texts stored in the pre-established standard text library to generate spliced texts.

In the embodiments of the present application, the texts stored in the standard text library include at least standardized texts such as standard words and standard phrases that conform to natural-language grammar, and proper words and proper phrases related to the business.

In one implementation of the embodiments of the present application, the standard text library can be further divided into an inverted index library (Inverted Index) and a standard word-segmentation library. The inverted index library stores the standard text sets corresponding to all or some of the segmented texts. Each standard text set contains at least one standard text, and each standard text in the set contains the segmented text. The standard word-segmentation library stores the possible splicing combinations between different segmented texts. In actual operation, the standard texts can be determined and obtained by techniques such as data mining and text analysis, optionally combined with manual entry by operators. This does not constitute a limitation on the present application.

The inverted index library, the standard word-segmentation library and the splicing process are described in detail later and are not elaborated here.

It should be noted that, in the embodiments of the present application, every segmented text that has gone through the splicing process is called a spliced text, regardless of whether it was actually spliced into a new text combination.

After splicing, some of the resulting spliced texts match the standard texts closely and others match them less closely. To determine the suitable spliced texts, the following steps S103 and S104 are executed.

Step S103: determine the matching characterization value between the spliced text and all of the corresponding standard texts.

As described above, a segmented text may correspond to multiple standard texts, and after splicing the spliced text may still correspond to multiple standard texts. The degree to which the spliced text matches each of these standard texts differs. To quantify the difference, the embodiments of the present application determine the matching characterization value between the spliced text and all of the corresponding standard texts.

It should be noted that, in one feasible implementation of the embodiments of the present application, the matching characterization value can be calculated from the edit distance. Specifically, the matching characterization value is:

matching characterization value = 1 - edit distance / max(length(spliced text), length(standard text))

where length denotes the character length of a text, and

edit distance / max(length(spliced text), length(standard text)) denotes the difference characterization value between the spliced text and the standard text (that is, the degree of difference between the two).

This does not constitute a limitation on the present application.
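The formula above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: the function names `edit_distance`, `difference_value` and `matching_value` are chosen here for readability, the edit distance is assumed to be the Levenshtein distance, and the handling of two empty strings is an added assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def difference_value(spliced: str, standard: str) -> float:
    """edit distance / max(length(spliced text), length(standard text))."""
    longest = max(len(spliced), len(standard))
    if longest == 0:
        return 0.0  # assumption: two empty texts are treated as identical
    return edit_distance(spliced, standard) / longest


def matching_value(spliced: str, standard: str) -> float:
    """1 - difference value, so 1.0 means an exact match."""
    return 1.0 - difference_value(spliced, standard)


print(matching_value("kitten", "sitting"))
```

Both values lie in the range 0 to 1, because the edit distance between two strings never exceeds the length of the longer string.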

Step S104: according to the matching characterization value, select the standard text that matches the spliced text and generate a recognition result.

In one feasible implementation of the embodiments of the present application, the matching characterization value can be compared with a corresponding matching threshold, and a standard text whose matching characterization value is not less than the matching threshold is selected as the standard text that matches the spliced text. In other words, the semantics of the selected standard text are essentially the same as those of the corresponding spliced text. Further, the spliced text is converted into that standard text (in the embodiments of the present application, converting a spliced text into a standard text is called normalization), which completes the recognition of the text to be recognized and generates the recognition result.
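The threshold-based selection can be illustrated with the short sketch below. The text only requires that the matching characterization value be not less than the threshold. Returning the single highest-scoring candidate, the function name `select_standard_text` and the threshold of 0.6 are assumptions made for this example.

```python
from typing import Dict, Optional


def select_standard_text(values: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the standard text whose matching value is >= threshold.

    `values` maps each candidate standard text to its matching
    characterization value for one spliced text. If several candidates
    pass the threshold, the highest-scoring one is returned (an assumption
    of this sketch). None is returned when no candidate passes.
    """
    eligible = {text: v for text, v in values.items() if v >= threshold}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Hypothetical matching values for one spliced text.
candidates = {
    "e-commerce refund amount": 0.33,
    "e-commerce spending amount": 0.67,
}
normalized = select_standard_text(candidates, threshold=0.6)
print(normalized)
```

The returned standard text is what the spliced text would be normalized to. A result of None means that no standard text matched closely enough.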

It will be appreciated that the generated recognition result can be called by other business systems, which provide corresponding business services based on it. This is not elaborated here.

Through the above steps, after a text to be recognized is received, it can be segmented to generate several segmented texts. These segmented texts may include characters, words or phrases that were not segmented correctly, in which case the segmented texts can be spliced according to the pre-established standard text library. The standard text library of the present application stores at least standard segmented texts and the text sets composed of the standard texts that contain those standard segmented texts. During splicing, the matching characterization value between each spliced text and the different standard texts in the text set can therefore be determined, and, based on the matching characterization value, a suitable standard text can be selected and matched to the text to be recognized, which completes the recognition of the text to be recognized.

To illustrate the above text recognition method clearly, it is described in detail below with examples.

Specifically:

One, inverted index library

The inverted index library in the embodiments of the present application can be a database, for example MySQL or HBase, or a file, for example a txt file or an Excel file. This does not constitute a limitation on the present application.

In practical applications, the inverted index library can be generated from a base text library. The base text library can be regarded as a database that stores a large number of natural-language texts and business texts. As described above, business texts can be the business provider's own business-specific words, phrases and so on. The data in the inverted index library can originate from the base text library. In the embodiments of the present application it is not strictly necessary to create a separate database as the base text library: in practice, the business provider's back-end databases that store standard business texts and natural-language texts can be used directly as the base text library. Such an implementation does not constitute a limitation on the present application.

In one feasible implementation, the inverted index library can take the form of a data table. Table 1 below shows the inverted index library in this form:

Table 1

As can be seen in Table 1, the standard segmented texts in the inverted index library may include characters, words, phrases and/or sentences (the texts shown in Table 1 use words only as an example). Each standard segmented text stored in the inverted index library has a unique text label. It should be noted that the standard segmented texts shown in Table 1 can be obtained in advance by running a word segmentation tool on the standard texts (the same word segmentation tool that is used, as described above, to segment the text to be recognized).

In the inverted index library, each item in a standard text set is represented in the format "text, code, type". This format is only an example, and different format structures can be used in practical applications.

Specifically, "text" in this structure is the standard text described in the embodiments of the present application. The standard text contains, or partially contains, the corresponding standard segmented text. As shown in Table 1, the text "e-commerce category refund amount" contains the standard segmented text "e-commerce", and the text "per-capita consumption" contains the standard segmented text "consumption".

"code" denotes the code of the text in the base text library. That is, the text content or related information corresponding to the code can be found in the base text library through the code. For example, the text "e-commerce refund amount" can be found in the base text library by the code "KPI0001".

"type" denotes the text type of the standard text. In the embodiments of the present application, the text types may include business object name, attribute name and attribute value.

A business object can be regarded as an object involved in the different business services of the business provider, including but not limited to service products, users and various business indicators. Correspondingly, a business object name is the specific name of a business object, for example a service product name, a user name or a business indicator name.

An attribute name can be regarded as the name of a business attribute of a business object. For example, for the business object "shop", the attribute names may include the shop address and per-capita consumption. For the business object "user", the attribute names may include gender and age.

An attribute value can be regarded as a value corresponding to an attribute name. For example, the values of "gender" include male and female.

To sum up with an example: given "what is the per-capita consumption of the KFC Hangzhou Wensan store", the text type of "KFC Hangzhou Wensan store" is business object name (that is, a shop name), and the text type of "per-capita consumption" is attribute name.
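The "text, code, type" entry and the mapping from a standard segmented text to its standard text set could be modeled as in the sketch below. The class and field names are illustrative only. The code "ATTR0001" is hypothetical, and the type assigned to "e-commerce refund amount" is an assumption based on the statement that business indicators are business objects.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class TextType(Enum):
    BUSINESS_OBJECT_NAME = "business object name"
    ATTRIBUTE_NAME = "attribute name"
    ATTRIBUTE_VALUE = "attribute value"


@dataclass(frozen=True)
class StandardText:
    text: str            # the standard text itself
    code: str            # code of the text in the base text library
    text_type: TextType  # business object name, attribute name or attribute value


# Inverted index: standard segmented text -> standard text set.
inverted_index: Dict[str, List[StandardText]] = {
    "e-commerce": [
        # Type assumed here: a business indicator treated as a business object.
        StandardText("e-commerce refund amount", "KPI0001",
                     TextType.BUSINESS_OBJECT_NAME),
    ],
    "consumption": [
        # "ATTR0001" is a hypothetical code used only for illustration.
        StandardText("per-capita consumption", "ATTR0001",
                     TextType.ATTRIBUTE_NAME),
    ],
}

for entry in inverted_index["consumption"]:
    print(entry.text, entry.code, entry.text_type.value)
```

Keeping the code alongside the text lets a caller look up the full record in the base text library once a standard text has been selected.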

Two, standard word-segmentation library

The standard word-segmentation library can likewise take the form of a database table. Its content can be regarded as word combinations obtained by N-gram word segmentation. In practice, the value of N can be set as the application requires and is not specifically limited here.

It should be noted that the content stored in the standard word-segmentation library is usually the word segmentation result of the standard texts.

In one feasible implementation of the embodiments of the present application, the standard word-segmentation library can be as shown in Table 2 below (Table 2 is a 2-gram library):

Text label	Text	Segmentation mode
1	E-commerce consumption	2-gram
2	Refund amount	2-gram
3	Spending amount	2-gram
4	Per-capita consumption	2-gram

Table 2

The texts in Table 2, such as "refund amount" and "per-capita consumption", usually originate from the standard texts described above and represent the standardized segmentation of the standard texts. In practice, the text to be recognized may contain a standard text, but after segmentation by the word segmentation tool it is broken into segmented texts such as "refund", "amount", "per-capita" and "consumption". During splicing, a standard word-segmentation library such as the one in Table 2 can then provide a reference for splicing the segmented texts, and also serves as one of the judgment conditions in the splicing process (explained later).

The segmentation mode in Table 2 is 2-gram segmentation. In practical applications other N-gram segmentation modes can also be used, such as 3-gram or 4-gram.

Table 2 is only an example of the standard word-segmentation library and does not constitute a limitation on the present application.
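As an illustration of how N-gram combinations such as those in Table 2 can be derived from a sequence of segments, consider the sketch below. The function name `ngram_combinations` and the English segments are assumptions made for readability. The combinations are kept as tuples rather than joined strings.

```python
from typing import List, Sequence, Tuple


def ngram_combinations(segments: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n adjacent segments, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


# Illustrative English rendering of the segments of one standard text.
segments = ["e-commerce", "consumption", "amount"]
print(ngram_combinations(segments, n=2))
print(ngram_combinations(segments, n=3))
```

With n set to 2, each adjacent pair of segments becomes one candidate entry of the library. A sequence shorter than n simply yields no combinations.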

Therefore, the process of pre-establishing the standard text library may include: obtaining standard texts and segmenting them to generate standard segmented texts;

taking each standard segmented text as an index, collecting the standard texts that contain or partially contain the index to form a standard text set, establishing the correspondence between the standard text set and the index, and establishing the inverted index library based on that correspondence;

according to the set segmentation grammar, determining among the standard segmented texts the text combinations that conform to the segmentation grammar, and establishing the standard word-segmentation library based on those text combinations.
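The construction of the inverted index library described above can be sketched as follows. This is a simplified illustration: the segmentation function is supplied by the caller (whitespace splitting stands in for a real word segmentation tool), the English standard texts are illustrative, and only exact containment of a segment is indexed, not partial containment.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_inverted_index(
    standard_texts: Iterable[str],
    segment: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each standard segmented text to the standard texts containing it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for text in standard_texts:
        # dict.fromkeys removes repeated segments while keeping their order,
        # so a standard text is listed at most once per index entry.
        for seg in dict.fromkeys(segment(text)):
            index[seg].append(text)
    return dict(index)


standard_texts = [
    "e-commerce refund amount",
    "e-commerce spending amount",
    "per-capita consumption",
]
# str.split is a stand-in for the word segmentation tool.
index = build_inverted_index(standard_texts, str.split)
print(index["e-commerce"])
```

The standard word-segmentation library could be built in the same pass by collecting the adjacent segment combinations of each standard text under the chosen N-gram setting.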

In the embodiments of the present application, splicing the segmented texts specifically means: splicing selected segmented texts according to the order of the multiple segmented texts.

The order of the multiple segmented texts is consistent with the order of the text in the text to be recognized.

Splicing selected segmented texts specifically means: selecting a starting segmented text and looking up the index corresponding to it in the inverted index library. If an index corresponding to the starting segmented text is found, the standard text set corresponding to that index is determined, and splicing is performed according to the standard texts in the standard text set. If no index corresponding to the starting segmented text is found, the next segmented text is selected as the starting segmented text, and splicing is performed from the newly selected starting segmented text.
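The lookup of a starting segmented text, including the fall-through to the next segment on a miss, can be sketched as below. The function name `next_start`, the plain-dictionary index and the English segments are illustrative assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def next_start(
    segments: Sequence[str],
    inverted_index: Dict[str, List[str]],
    start: int = 0,
) -> Optional[Tuple[int, List[str]]]:
    """Find the first segment at or after `start` that hits an index.

    Returns the position of that segment and its standard text set,
    or None when no remaining segment hits the inverted index.
    """
    for pos in range(start, len(segments)):
        standard_set = inverted_index.get(segments[pos])
        if standard_set:
            return pos, standard_set
    return None


# Illustrative data only.
segments = ["female", "user", "of", "e-commerce", "consumption", "amount"]
inverted_index = {
    "female": ["female"],
    "e-commerce": ["e-commerce refund amount", "e-commerce spending amount"],
}
print(next_start(segments, inverted_index, start=0))
print(next_start(segments, inverted_index, start=1))
```

Starting from position 1, the segments "user" and "of" miss the index, so the search moves on until "e-commerce" is hit at position 3.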

Splicing according to the standard texts in the standard text set specifically means: selecting the adjacent segmented text that follows the starting segmented text and splicing it on cumulatively to generate a spliced text; determining the difference characterization value between the spliced text and the standard texts in the corresponding standard text set; recording the spliced text according to the difference characterization value; and judging whether to stop splicing from the selected starting segmented text. If so, the spliced texts are screened and a new starting segmented text is selected. Otherwise, the adjacent segmented text that follows the spliced text is selected and cumulative splicing continues.

Determining the difference characterization value between the spliced text and the standard texts in the corresponding standard text set specifically means: calculating the difference characterization value from the edit distance between the spliced text and the standard texts in the corresponding standard text set.

Recording the spliced text specifically means: recording the spliced text when it satisfies a set recording condition.

The set recording condition includes: the difference characterization value corresponding to the spliced text is less than both the current minimum difference characterization value and the set difference characterization threshold.

Judging whether to stop splicing from the selected starting segmented text specifically means: when the difference characterization value does not satisfy the set conditions, splicing from the selected starting segmented text is stopped; when the difference characterization value satisfies the set conditions, splicing from the selected starting segmented text continues.

The set conditions include:

the minimum difference characterization value has not increased continuously for a set number of times, and

the minimum difference characterization value corresponding to the current spliced text is not greater than the minimum difference characterization value corresponding to the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word-segmentation library.

The recognition process is now illustrated using the inverted index library and the standard word-segmentation library described above.

For example, suppose the text to be recognized input by the user is "the e-commerce spending amount of female users, sorted by zodiac sign". After this text is segmented with an existing segmentation method, the following segmented texts are obtained:

female / user / 的 (possessive particle) / e-commerce / consumption / amount / by / zodiac sign / sort

As can be seen, 9 segmented texts are obtained after segmentation. Some of these 9 segmented texts were segmented incorrectly, so they are spliced. Note that the segmented texts obtained after segmentation are ordered (that is, they form a segmented text sequence whose order is consistent with the order of the text to be recognized), so splicing is also based on this segmented text sequence.

First, the first segmented text in the sequence (the starting segmented text), "female", is selected, and the index corresponding to it is looked up in the inverted index library. Assuming Table 1 is the inverted index library, this segmented text hits the index "female" in Table 1. The standard text set corresponding to the index "female" can then be determined from Table 1. It contains only one standard text: "female". The segmented text matches the standard text exactly, so the matching characterization value is 1.0 (the difference characterization value is 0). At this point the result of this round of splicing can be recorded, for example the starting segmented text, the standard text, the matching characterization value and the text type of this round. This does not constitute a limitation on the present application.

It should be noted that, in the embodiments of the present application, continuing to splice from the above starting segmented text requires that the set splicing conditions be satisfied. The splicing conditions may include:

Condition one: the minimum difference characterization value has not increased continuously for a set number of times (for convenience, this set number is denoted "thresA"). The value of thresA is usually 2. thresA serves as one condition for deciding whether to continue splicing adjacent segmented texts, and its value also provides support for fuzzy matching.

Condition two: the minimum difference characterization value corresponding to the current spliced text is not greater than the minimum difference characterization value corresponding to the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word-segmentation library.

In any round of splicing, both conditions must be satisfied for splicing to continue from the previously selected starting segmented text. Otherwise, a new starting segmented text is selected and splicing starts from it.
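The two conditions above can be expressed as a single check, sketched below under stated assumptions. Condition one is read as "the minimum difference characterization value has risen fewer than thresA times in a row". In the first round, where there is no previous spliced text, the comparison in condition two is treated as satisfied. The function name `should_continue`, the tuple-based lexicon and the sample values are all illustrative.

```python
from typing import Sequence, Set, Tuple


def should_continue(
    min_diff_history: Sequence[float],
    next_pair: Tuple[str, str],
    lexicon: Set[Tuple[str, str]],
    thres_a: int = 2,
) -> bool:
    """Decide whether to keep splicing from the current starting segment.

    min_diff_history: minimum difference value of each round so far,
        oldest first, for the current starting segment.
    next_pair: (currently selected segment, following segment).
    lexicon: adjacent segment pairs from the standard word-segmentation library.
    """
    # Condition one: count the run of consecutive rises at the end of the history.
    rises = 0
    for prev, curr in zip(min_diff_history, min_diff_history[1:]):
        rises = rises + 1 if curr > prev else 0
    condition_one = rises < thres_a

    # Condition two: no worse than the previous round, or the pair is known.
    no_worse = (len(min_diff_history) < 2
                or min_diff_history[-1] <= min_diff_history[-2])
    condition_two = no_worse or next_pair in lexicon

    return condition_one and condition_two


lexicon = {("e-commerce", "consumption"), ("consumption", "amount")}
print(should_continue([0.2, 0.4], ("consumption", "amount"), lexicon))
print(should_continue([0.2, 0.4], ("user", "of"), lexicon))
```

In the first call the difference value rose once, but the pair is in the lexicon, so splicing continues. In the second call neither part of condition two holds, so a new starting segmented text would be selected.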

Next, following the segmentation text sequence, the next segmentation text is selected and spliced onto the starting segmentation text; that is, "user" is selected and spliced with the previously selected "women" to obtain the splicing text "female user". This splicing text still corresponds to the above received text set, but it cannot be matched exactly with the received text "women"; the difference characterization value between the two is 0.5.

Clearly, since the difference characterization value 0.5 of the current splicing text "female user" is greater than the difference characterization value 0 of the previous splicing text "women", condition two above is not satisfied, and a starting segmentation text is therefore selected again.

In this example, the segmentation text "user" is selected as the new starting segmentation text, and a matching index is looked up in the inverted index library. The lookup misses, so a starting segmentation text is selected again.

The segmentation text " " is then selected as the new starting segmentation text. It likewise misses in the inverted index library, so a starting segmentation text is once more selected again.

At this point, the segmentation text "electric business" is selected as the new starting segmentation text, and it hits the index "electric business" in the inverted index library. The received text set corresponding to the index "electric business" can then be determined; in Table 1, this received text set contains two received texts: "electric business refund amount" and "electric business spending amount". The starting segmentation text "electric business" can be taken as the splicing text, and its difference characterization values with respect to the received texts "electric business refund amount" and "electric business spending amount" are calculated separately. Both values are 4/6, i.e., 0.67. The minimum difference characterization value corresponding to the current splicing text "electric business" is therefore 0.67. Similarly, the splicing result obtained in this round can be recorded, which is not repeated here.
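
The figures in this example are consistent with a difference characterization value equal to the edit distance divided by the length of the longer text, and with a matching characterization value equal to one minus that difference. The application only states that both values are calculated from the edit distance, so this normalization is an inference, and the function names below are hypothetical. Because the counts are per character of the original-language text, the sketch uses one letter per character: "AB" stands in for "electric business", "ABXYEF" for "electric business refund amount", and "ABCDEF" for "electric business spending amount".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def difference_value(splice: str, received: str) -> float:
    """Edit distance normalized by the longer length (assumed normalization)."""
    return edit_distance(splice, received) / max(len(splice), len(received), 1)


def matching_value(splice: str, received: str) -> float:
    """Complement of the difference value (assumed relationship)."""
    return 1.0 - difference_value(splice, received)


# Placeholder strings: one letter per original-language character.
print(round(difference_value("AB", "ABXYEF"), 2))  # 4 edits over 6 characters
print(round(difference_value("AB", "ABCDEF"), 2))  # 4 edits over 6 characters
```

Both calls evaluate 4/6, matching the value 0.67 given for the splicing text "electric business".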

The minimum difference characterization value 0.67 of this splicing result has not increased continuously thresA times (condition one is met), and "electric business" together with its next segmentation text "consumption" is recorded in the standard scores dictionary, i.e., in Table 2 (condition two is met). Splicing therefore continues based on "electric business" (the next round of splicing begins).

The segmentation text "consumption", which follows "electric business", is selected and spliced, giving the new splicing text "electric business consumption". This splicing text hits the index "electric business" in the inverted index library (note that when a splicing text consists of multiple segmentation texts, the lookup in the inverted index library is based on the first segmentation text in the splicing text), and the two corresponding received texts are still "electric business refund amount" and "electric business spending amount". The calculated difference characterization values are 4/6 (0.67) and 2/6 (0.33), respectively. Since the difference characterization value 0.33 is smaller than the current minimum difference characterization value 0.67, the current minimum difference characterization value is updated to 0.33 (that is, the received text "electric business spending amount" matches the current splicing text "electric business consumption" better). The splicing result of this round is recorded.

The minimum difference characterization value of this splicing result drops from 0.67 to 0.33, so it has not increased continuously thresA times (condition one is met), and the updated minimum difference characterization value 0.33 is smaller than the minimum difference characterization value 0.67 of the previous round (condition two is met). Splicing therefore continues based on "electric business consumption" (the next round of splicing begins).

The segmentation text "amount of money", which follows "consumption", is selected for cumulative splicing, giving the new splicing text "the electric business amount of consumption". This splicing text is processed as described above, and the difference characterization value between the splicing text "the electric business amount of consumption" and the received text "electric business spending amount" is determined to be 0.17. Accordingly, the current minimum difference characterization value is updated from 0.33 to 0.17, and this splicing result is recorded.

The subsequent process proceeds by analogy and is not repeated here.
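
The rounds walked through above can be traced with a short loop. This is a sketch under stated assumptions, not the method of the application: `splice_from`, the dictionary used as inverted index, and the set of segment tuples used as standard scores dictionary are hypothetical; the difference value is assumed to be the edit distance divided by the longer length; and the first round is assumed to pass condition two. Letters again stand in for original-language characters ("WO" for "women", "US" for "user", "P" for the segmentation text that misses the index, "AB" for "electric business", "CD" for "consumption", "F" for "amount of money").

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def difference_value(splice, received):
    """Edit distance normalized by the longer length (assumed normalization)."""
    return edit_distance(splice, received) / max(len(splice), len(received), 1)


def splice_from(start, segments, inverted_index, ngram_library, thres_a=2):
    """Accumulate segments from `start`; return (splicing text, min difference) per round."""
    candidates = inverted_index.get(segments[start])
    if not candidates:
        return []  # index miss: the caller selects a new starting segment
    rounds, splice, previous_min, increases = [], "", None, 0
    for i in range(start, len(segments)):
        splice += segments[i]
        current_min = min(difference_value(splice, text) for text in candidates)
        # Track how many rounds in a row the minimum has risen (condition one).
        if previous_min is not None and current_min > previous_min:
            increases += 1
        else:
            increases = 0
        rounds.append((splice, round(current_min, 2)))
        if i + 1 == len(segments):
            break
        # Condition two: not worse than the previous round, or n-gram present.
        not_worse = previous_min is None or current_min <= previous_min
        in_library = (segments[i], segments[i + 1]) in ngram_library
        if not (increases < thres_a and (not_worse or in_library)):
            break
        previous_min = current_min
    return rounds


segments = ["WO", "US", "P", "AB", "CD", "F"]
index = {"WO": ["WO"], "AB": ["ABXYEF", "ABCDEF"]}
ngrams = {("AB", "CD")}
print(splice_from(0, segments, index, ngrams))
print(splice_from(3, segments, index, ngrams))
```

Starting from "WO" the loop stops after the second round, when the minimum rises from 0 to 0.5; starting from "AB" the minimum falls from 0.67 to 0.33 to 0.17 over three rounds, as in the example.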

It should be noted here that, as one feasible embodiment, the splicing results obtained in the above splicing process can be recorded in a data table.

For example, for the above splicing text "women", the splicing result can be as shown in Table 3 below.

Serial number	Splicing text	Received text	Matching characterization value
1	Women	Women	1.0

Table 3

The splicing results generated in later rounds of splicing can be added cumulatively to Table 3, for example as shown in Table 4 below.

Serial number	Splicing text	Received text	Matching characterization value
1	Women	Women	1.0
2	Electric business	Electric business spending amount	0.33
3	Electric business	Electric business refund amount	0.33

Table 4

It should be pointed out that, in the embodiments of the present application, not every splicing result obtained is added to the above table; rather, the table is updated only after a splicing result meets a certain record condition. Specifically, the record condition may include:

The difference characterization value corresponding to the splicing text < the current minimum difference characterization value.

Of course, in practical application scenarios a large number of splicing results may satisfy this condition, so the number of records would grow. To guarantee processing efficiency, an additional condition can be added to the above record condition, namely: the difference characterization value corresponding to the splicing text < a set difference characterization threshold (for convenience of description, the set difference characterization threshold is denoted "thresB" here). In general, the value of thresB is 0.93.

With the additional condition in effect, a certain number of splicing results can be filtered out directly, which reduces meaningless computation. It should be understood that the higher the value of thresB, the fewer splicing results are filtered out directly (more splicing results may be recorded), which can increase recognition accuracy to some extent but lowers processing efficiency; conversely, the lower the value, the more splicing results are filtered out directly (fewer splicing results are recorded), which lowers recognition accuracy to some extent but improves processing efficiency. In other words, the value of thresB can be set according to the needs of the practical application (0.93 in this example is merely one preferable value).
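
The record condition and its additional thresB condition amount to two comparisons. The sketch below assumes the two are combined with a logical AND; the function name and signature are hypothetical.

```python
def should_record(diff: float, current_min_diff: float, thres_b: float = 0.93) -> bool:
    """Record a splicing result only if it improves on the current minimum
    difference value and also falls below the thresB cut-off."""
    return diff < current_min_diff and diff < thres_b


# A candidate that improves on the current minimum and is below thresB.
print(should_record(0.33, 0.67))
# A candidate above thresB is filtered out even though it improves on the minimum.
print(should_record(0.95, 1.0))
```

Raising `thres_b` lets more candidates through (more records, more work), while lowering it filters more of them out, which is the trade-off described above.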

Continuing this example, suppose that after the splicing is finished, the splicing results obtained are as shown in Table 5 below.

Serial number	Splicing text	Received text	Matching characterization value
1	Women	Women	1.0
2	Constellation	Constellation	1.0
3	The electric business amount of consumption	Electric business spending amount	0.83
4	Electric business consumption	Electric business spending amount	0.67
5	Electric business consumption	Electric business refund amount	0.33
6	Electric business	Electric business spending amount	0.33

Table 5

At this point, the splicing results in Table 5 are screened according to a matching characterization threshold (for convenience of description, the matching characterization threshold is denoted "thresC" here). Assuming the value of thresC is 0.6, each received text whose matching characterization value is greater than 0.6 is selected and normalized.

The processed segmentation result is then obtained:

The electric business spending amount of female user sorts by constellation

Of course, as one manner in the embodiments of the present application, the text type of each key text can be indicated when the recognition result is generated, for example:

{"name": "women", "type": "attribute value", "score": 1.0}

{"name": "constellation", "type": "attribute name", "score": 1.0}

{"name": "electric business spending amount", "type": "business object name", "score": 0.83}

Of course, different embodiments may take different forms; for example, inverse document frequency (Inverse Document Frequency, IDF) information may be added. This should not be construed as limiting the present application.
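
The screening by thresC and the typed output can be sketched as follows, using the rows of Table 5 as input. Keeping only the highest-scoring row per received text is an assumption inferred from the three results listed above, and the `TEXT_TYPES` lookup and the `screen` function are hypothetical names.

```python
# Rows of Table 5: (splicing text, received text, matching characterization value).
TABLE_5 = [
    ("women", "women", 1.0),
    ("constellation", "constellation", 1.0),
    ("the electric business amount of consumption", "electric business spending amount", 0.83),
    ("electric business consumption", "electric business spending amount", 0.67),
    ("electric business consumption", "electric business refund amount", 0.33),
    ("electric business", "electric business spending amount", 0.33),
]

# Hypothetical lookup from received text to its text type.
TEXT_TYPES = {
    "women": "attribute value",
    "constellation": "attribute name",
    "electric business spending amount": "business object name",
}


def screen(rows, thres_c=0.6):
    """Keep received texts scoring above thres_c, best score per received text."""
    best = {}
    for _splice, received, score in rows:
        if score > thres_c and score > best.get(received, 0.0):
            best[received] = score
    return [
        {"name": name, "type": TEXT_TYPES.get(name, "unknown"), "score": score}
        for name, score in best.items()
    ]


for item in screen(TABLE_5):
    print(item)
```

With thresC at 0.6, rows 5 and 6 are dropped, and row 4 is superseded by row 3 because both point to the same received text.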

The above is the text recognition method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application also provide a text recognition device, as shown in Figure 2. The device includes:

a text segmentation module 201, which segments the text to be identified to generate segmentation texts;

a text splicing module 202, which splices the segmentation texts according to the received texts stored in a pre-established received text library to generate a splicing text, wherein the received text library stores at least received text sets corresponding to some or all of the segmentation texts, and each received text set contains at least one received text;

a score value determining module 203, which determines the matching characterization value between the splicing text and the corresponding received text;

a result generation module 204, which selects, according to the matching characterization value, the received text that matches the splicing text and generates a recognition result.

The received text library includes at least an inverted index library and a standard scores dictionary, and the device further includes a text library creation module 205, which obtains received texts; segments the received texts to generate standard segmentation texts; takes each standard segmentation text as an index, collects the received texts that contain or partially contain the index to form a received text set, establishes the correspondence between the received text set and the index, and establishes the inverted index library based on the correspondence; and, according to a set word-segmentation grammar, determines the text combinations in the standard segmentation texts that conform to the word-segmentation grammar and establishes the standard scores dictionary based on the text combinations;

wherein the set word-segmentation grammar includes at least: a bigram grammar and a trigram grammar.
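
The work of the text library creation module can be sketched as follows. This is an illustration under assumptions rather than the module itself: the segmenter is out of scope, so each received text is supplied together with pre-computed segments; the segmentations shown are hypothetical; and only exact containment of a segment is indexed, not the partial containment the description also allows.

```python
from typing import Dict, List, Set, Tuple


def build_library(
    received_texts: Dict[str, List[str]],
    ngram_sizes: Tuple[int, ...] = (2, 3),
) -> Tuple[Dict[str, List[str]], Set[Tuple[str, ...]]]:
    """Build an inverted index (segment -> received texts) and an n-gram set."""
    inverted_index: Dict[str, List[str]] = {}
    ngram_library: Set[Tuple[str, ...]] = set()
    for text, segments in received_texts.items():
        # Every standard segmentation text becomes an index entry.
        for segment in segments:
            texts = inverted_index.setdefault(segment, [])
            if text not in texts:
                texts.append(text)
        # Adjacent segment combinations (bigrams and trigrams by default).
        for n in ngram_sizes:
            for i in range(len(segments) - n + 1):
                ngram_library.add(tuple(segments[i:i + n]))
    return inverted_index, ngram_library


# Hypothetical segmentations of the received texts used in the example.
received = {
    "electric business refund amount": ["electric business", "refund", "amount"],
    "electric business spending amount": ["electric business", "consumption", "amount"],
    "women": ["women"],
}
index, ngrams = build_library(received)
print(index["electric business"])
print(("electric business", "consumption") in ngrams)
```

Storing the combinations as tuples keeps the lookup of a pair of adjacent segmentation texts to a single set membership test.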

The text splicing module 202 selects segmentation texts for splicing according to the order in which the multiple segmentation texts are arranged.

The text splicing module 202 selects a starting segmentation text and looks up, in the inverted index library, the index corresponding to the starting segmentation text;

if an index corresponding to the starting segmentation text is found, it determines the received text set corresponding to the index and performs splicing according to the received texts in the received text set;

if no index corresponding to the starting segmentation text is found, it selects the next segmentation text as the starting segmentation text and performs splicing according to the newly selected starting segmentation text.
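
The selection of a starting segmentation text reduces to scanning forward until a segment hits the inverted index. A minimal sketch, assuming the inverted index is a plain dictionary; `next_start` is a hypothetical helper, and "PARTICLE" is a stand-in for the segmentation text that misses the index in the example.

```python
from typing import Dict, List, Optional, Tuple


def next_start(
    segments: List[str],
    inverted_index: Dict[str, List[str]],
    begin: int = 0,
) -> Optional[Tuple[int, List[str]]]:
    """Return the position of the first segment at or after `begin` that hits
    the index, together with its received text set; None if every segment misses."""
    for position in range(begin, len(segments)):
        received_set = inverted_index.get(segments[position])
        if received_set:
            return position, received_set
    return None


segments = ["women", "user", "PARTICLE", "electric business", "consumption"]
index = {
    "women": ["women"],
    "electric business": ["electric business refund amount",
                          "electric business spending amount"],
}
print(next_start(segments, index, begin=1))
```

Starting the scan at "user" skips two misses and lands on "electric business", mirroring the reselection steps of the worked example.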

The text splicing module 202 selects the adjacent segmentation text arranged after the starting segmentation text and performs cumulative splicing to generate a splicing text; determines the difference characterization value between the splicing text and the received texts in the corresponding received text set; records the splicing text according to the difference characterization value; and judges whether to end splicing with the selected starting segmentation text;

if so, it screens the splicing texts and selects a starting segmentation text again;

otherwise, it selects the adjacent segmentation text arranged after the splicing text and continues the cumulative splicing.

The text splicing module 202 calculates the difference characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.

The text splicing module 202 records the splicing text when the splicing text meets the set record condition.

The text splicing module 202 determines to end splicing with the selected starting segmentation text when the difference characterization value does not meet the set condition, and determines to continue splicing with the selected starting segmentation text when the difference characterization value meets the set condition.

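The set condition includes: the minimum difference characterization value has not increased continuously for a set number of times; and the minimum difference characterization value corresponding to the current splicing text is not greater than the minimum difference characterization value corresponding to the previous splicing text, or the text combination of the currently selected segmentation text and the next segmentation text exists in the standard scores dictionary.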

The score value determining module 203 calculates the matching characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly carried out with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing the logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. Indeed, the devices for realizing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.

The systems, devices, modules, or units described in the above embodiments can be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.

For convenience of description, the above device is described with its functions divided into various units. Of course, when implementing the present application, the functions of the units can be realized in one or more pieces of software and/or hardware.

Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.

Those skilled in the art will understand that embodiments of the present application can be provided as a method, a system, or a computer program product. Accordingly, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for relevant details reference may be made to the description of the method embodiment.

The above is merely an embodiment of the present application and is not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A text recognition method, comprising:

segmenting a text to be identified to generate segmentation texts;

splicing the segmentation texts according to received texts stored in a pre-established received text library to generate a splicing text, wherein the received text library stores at least received text sets corresponding to some or all of the segmentation texts, and each received text set contains at least one received text;

2. The method according to claim 1, wherein the received text library includes at least an inverted index library and a standard scores dictionary;

pre-establishing the received text library specifically includes:

obtaining received texts;

segmenting the received texts to generate standard segmentation texts;

taking each standard segmentation text as an index, collecting the received texts that contain or partially contain the index to form a received text set, establishing the correspondence between the received text set and the index, and establishing the inverted index library based on the correspondence;

determining, according to a set word-segmentation grammar, the text combinations in the standard segmentation texts that conform to the word-segmentation grammar, and establishing the standard scores dictionary based on the text combinations;

3. The method according to claim 2, wherein splicing the segmentation texts specifically includes:

selecting segmentation texts for splicing according to the order in which the multiple segmentation texts are arranged;

4. The method according to claim 3, wherein selecting segmentation texts for splicing specifically includes:

selecting a starting segmentation text;

looking up, in the inverted index library, the index corresponding to the starting segmentation text;

if an index corresponding to the starting segmentation text is found, determining the received text set corresponding to the index, and performing splicing according to the received texts in the received text set;

if no index corresponding to the starting segmentation text is found, selecting the next segmentation text as the starting segmentation text, and performing splicing according to the newly selected starting segmentation text.

5. The method according to claim 4, wherein performing splicing according to the received texts in the received text set specifically includes:

selecting the adjacent segmentation text arranged after the starting segmentation text and performing cumulative splicing to generate a splicing text;

determining the difference characterization value between the splicing text and the received texts in the corresponding received text set;

recording the splicing text according to the difference characterization value, and judging whether to end splicing with the selected starting segmentation text;

6. The method according to claim 5, wherein determining the difference characterization value between the splicing text and the received texts in the corresponding received text set specifically includes:

calculating the difference characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.

7. The method according to claim 5, wherein recording the splicing text specifically includes:

recording the splicing text when the splicing text meets a set record condition;

8. The method according to claim 5, wherein judging whether to end splicing with the selected starting segmentation text specifically includes:

determining to end splicing with the selected starting segmentation text when the difference characterization value does not meet a set condition;

determining to continue splicing with the selected starting segmentation text when the difference characterization value meets the set condition;

the minimum difference characterization value corresponding to the current splicing text is not greater than the minimum difference characterization value corresponding to the previous splicing text, or the text combination of the currently selected segmentation text and the next segmentation text exists in the standard scores dictionary.

9. The method according to claim 5, wherein determining the matching characterization value between the splicing text and the corresponding received text specifically includes:

calculating the matching characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.

10. A text recognition device, comprising:

a text splicing module, which splices the segmentation texts according to received texts stored in a pre-established received text library to generate a splicing text, wherein the received text library stores at least received text sets corresponding to some or all of the segmentation texts, and each received text set contains at least one received text;

a result generation module, which selects, according to the matching characterization value, the received text that matches the splicing text and generates a recognition result.

11. The device according to claim 10, wherein the received text library includes at least an inverted index library and a standard scores dictionary;

the device further includes a text library creation module, which obtains received texts; segments the received texts to generate standard segmentation texts; takes each standard segmentation text as an index, collects the received texts that contain or partially contain the index to form a received text set, establishes the correspondence between the received text set and the index, and establishes the inverted index library based on the correspondence; and, according to a set word-segmentation grammar, determines the text combinations in the standard segmentation texts that conform to the word-segmentation grammar and establishes the standard scores dictionary based on the text combinations;

12. The device according to claim 11, wherein the text splicing module selects segmentation texts for splicing according to the order in which the multiple segmentation texts are arranged;

13. The device according to claim 12, wherein the text splicing module selects a starting segmentation text and looks up, in the inverted index library, the index corresponding to the starting segmentation text;

14. The device according to claim 13, wherein the text splicing module selects the adjacent segmentation text arranged after the starting segmentation text and performs cumulative splicing to generate a splicing text, determines the difference characterization value between the splicing text and the received texts in the corresponding received text set, records the splicing text according to the difference characterization value, and judges whether to end splicing with the selected starting segmentation text;

15. The device according to claim 14, wherein the text splicing module calculates the difference characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.

16. The device according to claim 14, wherein the text splicing module records the splicing text when the splicing text meets a set record condition;

17. The device according to claim 14, wherein the text splicing module determines to end splicing with the selected starting segmentation text when the difference characterization value does not meet a set condition, and determines to continue splicing with the selected starting segmentation text when the difference characterization value meets the set condition;

18. The device according to claim 14, wherein the score value determining module calculates the matching characterization value according to the edit distance between the splicing text and the received texts in the corresponding received text set.