CN110162681A - Text identification, text handling method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN110162681A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811168737.9A
Other languages
Chinese (zh)
Other versions
CN110162681B (en
Inventor
黄子轩
王军伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811168737.9A
Publication of CN110162681A
Application granted
Publication of CN110162681B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a text recognition method, a text processing method, an apparatus, a computer device and a storage medium. The text processing method includes: obtaining an initial input text; obtaining the association relationship corresponding to the target domain of the initial input text, where the association relationship is a relationship between domain words and mapping characters, and the domain words are identified from a text to be identified of the target domain, a general-domain text set and a target text set of the target domain; determining the target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text. Because the adjustment relies on words specific to the domain, the resulting target input text is highly accurate.

Description

Text identification, text handling method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of the Internet, and more particularly to a text recognition method, a text processing method, an apparatus, a computer device and a storage medium.
Background
With the rapid development of the Internet, the problem of information overload has become increasingly prominent. More and more words appear on the network, and in many scenarios the information a user inputs needs to be adjusted to the information the user actually intends to input, for example when showing candidate words for an input pinyin string or correcting the words a user has typed.
At present, when the intended input has to be determined from what the user typed, words similar to the typed word, or words with similar pinyin, are usually screened from a dictionary. The screening therefore returns a large number of words, many of which have little relevance to what the user actually meant, so accuracy is low.
Summary of the invention
In view of the above, it is necessary to provide a text recognition method, a text processing method, an apparatus, a computer device and a storage medium. Domain words of a target domain are identified from a text to be identified, a general-domain text set and the target text set of the target domain to which the text to be identified belongs. The identified domain words are therefore highly relevant to the target domain, and text recognition and text processing are highly accurate.
A text recognition method, comprising: obtaining a text to be identified and obtaining target candidate words from the characters in the text to be identified; obtaining a general-domain text set and a target text set of the target domain corresponding to the text to be identified; calculating the target importance of a target candidate word in the target text set and its reference importance in the general-domain text set; calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance; and, according to the target relevance, taking the target candidate word as a domain word of the target domain.
In one embodiment, calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance comprises: calculating an initial relevance between the target candidate word and the target domain from the target importance and the reference importance; determining a relevance confidence from the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance from the initial relevance and the relevance confidence.
In one embodiment, the text processing method further comprises: detecting the target type of the target input text; and filtering the initial input text when the target type of the target input text is a preset type.
A text recognition apparatus, comprising: a target candidate word obtaining module, configured to obtain a text to be identified and obtain target candidate words from the characters in the text to be identified; a set obtaining module, configured to obtain a general-domain text set and a target text set of the target domain corresponding to the text to be identified; an importance calculation module, configured to calculate the target importance of a target candidate word in the target text set and its reference importance in the general-domain text set; a relevance obtaining module, configured to calculate the target relevance between the target candidate word and the target domain from the target importance and the reference importance; and a domain word obtaining module, configured to take the target candidate word as a domain word of the target domain according to the target relevance.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above text recognition method.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above text recognition method.
With the above text recognition method, apparatus, computer device and storage medium, when word recognition is needed, a text to be identified is obtained and target candidate words are obtained from its characters; a general-domain text set and a target text set of the target domain corresponding to the text to be identified are obtained; the target importance of a target candidate word in the target text set and its reference importance in the general-domain text set are calculated; the target relevance between the target candidate word and the target domain is calculated from the two importance values; and the target candidate word is taken as a domain word of the target domain according to the target relevance. Because the target candidate words come from the text to be identified, and comparing a candidate's importance in the target-domain text set with its importance in the general-domain text set reflects how strongly it relates to the target domain, domain words relevant to the target domain of the text to be identified can be obtained accurately.
A text processing method, comprising: obtaining an initial input text; obtaining the association relationship corresponding to the target domain of the initial input text, where the association relationship is a relationship between domain words and mapping characters, and the domain words are identified from a text to be identified of the target domain, a general-domain text set and the target text set of the target domain; determining the target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text.
A text processing apparatus, comprising: an initial input text obtaining module, configured to obtain an initial input text; an association relationship obtaining module, configured to obtain the association relationship corresponding to the target domain of the initial input text, where the association relationship is a relationship between domain words and mapping characters, and the domain words are identified from a text to be identified of the target domain, a general-domain text set and the target text set of the target domain; a target domain word obtaining module, configured to determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship; and a target input text obtaining module, configured to adjust the initial input text according to the target domain word to obtain a target input text.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above text processing method.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above text processing method.
With the above text processing method, apparatus, computer device and storage medium, the domain word corresponding to a text input in an application can be determined from the relationship between the domain words of the application's target domain and the mapping characters, and the initial input text is adjusted according to that domain word to obtain a target input text. Because the domain words are identified from the application's text to be identified, general-domain text and target-domain text, they are words relevant to the target domain, so the target input text obtained by adjusting for the specific domain is highly accurate.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the text processing method and the text recognition method provided in one embodiment;
Fig. 2 is a flowchart of the text recognition method in one embodiment;
Fig. 3A is a flowchart of the text recognition method in one embodiment;
Fig. 3B is a flowchart of establishing the association relationship between domain words and mapping characters in one embodiment;
Fig. 4 is a flowchart of obtaining target candidate words from the characters in the text to be identified in one embodiment;
Fig. 5 is a flowchart of the text processing method in one embodiment;
Fig. 6 is a flowchart of adjusting the initial input text according to the target domain word to obtain the target input text in one embodiment;
Fig. 7 is a schematic diagram of obtaining a word relation chain in one embodiment;
Fig. 8 is a schematic diagram of obtaining the target input text according to the transition probabilities of the word relation chain in one embodiment;
Fig. 9 is a schematic diagram of displaying the target input text corresponding to the initial input text in one embodiment;
Fig. 10 is a flowchart of the text processing method in one embodiment;
Fig. 11 is a schematic diagram of correcting the initial input text to obtain the target input text in one embodiment;
Fig. 12 is a structural block diagram of the text recognition apparatus in one embodiment;
Fig. 13 is a structural block diagram of the text recognition apparatus in one embodiment;
Fig. 14 is a structural block diagram of the text processing apparatus in one embodiment;
Fig. 15 is a block diagram of the internal structure of the computer device in one embodiment;
Fig. 16 is a block diagram of the internal structure of the computer device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and do not limit it.
It will be understood that the terms "first", "second" and the like used in this application may describe various elements, but unless otherwise stated the elements are not limited by these terms. The terms only distinguish one element from another. For example, without departing from the scope of this application, a first threshold could be called a second threshold and, similarly, a second threshold could be called a first threshold.
Fig. 1 is a diagram of the application environment of the text processing method and the text recognition method provided in one embodiment. As shown in Fig. 1, the application environment includes a terminal 110 and a server 120.
In one embodiment, when domain words need to be obtained, a user may select, on terminal 110, a target text set and a general-domain text set obtained in advance, and terminal 110 sends a text recognition instruction carrying the two sets to server 120. After receiving the instruction, server 120 obtains the text to be identified, executes the text recognition method provided by the embodiments of the present invention, and obtains the domain words of the target domain. Server 120 stores the domain words of the target domain in a dictionary.
A domain word is a word specific to a particular domain: it appears frequently in that domain and rarely in unrelated ones. For example, a domain word may be a word whose occurrence frequency in the specific domain is greater than a first preset frequency and whose occurrence frequency in the general domain is less than a second preset frequency. Both preset frequencies can be set as needed, with the first greater than the second. Words such as "thread" and "compiler" are domain words of the computer field: they generally appear in professional articles on computing and rarely in unrelated fields such as medicine. Words such as "financing is logical" and "current incomes" are domain words of the finance and wealth-management field and generally appear in articles shown in finance applications. The domain words of a professional field can change over time. For example, "financing is logical" is the name of a wealth-management application. Before the application was released, the word was not a domain word of the finance and wealth-management field. As the application came into wider use after its release, the word appeared more and more often in articles of that field and became one of its domain words.
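The two-threshold definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_domain_word` and the sample frequencies are hypothetical and do not come from the patent.

```python
def is_domain_word(domain_freq, general_freq, first_threshold, second_threshold):
    """Apply the two-threshold definition of a domain word.

    domain_freq:  occurrence frequency in the specific domain
    general_freq: occurrence frequency in the general domain
    The description requires the first preset frequency to exceed the second.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    return domain_freq > first_threshold and general_freq < second_threshold


# Hypothetical frequencies: common in the domain, rare in general text.
print(is_domain_word(0.01, 0.0001, first_threshold=0.005, second_threshold=0.001))
```

A word that is frequent everywhere (for example a function word) fails the second condition, which is what keeps common vocabulary out of the domain dictionary.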
In one embodiment, when a user makes a query, for example to consult an intelligent customer-service agent in a shopping application, the user enters a query sentence on terminal 110 and terminal 110 sends it to server 120. Server 120 takes the query sentence as the initial input text, executes the text processing method provided by the embodiments of the present invention to obtain a target input text, obtains the query response data corresponding to the target input text, and returns the data to terminal 110, which displays it.
It will be understood that the above application environment is only an example and does not limit the text processing method provided by the embodiments of the present invention. In some embodiments other application environments are possible. For example, the text processing method may also be executed by the terminal. Server 120 may also take a forum post as the initial input text, execute the text processing method to obtain a target input text, determine from the target input text whether the post is an advertisement, and filter the post when it is. Server 120 may also store the target text set and the general-domain text set in advance, obtain a new text to be identified at preset intervals, and execute the text recognition method.
Server 120 may be an independent physical server, a cluster of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud servers, cloud databases, cloud storage and CDN. Terminal 110 may be, but is not limited to, a smartphone, tablet computer, laptop or desktop computer. Terminal 110 and server 120 may be connected through a network.
As shown in Fig. 2, in one embodiment a text recognition method is proposed. This embodiment is described mainly with the method applied to terminal 110 or server 120 in Fig. 1. The method may include the following steps:
Step S202: obtain a text to be identified, and obtain target candidate words from the characters in the text to be identified.
Specifically, the text to be identified is the text in which domain words are to be identified, and it may be obtained according to a recognition instruction. The recognition instruction may be triggered by a real-time user operation or by a preset trigger condition, for example a condition that triggers the instruction at preset intervals. The recognition instruction may carry the text to be identified, an identifier of the text to be identified, or both. For example, it may carry the storage location of the text, from which the text is then obtained. Target candidate words are obtained from the characters in the text to be identified. A target candidate word may be a word formed from adjacent characters in the text to be identified. The number of characters in a target candidate word can be set as needed, for example 2 or 3.
In one embodiment, the text to be identified may be obtained from the content of the pages of an application corresponding to the target domain. The page content may be one or more of: information published in the application, the session sentences generated during customer-service consultations in the application, and information posted in the application's forum. Text in such an application, for example financial information published in a wealth-management application, query sentences entered by its users and the answers given by customer service, is generally strongly related to the application's domain. Candidate words formed from the characters of such text are therefore likely to be domain words, which improves the efficiency of text recognition.
In one embodiment, each candidate word formed from adjacent characters in the application's text may be taken as a target candidate word, or the characters of the text to be identified may be screened further to obtain the target candidate words. For example, words already determined to be domain words of the domain may be filtered out, or modal particles may be removed from the text to be identified before adjacent characters are combined into candidate words.
In one embodiment, target candidate words may be generated from the adjacency of characters in the text to be identified, for example by combining adjacent characters. The number of characters in a target candidate word can be set as needed, for example 2 or 3. As a concrete example, if the text to be identified is "abcdefg", then "ab", "bc" and so on may be taken as target candidate words, as may "abc", "bcd" and so on. If "ab" has already been confirmed as a domain word of the domain, "ab" need not be taken as a target candidate word.
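The "abcdefg" example can be reproduced with a short Python sketch. The function name `candidate_words` and its parameters are illustrative assumptions, not part of the patent. The sketch slides a window of 2 or 3 adjacent characters over the text and skips words already confirmed as domain words.

```python
def candidate_words(text, sizes=(2, 3), known_domain_words=frozenset()):
    """Build candidate words from adjacent characters.

    sizes:              candidate lengths in characters (2 or 3 in the example)
    known_domain_words: words already confirmed as domain words, which are skipped
    Returns candidates in order of first appearance, without duplicates.
    """
    seen = set()
    result = []
    for n in sizes:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if gram in known_domain_words or gram in seen:
                continue
            seen.add(gram)
            result.append(gram)
    return result


# "ab" is assumed to be a confirmed domain word, so it is not proposed again.
print(candidate_words("abcdefg", sizes=(2,), known_domain_words={"ab"}))
# ['bc', 'cd', 'de', 'ef', 'fg']
```

Removing modal particles before windowing, as the preceding paragraph suggests, would be a preprocessing step applied to `text` and is left out here.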
In one embodiment, target candidate words may also be chosen from the text to be identified according to at least one of word association degree and word independence degree. For example, candidate words whose association degree and independence degree are each higher than a preset value may be selected as target candidate words.
Step S204: obtain a general-domain text set and a target text set of the target domain corresponding to the text to be identified, the general-domain text set including general-domain texts.
Specifically, the domain to which the text to be identified belongs is the target domain. It may be determined from the source of the text: for example, if the text to be identified is a wealth-management article in a wealth-management application, its target domain is the wealth-management field. The target domain may also be obtained from input domain information: when the domain words of a certain domain need to be identified, the text to be identified and its corresponding domain may be received from the user. The target text set is the text set of the target domain. The target domain may be the social, financial or medical field, among others, as needed. The text set of the target domain may be imported into the server, collected by the server from the application corresponding to the target domain, or formed from target texts obtained from other data sources. The general-domain text set is a text set composed of general-domain texts. A general-domain text is a text with weak domain specificity: unlike a highly specialized text, it has no single field of application and is generally applicable. General-domain texts may be, for example, domestic news. Domestic news may be crawled from the network as general-domain text, or general-domain texts may be selected manually and then stored in the server. The general-domain text set and the target text set may be stored in the server that executes the text recognition method or in other servers. One or both of the sets may also be obtained in real time. For example, when text recognition is needed, the texts published in the application within 10 days may be obtained to form the target text set. The number of texts in each set can be set as needed, for example 10,000 texts in the general-domain text set and 100 in the target text set.
In one embodiment, because the general-domain text set consists of general-domain texts, whose data sources are relatively abundant, while the target text set consists of target-domain texts, which are comparatively scarce, the general-domain text set contains more texts than the target text set. For example, the ratio of the number of texts in the general-domain text set to that in the target text set may be 10:1. When counting texts, one article may be treated as one text. For question-and-answer sentences, each query sentence may be treated as one text, or a complete query session may be treated as one text, as needed.
Step S206: calculate the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set.
Specifically, importance indicates how significant a word is in a text set: the greater the importance, the more significant the candidate word is in that set. The target importance is the importance of the target candidate word in the target text set, and the reference importance is its importance in the general-domain text set. Importance may be determined from the number of occurrences of the target candidate word: the more occurrences, the greater the importance. For example, the number of occurrences itself, or an occurrence frequency derived from it, may be used as the importance.
In one embodiment, the importance may be determined from the occurrence frequency of the target candidate word in a text set, expressed as: P_g(w) = C_g(w) / C_g(ALL) and P_f(w) = C_f(w) / C_f(ALL). Here w denotes the target candidate word, P a frequency, C a count, g the general-domain text set and f the target text set. Thus P_g(w) is the occurrence frequency of w in the general-domain text set g, P_f(w) is the occurrence frequency of w in the target text set f, C_g(w) is the number of occurrences of w in g, and C_f(w) is the number of occurrences of w in f. C_g(ALL) is the number of words in g and C_f(ALL) is the number of words in f. The number of words may be the number of characters in the text set or the number of words obtained after word segmentation.
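A minimal Python sketch of the frequency formula follows. It assumes the character count of the text set is used as the total (one of the two options the paragraph allows) and counts overlapping occurrences by substring search. Both function names are hypothetical.

```python
def count_occurrences(word, texts):
    """C(w): number of (possibly overlapping) occurrences of word across all texts."""
    total = 0
    for text in texts:
        start = text.find(word)
        while start != -1:
            total += 1
            start = text.find(word, start + 1)
    return total


def occurrence_frequency(word, texts):
    """P(w) = C(w) / C(ALL), with C(ALL) taken as the total number of characters."""
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        return 0.0
    return count_occurrences(word, texts) / total_chars


target_set = ["abab", "cd"]  # hypothetical target text set f
print(occurrence_frequency("ab", target_set))  # 2 occurrences / 6 characters
```

Calling the same function on the general-domain text set gives the reference importance, so the two values are directly comparable.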
Step S208: calculate the target relevance between the target candidate word and the target domain from the target importance and the reference importance of the target candidate word.
Specifically, the target relevance indicates how strongly the target candidate word relates to the target domain: the greater the relevance, the more closely the word is related to the domain. The target relevance is positively correlated with the target importance and negatively correlated with the reference importance. That is, the target relevance grows as the target importance increases and shrinks as the reference importance increases. In one embodiment, the target relevance may be the quotient of the target importance divided by the reference importance.
In one embodiment, calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance includes: calculating an initial relevance between the target candidate word and the target domain from the target importance and the reference importance; determining a relevance confidence from the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance from the initial relevance and the relevance confidence.
Specifically, the relevance confidence indicates how reliable the initial relevance is, and it is positively correlated with the number of occurrences of the target candidate word in the target text set. When a target candidate word occurs only a few times in both the target text set and the general-domain text set, the computed relevance can come out misleadingly high. A confidence coefficient derived from the number of occurrences in the target text set is therefore used to adjust the relevance, so that the resulting target relevance is more accurate. In one embodiment, the target relevance may be calculated as: X(w) = (P_f(w) / P_g(w)) × log2(C_f(w)), where X(w) is the target relevance, the quotient P_f(w) / P_g(w) is the initial relevance, and log2(C_f(w)) is the relevance confidence. Because the number of occurrences of the target candidate word in the target text set is taken into account through the relevance confidence, a more accurate target relevance can be obtained.
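The relevance formula, the frequency ratio multiplied by log2 of the occurrence count in the target text set, can be sketched as below. The function name `target_relevance` and the `general_floor` parameter are assumptions added for illustration. The source does not say how a zero general-domain frequency is handled, so the floor is only a guard against division by zero.

```python
import math


def target_relevance(cf, cf_all, cg, cg_all, general_floor=1e-9):
    """X(w) = (Pf(w) / Pg(w)) * log2(Cf(w)).

    cf, cf_all: occurrences of w and total word count in the target text set
    cg, cg_all: occurrences of w and total word count in the general-domain text set
    general_floor: assumed lower bound for Pg(w), not specified by the source
    """
    if cf <= 0:
        return 0.0  # a word absent from the target text set has no relevance
    pf = cf / cf_all
    pg = max(cg / cg_all, general_floor)
    initial_relevance = pf / pg
    confidence = math.log2(cf)
    return initial_relevance * confidence


# Hypothetical counts: 8 hits in 1,000 target words, 2 hits in 100,000 general words.
print(round(target_relevance(8, 1000, 2, 100000)))  # ratio 400 times confidence 3
```

A candidate that occurs only once in the target text set has a confidence of log2(1) = 0, so its relevance is 0 however large the frequency ratio is, which matches the stated purpose of damping scores based on very few occurrences.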
Step S210, take the target candidate word as a domain term of the target domain according to the target degree of correlation.
Specifically, if the target degree of correlation is greater than (or greater than or equal to) a second threshold, the target candidate word can be taken as a domain term of the target domain; if the target degree of correlation is less than the second threshold, it can be determined that the target candidate word is not a domain term of the target domain. Alternatively, when the target degree of correlation is less than a preset threshold, other methods can be combined to determine whether the target candidate word is a domain term of the target domain. For example, target candidate words whose target degree of correlation is less than the second threshold and greater than a third threshold are displayed on the display interface of a terminal, and whether each of them is a domain term of the target domain is determined according to the user's selection operation on it: if the selection operation confirms that the target candidate word is a domain term of the target domain, the target candidate word can be taken as a domain term of the target domain.
In one embodiment, if there are multiple target candidate words, they can be sorted by target degree of correlation in descending order, and the top P target candidate words are taken as domain terms of the target domain. Alternatively, among the top m target candidate words, those whose target degree of correlation is greater than a fourth threshold are taken as domain terms of the target domain. Here P and m are integers greater than 1, and the specific values of P, m, the second threshold, the third threshold and the fourth threshold can be set as needed.
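The two selection rules just described (top-ranked candidates, optionally filtered by a threshold) can be sketched as follows. The function name, parameter names and sample scores are hypothetical and serve only as an illustration.

```python
def select_domain_terms(scores, top_n=None, threshold=None):
    """Pick domain terms from {candidate word: target degree of correlation}.

    top_n:     keep only the top_n highest-scoring candidates (P or m in the text)
    threshold: additionally keep only candidates scoring above this value
    """
    # Sort candidates by score, largest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if threshold is not None:
        ranked = [(word, score) for word, score in ranked if score > threshold]
    return [word for word, _ in ranked]


# Made-up scores for illustration only.
example_scores = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 0.2}
top_two = select_domain_terms(example_scores, top_n=2)
top_three_above = select_domain_terms(example_scores, top_n=3, threshold=2.0)
```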
With the above text recognition method, device, computer equipment and storage medium, a text to be identified is obtained, and a target candidate word is obtained according to the characters in the text to be identified; a general-domain text set and the target text set of the target domain corresponding to the text to be identified are obtained; the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set are calculated; the target degree of correlation between the target candidate word and the target domain is calculated according to the target importance and the reference importance; and the target candidate word is taken as a domain term of the target domain according to the target degree of correlation. Because the target candidate word is obtained from the text to be identified, and comparing its importance in the target-domain text set with its importance in the general-domain text set shows whether it is related to the target domain, domain terms related to the target domain of the text to be identified can be obtained with high accuracy.
After a domain term is obtained, it can be stored in the dictionary corresponding to the target domain. The domain terms stored in the dictionary can be used to judge whether a text belongs to the target domain, and can also be used to adjust words in an application, for example by correcting a mistyped word entered by the user to the corresponding domain term according to the domain of the application. In one embodiment, after the user inputs the pinyin characters corresponding to a domain term in the application, the domain term is offered as a candidate word for those pinyin characters, which improves the efficiency of inputting words in the application.
For example, suppose a user needs to communicate with an artificial-intelligence customer service agent in a financing application, and the sentence entered contains mistyped words (rendered literally, "how do I show interest in in show interest in"). The sentence can be corrected to "how do I do financing in Licaitong"; the answer sentence corresponding to the corrected sentence is then obtained and output.
In one embodiment, as shown in Figure 3A, after the target candidate word is taken as a domain term of the target domain according to the target degree of correlation, the method further includes:
Step S302, determine the mapping characters corresponding to the domain term according to a mapping relation, the mapping relation including at least one of a near-shape mapping and a near-sound mapping.
Specifically, a near-shape mapping is a mapping relation between characters with similar glyph structures, and a near-sound mapping is a mapping relation between characters with similar phonetic notation. The rules for deciding whether glyph structures are similar and whether phonetic notations are similar can be set as needed. For near-shape mapping, whether two characters are similar can be determined from a near-shape character dictionary. For near-sound mapping, two phonetic notations can be treated as near-sound when they are identical, when they differ by one phonetic symbol, or when they differ by two phonetic symbols; one or more of these rules may be used. A mapping character can be a mapped word, its corresponding phonetic symbols, or both. The mapping relations are set in advance, so once a domain term is obtained, its mapping characters can be obtained from the mapping relations.
As a practical example, suppose "Licaitong" (pinyin "licaitong") is obtained as a domain term of the financing domain. According to the near-shape mapping relation, the near-shape character of its first character (literally "reason") is the character literally meaning "inner", so the mapping characters of "Licaitong" may include the variant in which the first character is replaced by that near-shape character. According to the near-sound mapping relation, the near-sound character of its last character (pinyin "tong") is the character literally meaning "same", and the corresponding near-sound pinyin is "ton". The mapping characters of "Licaitong" may therefore also include the variant ending in the "same" character and the pinyin "licaiton".
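One possible way to implement the near-sound rule (phonetic notations that are identical or differ by one or two phonetic symbols) is an edit-distance check on pinyin strings. The sketch below is an assumption rather than the method of this embodiment: it treats each Latin letter as one phonetic symbol and uses the Levenshtein distance; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # delete ch_a
                           cur[j - 1] + 1,       # insert ch_b
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def is_near_sound(pinyin_a: str, pinyin_b: str, max_diff: int = 2) -> bool:
    """Treat two pinyin strings as near-sound if they differ by at most
    max_diff symbols (assumption: one Latin letter = one phonetic symbol)."""
    return edit_distance(pinyin_a, pinyin_b) <= max_diff
```

With this rule, "licaiton" counts as near-sound to "licaitong" because the two strings differ by a single letter.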
Step S304, establish the association relation between the domain term and the mapping characters.
Specifically, after the mapping characters are obtained, the domain term and its mapping characters can be stored in association, establishing the association relation between them. For example, an association dictionary can be created that stores the association relations between domain terms and mapping characters.
In one embodiment, the association relation between domain terms and mapping characters can be implemented through a dictionary: the relations can be stored in an error-correction mapping dictionary that includes a near-sound mapping table and a near-shape mapping table. Figure 3B is a flowchart of establishing the association relation between domain terms and mapping characters in one embodiment. After a set of new domain terms is obtained by the new-word discovery module, a preset near-shape character dictionary can be used to obtain the near-shape words corresponding to each domain term, and the mapping relation between near-shape words and domain terms is established to obtain the near-shape mapping table, as shown in Table 1. A phonetic notation module can be used to obtain the near-sound phonetic characters corresponding to each new domain term, and the mapping relation between near-sound words and domain terms is established to obtain the near-sound mapping table, as shown in Table 2. Table 1 and Table 2 are stored in the error-correction mapping dictionary.
In one embodiment, association relations between non-domain terms and mapping characters can also be stored in the above dictionary. For example, assuming "what" is a non-domain term, the mapping table can be as shown in Table 3.
Table 3
Pinyin       Word
licaitong    Licaitong
shouyi       income
licai        financing
shenme       what
In one embodiment, the above association relations can be updated dynamically online. For example, the text recognition method can be executed at preset intervals to obtain domain terms, obtain the mapping characters corresponding to them, and establish the association relations between the domain terms and the mapping characters. In this way, new domain terms are obtained continuously.
In one embodiment, as shown in Figure 4, obtaining the target candidate word according to the characters in the text to be identified includes:
Step S402, generate an initial candidate word set according to the adjacency of characters in the text to be identified.
Specifically, the number of characters in an initial candidate word can be determined as needed; for example, it may be 2 or 3. The characters in an initial candidate word are adjacent characters in the text to be identified. When the text to be identified is obtained, any character combination formed by adjacent characters in it can be taken as an initial candidate word. The number of initial candidate words in the initial candidate word set can also be determined as needed. For example, all combinations of two adjacent characters in the text to be identified can be taken as initial candidate words, or the set can be obtained after removing words already determined to belong to the target domain and invalid words such as modal particles and auxiliary words. In one embodiment, the character combinations generated from the adjacency of characters in the text to be identified can be compared with the words in a dictionary and/or lexicon, and the combinations that do not appear there are taken as initial candidate words; this reduces the number of initial candidate words, and the words obtained are new words.
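As a minimal sketch of step S402, the function below enumerates runs of adjacent characters of the chosen lengths and drops any combination already present in a known-word set. The function name, the parameter names and the idea of passing the dictionary as a Python set are assumptions made for illustration.

```python
def initial_candidates(text, sizes=(2, 3), known_words=frozenset()):
    """Return the set of adjacent-character combinations of the given sizes
    that are not already in known_words (hypothetical helper)."""
    candidates = set()
    for n in sizes:
        for i in range(len(text) - n + 1):
            combo = text[i:i + n]  # n adjacent characters starting at i
            if combo not in known_words:
                candidates.add(combo)
    return candidates
```

For instance, `initial_candidates("abcd", sizes=(2,), known_words={"bc"})` keeps only the combinations "ab" and "cd".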
Step S404, calculate the word association degree and the word independence degree, in the target text set, of each initial candidate word in the initial candidate word set.
Specifically, the word association degree indicates how tightly the characters that make up a word are bound together. An initial candidate word with a high association degree has a high probability of occurring in the application. The word independence degree indicates how likely the word is to stand alone as a word. A high word independence degree means the initial candidate word is likely to be a complete word. Both the word association degree and the word independence degree are obtained from the target text set.
In one embodiment, the word association degree can be represented by the PMI (Pointwise Mutual Information) of the initial candidate word. PMI measures the correlation between two random variables. The probability that the initial candidate word occurs and the probability that each of its characters occurs can be obtained from the target text set, and the PMI is obtained from these probabilities. For example, for an initial candidate word composed of "xy", the PMI can be calculated with formula (1), where p(xy) is the probability that the initial candidate word "xy" occurs and p(x), p(y) are the probabilities that "x" and "y" occur, with P(y|x) = C(xy)/C(x), P(xy) = P(x) * P(y|x), P(x) = C(x)/C(ALL) and P(y) = C(y)/C(ALL). C(xy), C(x) and C(y) are the numbers of times "xy", "x" and "y" occur in the target text set. P(y|x) is the probability, in the target text set, that the next character is "y" given that "x" occurs.
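The PMI computation can be sketched as below. Formula (1) is not reproduced in this text, so the sketch assumes the standard definition PMI = log2(P(xy) / (P(x) * P(y))) and interprets C(ALL) as the total number of characters in the target text set; both are assumptions, and the function names are hypothetical.

```python
import math

def count_occurrences(corpus, s):
    """Count (possibly overlapping) occurrences of s across all texts."""
    return sum(
        1
        for text in corpus
        for i in range(len(text) - len(s) + 1)
        if text.startswith(s, i)
    )

def pmi(corpus, x, y):
    """Pointwise mutual information of the candidate word x + y."""
    total = sum(len(text) for text in corpus)  # C(ALL), assumed: all characters
    c_x, c_y, c_xy = (count_occurrences(corpus, s) for s in (x, y, x + y))
    if min(c_x, c_y, c_xy) == 0:
        raise ValueError("x, y and xy must all occur in the corpus")
    p_x = c_x / total           # P(x) = C(x) / C(ALL)
    p_y = c_y / total           # P(y) = C(y) / C(ALL)
    p_xy = p_x * (c_xy / c_x)   # P(xy) = P(x) * P(y|x)
    return math.log2(p_xy / (p_x * p_y))
```

Because `x` and `y` may themselves be multi-character strings, the same function also covers the case in which an existing candidate is treated as a single unit.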
In one embodiment, the PMI can be normalized and the normalized PMI used as the word association degree. In one embodiment, the normalized PMI is calculated as N_PMI = PMI / H(x) or N_PMI = PMI / H(y), where N_PMI denotes the normalized PMI, H(x) = P(x) * log2 P(x) and H(y) = P(y) * log2 P(y). One of PMI / H(x) and PMI / H(y) can be taken as the normalized PMI, for example the smaller of the two. Therefore, when calculating the normalized PMI, the smaller of H(x) and H(y) can be obtained and PMI divided by that smaller value to obtain the normalized PMI.
In one embodiment, the word independence degree can be determined from the entropy of the initial candidate word, which may be at least one of the left entropy and the right entropy. Entropy represents the amount of information: the left entropy represents the information content of the context preceding the initial candidate word, and the right entropy represents that of the context following it. The left and right entropies reflect how varied the contexts of the initial candidate word are. A high left entropy means the word combines with many different preceding characters, and a high right entropy means it combines with many different following characters. A word with such varied collocations is relatively free, so it is likely to stand alone as a word. A low entropy indicates that the collocations are limited and the word has to be used together with fixed characters, so it is less likely to stand alone as a word. The left and right entropies can be calculated as in formulas (2) and (3), where EL(W) is the left entropy of the initial candidate word and ER(W) is its right entropy. A is the set of characters located to the left of the initial candidate word in the target text set, and a is a character in A; p(aW|a) is the probability, in the target text set, that W occurs given that a occurs; P(Wb|W) is the probability, in the target text set, that b occurs given that W occurs.
In one embodiment, the word independence degree can be determined from the sum of the left entropy and the right entropy of the initial candidate word: E = EL(W) + ER(W), and E can be used as the word independence degree of the initial candidate word.
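A rough sketch of the left and right entropy follows. Formulas (2) and (3) are not reproduced in this text, so the sketch uses the common neighbor-entropy form, which conditions on W for both sides; the text above describes the left-side probability as conditioned on the left character instead, so the exact form here is an assumption. The function name is hypothetical.

```python
import math
from collections import Counter

def neighbor_entropies(corpus, word):
    """Return (left entropy, right entropy) of word over a list of texts."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(word)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1   # character just before the word
            end = start + len(word)
            if end < len(text):
                right[text[end]] += 1        # character just after the word
            start = text.find(word, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)


# "ab" is preceded by two different characters but always followed by "z".
e_left, e_right = neighbor_entropies(["xabz", "yabz"], "ab")
independence = e_left + e_right  # E = EL(W) + ER(W)
```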
Step S406, calculate the word generation degree of each initial candidate word according to the word association degree and the word independence degree.
Specifically, the word generation degree measures how likely the initial candidate word is to be a newly formed word. The word association degree is positively correlated with the word generation degree, and the word independence degree is also positively correlated with the word generation degree. In one embodiment, a word generation degree mapping value can be obtained from the word independence degree, and the word generation degree is obtained from the word association degree and that mapping value.
In one embodiment, calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree includes: determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set; determining an initial word association degree of the initial candidate word according to its probability of occurrence in the target text set; and calculating a target word association degree according to the association confidence and the initial word association degree of the initial candidate word.
Specifically, the association confidence reflects how reliable the calculated word association degree is. The initial word association degree can be obtained with the PMI calculation described above. When the word association degree is calculated, a small total number of words in the text can make the probability of an initial candidate word appear high, which inflates the calculated word association degree; the association confidence can therefore be determined from the number of occurrences of the word. After the initial association degree is calculated, it is adjusted according to the association confidence to obtain the target word association degree. For example, the target word association degree can be the product of the initial word association degree and the association confidence.
In one embodiment, the word generation degree can be calculated with formula (4), where U(W) denotes the word generation degree of the initial candidate word W, N_PMI(W) denotes the word association degree of W, and C(W) is the number of occurrences of W in the target text set. h(W) denotes the penalty value corresponding to the word independence degree, which is determined by the range in which the word independence degree falls: a range of larger values has a smaller penalty value than a range of smaller values. For example, the penalty value can be set to 3 when the word independence degree is less than 1, and to 0 when the word independence degree is greater than or equal to 1.
U(W) = N_PMI(W) * log(C(W)) - h(W)    (4)
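Formula (4) with the example penalty values can be sketched as follows. The base of the logarithm is not stated, so the natural logarithm is assumed here; the function names are hypothetical, and the penalty values 3 and 0 are the example values given in the text.

```python
import math

def independence_penalty(independence: float) -> float:
    """h(W): example penalty from the text (3 below 1, otherwise 0)."""
    return 3.0 if independence < 1 else 0.0

def word_generation_degree(n_pmi: float, count: int, independence: float) -> float:
    """U(W) = N_PMI(W) * log(C(W)) - h(W), natural log assumed."""
    if count < 1:
        raise ValueError("occurrence count must be at least 1")
    return n_pmi * math.log(count) - independence_penalty(independence)
```

The log(C(W)) factor plays the role of the association confidence: a candidate seen only once contributes nothing from its association degree and is left with the penalty alone.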
Step S408, screen the initial candidate word set according to the word generation degree of each initial candidate word to obtain the target candidate words.
In one embodiment, the initial candidate words whose word generation degree is greater than a fifth threshold can be taken as target candidate words. Alternatively, the initial candidate words in the set can be sorted by word generation degree in descending order and the top d initial candidate words taken as target candidate words. Alternatively, among the top e initial candidate words, those whose word independence degree is greater than a sixth threshold are taken as target candidate words. The specific values of d, e, the fifth threshold and the sixth threshold can be set as needed.
In one embodiment, the text to be identified is obtained from the target text set. Screening by word association degree and word independence degree yields target candidate words that are new words appearing in the target text set, and these new words are then judged as to whether they are domain terms; new domain terms can therefore be obtained from the target text set.
In one embodiment, the text recognition method further includes the following steps: when the word independence degree of an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to it in the text to be identified; and adding the new initial candidate word to the initial candidate word set.
Specifically, a small word independence degree means the initial candidate word is unlikely to stand alone as a word and may need to be combined with other characters to become an independent word. A first threshold can therefore be set and the word independence degree of the initial candidate word compared with it. If the degree is less than the first threshold, the character adjacent to the initial candidate word in the text to be identified is obtained and combined with the initial candidate word to form a new initial candidate word, which is added to the initial candidate word set. The word association degree and word independence degree of the new initial candidate word are then calculated, its word generation degree is calculated from them, and target candidate words are screened from the initial candidate word set. In this embodiment of the present invention, when the word independence degree is less than the first threshold, the adjacent character of the initial candidate word continues to be taken to form a new initial candidate word, so more accurate and more numerous domain terms can be obtained.
In one embodiment, when the word independence degree and/or the word association degree of the new initial candidate word is calculated, the initial candidate word before the adjacent character was added can be treated as a single unit. For example, suppose the initial candidate word before the adjacent character is added is "ab", the word independence degree of "ab" is 1 and the first threshold is 2. Because the word independence degree of "ab" is less than the first threshold, the character "c" adjacent to "ab" can be appended to "ab" to form the new initial candidate word "abc". When the word association degree of "abc" is calculated, "ab" is treated as a whole, that is, as one character. Therefore, when the PMI of "abc" is calculated, "ab" can be used as "x" in formula (1) and "c" as "y" in formula (1).
As shown in Figure 5, in one embodiment a text processing method is proposed. This embodiment is described mainly as applied to the terminal 110 or the server 120 in Figure 1 above. The method may specifically include the following steps:
Step S502, obtain an initial input text.
Specifically, the initial input text is a text that needs to be processed so that its words can be checked and a correct target input text obtained. The initial input text may be a text already published in an application, or a text still being entered, such as text typed into an input box of the application. It can be understood that when a specific web page is opened in a browser, that web page can be regarded as the page of an application, namely the web version of the application. For example, if a user posts a comment in the forum of an application, the comment can be used as the initial input text. If a query sentence containing mistyped words (rendered literally, "how do I show interest in in show interest in") is entered on the customer-service conversation page of the "Licaitong" website, that sentence is used as the initial input text. Here "Licaitong" is the name of a financing application.
Step S504, obtain the association relation corresponding to the target domain of the initial input text, where the association relation is the association relation between domain terms and mapping characters, and the domain terms are identified according to the text to be identified corresponding to the target domain, the general-domain text set and the target text set corresponding to the target domain.
Specifically, the target domain corresponding to the initial input text can be determined from the source of the initial input text; for example, it can be the domain of the application to which the initial input text belongs. If the initial input text is obtained in a medical app (application), the target domain is the medical domain. The target domain can also be obtained from domain information that is input: when a text of a certain domain needs to be processed, the text to be processed and its domain can be received as user input.
The association relation corresponding to the target domain is set in advance; that is, the association relations between the domain terms of the target domain and their mapping characters are preset. When a text needs to be adjusted, the association relations between the domain terms of the target domain and the mapping characters can be retrieved in order to obtain the domain terms corresponding to the initial input text and correct the initial input text. The association relation between a domain term and a mapping character can be at least one of a near-shape association and a near-sound association. The domain terms are identified according to the text to be identified, the general-domain text set and the target text set corresponding to the target domain: target candidate words can be obtained according to the adjacency of characters in the text to be identified of the application, and a target candidate word is determined to be a domain term of the target domain according to its importance in the general-domain text set and in the target text set. The identification of domain terms can follow the text recognition method in the above embodiments and is not repeated here.
Step S506, determine the target domain term corresponding to the initial input text according to the initial input text and the association relation.
Specifically, the characters in the initial input text can be obtained and matched against the mapping characters, and the domain term corresponding to a matched mapping character is taken as a target domain term. In one embodiment, when the initial input text is obtained, it can first be segmented into a word sequence; each word in the word sequence is matched against the mapping characters, and the domain term corresponding to a matched mapping character is taken as a target domain term.
Step S508, adjust the initial input text according to the target domain term to obtain the target input text.
Specifically, after the target domain term is obtained, it can replace the corresponding characters in the initial input text to obtain the target input text. When there are multiple target domain terms and/or other non-domain terms are also involved, the target domain terms can be screened before the domain terms of the initial input text are adjusted. The screening can use, for example, an n-gram language model such as a 2-gram (bigram) or 3-gram (trigram) model.
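Steps S506 and S508 can be illustrated with a simple lookup-and-replace over a segmented input. This is only a sketch under stated assumptions: the input is already segmented into words, the association relation is a plain Python dictionary from mapping character to domain term, and the n-gram screening step is omitted. The function name and sample entries are hypothetical.

```python
def correct_words(words, mapping_to_term):
    """Replace each word that matches a mapping character with its domain term.

    Returns (corrected word list, list of matched target domain terms).
    """
    corrected, matched = [], []
    for word in words:
        term = mapping_to_term.get(word)
        if term is None:
            corrected.append(word)   # no match: keep the original word
        else:
            corrected.append(term)   # match: substitute the domain term
            matched.append(term)
    return corrected, matched


# Hypothetical association relation: near-sound pinyin -> domain term.
association = {"licaiton": "licaitong"}
result, terms = correct_words(["licaiton", "shouyi"], association)
```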
With the above text processing method, the domain term corresponding to a text input in an application can be determined according to the relation between the domain terms of the application's target domain and the mapping characters, and the initial input text is adjusted according to that domain term to obtain the target text. Because the domain terms are identified from the text to be identified, the general-domain text and the text of the target domain, they are words related to the target domain, so the target input text obtained by adjustment is highly accurate for the specific domain; a target input text corresponding to the target domain of the application is thus obtained, with high domain specificity and adaptability. Furthermore, because domain terms are distinguished by target domain, the number of words in the error-correction association vocabulary can be reduced, improving the efficiency of word processing.
In one embodiment, after the target input text is obtained, it can also be used to update the target text set, with the target input text added as a text in the target text set.
In one embodiment, as shown in Figure 6, step S508, adjusting the initial input text according to the target domain term to obtain the target input text, includes:
Step S602, obtain the candidate input words corresponding to the initial input text.
Specifically, the initial input text may include one or more words. The initial input text can be segmented to obtain a word sequence, and the candidate input words corresponding to each word are obtained as the candidate input words of the initial input text. Word association relations are set in advance, and the candidate input words are obtained according to them. For example, "what" can be set as the associated word of the mistyped word rendered here as "careful". Then, if the initial input text includes "careful", the corresponding candidate input word "what" can be obtained.
For example, if error correction is to be performed on the initial input text, a spelling error-correction association dictionary can be set up that stores the association relations between input words and domain terms, as shown in Table 1 and Table 2. When the initial input text consists of phonetic symbols such as pinyin, the corresponding domain term can be obtained from the pinyin. In one embodiment, when the initial input text includes words, the words can first be converted to phonetic symbols and the domain term corresponding to the phonetic symbols then obtained. For example, suppose the initial input text includes a mistyped word (rendered literally, "example ability ton") whose pinyin is "licaiton"; according to Table 2, the target domain term corresponding to "licaiton" is "Licaitong".
In one embodiment, after the initial input text is segmented into a word sequence, the near-shape words corresponding to each word in the sequence can be obtained as candidate words for the initial input text. Alternatively, the pinyin sequence corresponding to each word sequence can be obtained, and the candidate words corresponding to each pinyin in the pinyin sequence are taken as candidate words for the initial input text. It can be understood that the near-shape words and the pinyin-based candidate words can also both be used as candidate words for the initial input text.
In one embodiment, if error correction is to be performed on the initial input text, error position detection can be performed on it, and the candidate input words corresponding to the words at the error positions are obtained. Error position detection can be performed with a machine learning model. In one embodiment, error position detection may include: calculating the transition probability between each pair of adjacent words in the initial input text, and taking the positions where the transition probability is lower than a preset value as error positions. The transition probability between adjacent words can be calculated as P(G|F) = C(FG) / C(F), where "F" and "G" are adjacent words in the initial input text with "F" before "G", C(FG) is the number of occurrences of "FG" in a preset text set, and C(F) is the number of occurrences of "F" in the preset text set. The preset text set is a text set corresponding to the target domain; for example, it can be the target text set.
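The transition-probability check P(G|F) = C(FG) / C(F) can be sketched as follows. The sketch assumes the preset text set is available as a list of already segmented sentences and flags the position of "G" whenever the probability falls below the preset value; the function name, parameter names and sample sentences are hypothetical.

```python
from collections import Counter

def error_positions(words, corpus_sentences, min_prob):
    """Return indexes i where P(words[i] | words[i-1]) < min_prob."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        unigrams.update(sentence)                    # C(F)
        bigrams.update(zip(sentence, sentence[1:]))  # C(FG)
    positions = []
    for i in range(1, len(words)):
        f, g = words[i - 1], words[i]
        # An unseen F gives probability 0 instead of dividing by zero.
        prob = bigrams[(f, g)] / unigrams[f] if unigrams[f] else 0.0
        if prob < min_prob:
            positions.append(i)
    return positions


# Tiny made-up domain corpus for illustration.
domain_corpus = [["my", "income", "is", "high"], ["my", "income", "rose"]]
flagged = error_positions(["my", "handicraft", "is", "high"], domain_corpus, 0.5)
```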
In one embodiment, error position detection is less effective in the application scenario of a target domain than in the general domain, because a text that is entirely unproblematic in the general domain may still need correction in the target domain. For example, "my handicraft" is unproblematic as general-domain text, but in the financial domain the correct text should be "my income"; likewise "skeleton" is an unproblematic word in the general domain, but in the financial domain the correct word should be "stock price". Therefore, when error detection is performed on the initial input text of an application in a target domain, every position in the initial input text can be treated as an error position.
Step S604, construct a word relation chain set according to the composition relation of the words in the initial input text, the candidate input words and the target domain terms.
Specifically, the word relation chain set includes one or more word relation chains. A word relation chain is a chain formed by connecting words in sequence. The composition relation of words refers to the order and connections between the words in a text. The composition relation between words in the initial input text is fixed. For example, if the initial input text is "today is Friday", the words after segmentation are "today", "is" and "Friday", and they are connected in that order. After the candidate input words and target domain terms are obtained, they are likewise connected in sequence according to the composition relation of the words of the initial input text to obtain the corresponding word relation chains. Because the initial input text may have one or more segmentations, and each segmented word may correspond to one or more candidate input words, there may be one or more word relation chains.
As shown in Fig. 7, the method of obtaining word relation chains is described below, taking as an example the initial input text "it is careful product that inner wealth is logical" (a query containing homophone errors) on which homophone error correction is performed. First, the pinyin corresponding to each character of the initial input text is obtained, giving the pinyin sequence "li, cai, tong, shi, shen, me, chan, pin". A pinyin segmentation algorithm is then used to segment this sequence, yielding one pinyin sequence composed of "li, cai, tong", "shi", "shen, me", "chan, pin" and another composed of "li, cai", "tong, shi", "shen, me", "chan, pin". Candidate input words are then obtained from a lookup table that maps pinyin to candidate words; the lookup table may include the pinyin of domain terms and the pinyin of non-domain terms. For example, among the candidate words, "Licaitong" (pinyin "li, cai, tong") and "financing" (pinyin "li, cai") are domain terms, and the other candidate words are non-domain terms. After the candidate input words are obtained, word relation chains are constructed according to the composition relationship of the words of the initial input text. In Fig. 7, the word relation chains may include "financing → simultaneously → what → product", "financing → colleague → what → product", "Licaitong → is → what → product", and "Licaitong → when → what → product", four word relation chains in total.
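The construction of word relation chains from pinyin segmentations can be sketched as follows. This is a minimal illustration, assuming the pinyin segmentations have already been produced and that the pinyin-to-candidate lookup table is a plain dictionary. The function name `build_word_relation_chains` and the table contents are hypothetical; the labels "financing" and "Licaitong" stand for the candidate words whose pinyin is "li, cai" and "li, cai, tong" respectively.

```python
from itertools import product

def build_word_relation_chains(segmentations, pinyin_to_words):
    """Expand each pinyin segmentation into all candidate word chains."""
    chains = []
    for segments in segmentations:
        # Candidate words for each pinyin segment of this segmentation.
        options = [pinyin_to_words.get(seg, []) for seg in segments]
        # Assumption: a segmentation with an unmatched segment is skipped.
        if all(options):
            chains.extend(product(*options))
    return chains

# Hypothetical lookup table mirroring the Fig. 7 example.
table = {
    ("li", "cai"): ["financing"],
    ("li", "cai", "tong"): ["Licaitong"],
    ("tong", "shi"): ["simultaneously", "colleague"],
    ("shi",): ["is", "when"],
    ("shen", "me"): ["what"],
    ("chan", "pin"): ["product"],
}
segmentations = [
    [("li", "cai", "tong"), ("shi",), ("shen", "me"), ("chan", "pin")],
    [("li", "cai"), ("tong", "shi"), ("shen", "me"), ("chan", "pin")],
]

chains = build_word_relation_chains(segmentations, table)
for chain in chains:
    print(" -> ".join(chain))
```

Each segmentation contributes the Cartesian product of its per-segment candidates, which is why the number of chains grows with both the number of segmentations and the number of candidates per segment.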
It will be appreciated that the homophone error correction above is only an example. In practical applications, similar-shape character error correction may also be performed on the initial input text, or homophone error correction and similar-shape character error correction may be performed on the initial input text at the same time.
Step S606: calculate the transition probability of transferring from the preceding word to the current word in the word relation chain.
Specifically, a preceding word is a word located before the current word in the word relation chain. The preceding words may be all of the words before the current word or a preset number of them, for example 1 or 2 words, as determined by the language model used. The transition probability is the probability that the current word occurs given that specific preceding words have occurred. If the word relation chain is regarded as a hidden Markov state chain, the transition probability is the probability of transferring from the preceding state to the current state. The transition probability may be written as p(J|I), the probability that the current word J occurs given the preceding word I. A word relation chain contains multiple words; each word of the chain is taken in turn as the current word, and the corresponding transition probability is calculated. For example, if a 2-gram model is used, the transition probability of transferring from the preceding 1 word to the current word is calculated; if a 3-gram model is used, the transition probability of transferring from the preceding 2 words to the current word is calculated.
When calculating the transition probability, the preceding word and the current word may be combined and treated as a whole. A first count is obtained, namely the number of times this combination occurs as a whole in the text set corresponding to the target domain, together with a second count, namely the number of times the preceding word occurs in that text set, and the transition probability is obtained from the first count and the second count. For example, p(J|I) = Count(IJ)/Count(I), where Count(IJ) is the number of times IJ occurs in the text set corresponding to the target domain and Count(I) is the number of times I occurs in it. The text set corresponding to the target domain may be the same as or different from the target text set; that is, there may be multiple text sets corresponding to the target domain. For example, text set A (the target text set) may be used when performing text recognition, and text set B may be used when performing text processing.
Step S608: obtain the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain.
Specifically, the connection strength of a word relation chain indicates how likely the words of the chain are to combine into a sentence: the greater the connection strength, the more likely the chain forms a sentence. The connection strength of a word relation chain may be obtained by combining its transition probabilities. For example, assuming a word relation chain is "A → B → C → D", its connection strength is P(ABCD), which may be calculated as shown in formula (5), where P(A) is the probability that A is the first word of a sentence, P(B|A) is the probability of transferring from the preceding word A to the current word B, P(C|B) is the probability of transferring from the preceding word B to the current word C, P(D|C) is the probability of transferring from the preceding word C to the current word D, and P(D) is the probability that D is the last word of a sentence.
P(ABCD) = P(A) * P(B|A) * P(C|B) * P(D|C) * P(D)    (5)
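Formula (5) can be evaluated for a chain of any length with a short routine such as the one below. This is a sketch under stated assumptions: the three probability tables are plain dictionaries, missing entries are treated as probability 0, and the function name `connection_strength` and the toy values are hypothetical.

```python
def connection_strength(chain, start_prob, transition_prob, end_prob):
    """Multiply start, transition, and end probabilities along a chain.

    Follows formula (5): P(first word) * product of P(cur | prev) * P(last word).
    Missing entries are treated as probability 0 (an assumption of this sketch).
    """
    strength = start_prob.get(chain[0], 0.0)
    for prev, cur in zip(chain, chain[1:]):
        strength *= transition_prob.get((prev, cur), 0.0)
    return strength * end_prob.get(chain[-1], 0.0)

# Toy probabilities (hypothetical, for illustration only).
start = {"A": 0.5}
trans = {("A", "B"): 0.4, ("B", "C"): 0.5, ("C", "D"): 0.2}
end = {"D": 0.5}

print(connection_strength(["A", "B", "C", "D"], start, trans, end))
```

For long chains the product can become very small; a common alternative is to sum the logarithms of the probabilities instead, which preserves the ranking of chains.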
In one embodiment, the probabilities are calculated from word occurrence counts. For example, P(A) may be equal to C(A)/C(ALL), or may be equal to C(A")/C(ALL), where C(A) is the number of occurrences of A in the text set corresponding to the target domain, C(A") is the number of times A is the first word of a sentence in that text set, and C(ALL) is the number of words in that text set. The word occurrence counts may be stored in advance and updated dynamically online as the text set corresponding to the target domain changes, so that the connection strengths of the relation chains are updated accordingly. For example, new text in the application corresponding to the target domain may be obtained at preset intervals, and the occurrence frequency of each word in the new text and the total number of words may be calculated, so as to update the word transition probabilities in the n-gram model. In this way, when a new application is put into use, the adjustment of initial input text in the application becomes more accurate as text accumulates in the application.
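The online update of occurrence counts can be sketched as an incrementally updated count table. This is a minimal illustration for a 2-gram model, assuming new text arrives already segmented into words; the class name `NGramCounts` and its methods are hypothetical and are not part of the embodiment.

```python
from collections import Counter

class NGramCounts:
    """Word and word-pair counts that can be updated as new text arrives."""

    def __init__(self):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.total_words = 0

    def update(self, sentences):
        """Add counts from newly collected, already segmented sentences."""
        for words in sentences:
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            self.total_words += len(words)

    def transition_prob(self, prev, cur):
        """P(cur | prev) = C(prev cur) / C(prev)."""
        c_prev = self.unigrams[prev]
        return self.bigrams[(prev, cur)] / c_prev if c_prev else 0.0

    def word_prob(self, word):
        """P(word) = C(word) / C(ALL)."""
        return self.unigrams[word] / self.total_words if self.total_words else 0.0

counts = NGramCounts()
counts.update([["Licaitong", "is", "what", "product"]])
before = counts.transition_prob("is", "what")
# Later, text newly accumulated in the application is folded in.
counts.update([["is", "product"]])
after = counts.transition_prob("is", "what")
print(before, after)
```

Because only counts are stored, each periodic update is an addition to the existing tables, and probabilities computed afterwards reflect the new text without retraining from scratch.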
Step S610: select a target word relation chain from the word relation chain set according to the connection strengths of the word relation chains, and take the text corresponding to the target word relation chain as the target input text.
Specifically, after the connection strengths of the word relation chains are obtained, the word relation chain with the greatest connection strength may be selected as the target word relation chain, and the text corresponding to the target word relation chain is taken as the target input text. Multiple word relation chains may also be selected as target word relation chains, for example the top z word relation chains when sorted by connection strength in descending order. Here z is an integer greater than 1 whose value may be set as needed, for example 3.
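Selecting the strongest chain, or the top z chains, can be sketched as below. This assumes the connection strengths have already been computed and stored in a dictionary keyed by chain; the function name `select_target_chains` and the toy strengths are hypothetical.

```python
import heapq

def select_target_chains(chain_strengths, z=1):
    """Return the z chains with the largest connection strength, strongest first."""
    top = heapq.nlargest(z, chain_strengths.items(), key=lambda kv: kv[1])
    return [chain for chain, _ in top]

# Toy strengths (hypothetical, for illustration only).
strengths = {
    ("financing", "simultaneously", "what", "product"): 0.003,
    ("financing", "colleague", "what", "product"): 0.006,
    ("Licaitong", "is", "what", "product"): 0.06,
    ("Licaitong", "when", "what", "product"): 0.004,
}

print(select_target_chains(strengths))
print(select_target_chains(strengths, z=3))
```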
In one embodiment, when calculating the connection strengths of the word relation chains, the connection strength of every word relation chain in the word relation chain set may be calculated, or the connection strengths of only some of the word relation chains may be calculated, for example by using the Viterbi algorithm.
In one embodiment, if a 2-gram model is used, the target word relation chain may be obtained using the Viterbi algorithm. The Viterbi algorithm assumes that, on entering state i+1 from state i, if the shortest path from the starting point S to each node of state i has already been found, then to calculate the shortest path from S to a node Xi+1 of state i+1 it is only necessary to consider the shortest paths from S to all k nodes of the preceding state i, together with the distance from each of these k nodes to Xi+1. In embodiments of the present invention, when the Viterbi algorithm is used, the words of the word relation chains may serve as the states of the Viterbi algorithm and the transition probabilities as the weights of the paths. The objective of the algorithm is then the maximum connection strength, which is calculated from the transition probabilities corresponding to the word relation chains. Accordingly, when calculating the connection strengths, the maximum connection strength from the starting point of the word relation chains to each previous node in the word relation chain set is calculated first, followed by the transition probability from each previous node to the current node. The maximum connection strength of each previous node is multiplied by the corresponding transition probability to obtain the connection strengths from the starting point to the current node, and the largest of these is selected as the maximum connection strength of the current node. If a next node exists after the current node, the next node is taken as the current node and the above calculation of the maximum connection strength is repeated, until the last node of the word relation chain is reached.
Taking the word relation chains of Fig. 7 as an example, with the target relation chain obtained by the Viterbi algorithm: in Fig. 8, S0 and S1 denote the probabilities corresponding to "financing" and "Licaitong" (the candidate word with pinyin "li, cai, tong") respectively. The letter above each horizontal line "-" in a word relation chain denotes the transition probability from the word before the line to the word after it; for example, W1 denotes the transition probability of transferring from "financing" to "colleague". The maximum connection strengths from the starting point of the relation chains to the second nodes "simultaneously", "colleague", "is", and "when" are therefore S0*W0, S0*W1, S1*W4, and S1*W5 respectively. When calculating the maximum connection strength from the starting point to the third node "what", S0*W0 is multiplied by W2, S0*W1 by W3, S1*W4 by W6, and S1*W5 by W7. Assuming the maximum connection strength obtained is S1*W4*W6, the best path to the "what" node by connection strength is "Licaitong → is". Since the last node of every word relation chain is "product", the maximum connection strength is S1*W4*W6*W8. The target word relation chain is "Licaitong → is → what → product", and the target input text is therefore "What product is Licaitong".
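The Fig. 8 walk-through can be reproduced with a small Viterbi routine over a layered word lattice. This is a sketch under stated assumptions: candidate words are grouped into one list per position, pairs without an edge in the lattice are given probability 0, and the end-of-sentence factor is omitted as in the walk-through. The function name `viterbi` and all numeric values standing in for S0, S1, and W0 to W8 are hypothetical.

```python
def viterbi(layers, start_prob, transition_prob):
    """Find the chain with maximum connection strength in a layered lattice.

    layers: list of lists of candidate words, one list per position.
    start_prob: dict word -> probability of starting a chain with that word.
    transition_prob: dict (prev, cur) -> P(cur | prev); missing pairs count as 0.
    Returns (strength, chain).
    """
    # best[w] = (max strength of any chain ending at w, that chain)
    best = {w: (start_prob.get(w, 0.0), [w]) for w in layers[0]}
    for layer in layers[1:]:
        new_best = {}
        for cur in layer:
            # Extend the best chain of every previous node; keep the strongest.
            new_best[cur] = max(
                ((s * transition_prob.get((prev, cur), 0.0), path + [cur])
                 for prev, (s, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])

layers = [
    ["financing", "Licaitong"],
    ["simultaneously", "colleague", "is", "when"],
    ["what"],
    ["product"],
]
start = {"financing": 0.6, "Licaitong": 0.4}           # S0, S1
trans = {
    ("financing", "simultaneously"): 0.1,              # W0
    ("financing", "colleague"): 0.1,                   # W1
    ("simultaneously", "what"): 0.1,                   # W2
    ("colleague", "what"): 0.2,                        # W3
    ("Licaitong", "is"): 0.5,                          # W4
    ("Licaitong", "when"): 0.05,                       # W5
    ("is", "what"): 0.6,                               # W6
    ("when", "what"): 0.3,                             # W7
    ("what", "product"): 0.5,                          # W8
}

strength, chain = viterbi(layers, start, trans)
print(" -> ".join(chain), strength)
```

At each position only one best chain per node is kept, so the work grows with the number of edges in the lattice rather than with the number of complete chains.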
In one embodiment, the target word relation chain may also be calculated using a language model of order 3 or higher. When such a model is used, in order to reduce the number of connection strength calculations, when calculating the connection strength corresponding to a node Xi+1 of state i+1 from the starting point S, the top g connection strengths of the previous state, that is, the i-th state, may be obtained, and the connection strength corresponding to state i+1 is then calculated from these top g connection strengths and the transition probabilities from the i-th state to state i+1. The value of g may be determined as needed.
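Keeping only the top g connection strengths at each state resembles beam-style pruning, which can be sketched as follows for a 3-gram model. This is an illustration under stated assumptions rather than the embodiment's implementation: the probability table is keyed by a tuple of up to two preceding words plus the current word, missing entries count as 0, and the function name `top_g_search` and the toy values are hypothetical.

```python
import heapq

def top_g_search(layers, start_prob, ngram_prob, g):
    """Keep only the g strongest partial chains at each position.

    ngram_prob: dict (context, cur) -> P(cur | context), where context is a
    tuple of up to two preceding words. Returns a list of (strength, chain).
    """
    beams = [(start_prob.get(w, 0.0), (w,)) for w in layers[0]]
    beams = heapq.nlargest(g, beams, key=lambda b: b[0])
    for layer in layers[1:]:
        candidates = []
        for strength, chain in beams:
            context = chain[-2:]  # the preceding one or two words
            for cur in layer:
                p = ngram_prob.get((context, cur), 0.0)
                candidates.append((strength * p, chain + (cur,)))
        # Prune to the g strongest partial chains before moving on.
        beams = heapq.nlargest(g, candidates, key=lambda b: b[0])
    return beams

layers = [["financing", "Licaitong"], ["is", "when"], ["what"]]
start = {"financing": 0.6, "Licaitong": 0.4}
probs = {
    (("Licaitong",), "is"): 0.5,
    (("Licaitong",), "when"): 0.05,
    (("Licaitong", "is"), "what"): 0.6,
    (("Licaitong", "when"), "what"): 0.3,
}

beams = top_g_search(layers, start, probs, g=2)
print(beams[0])
```

A smaller g reduces the number of multiplications but may discard a chain that would have become the strongest later, so g trades accuracy against computation.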
In one embodiment, the step of obtaining the initial input text may include: obtaining a query statement input in the application, and taking the query statement as the initial input text. The text processing method may further include the following steps: obtaining a query request, the query request including the target input text corresponding to the query statement; and obtaining query response data obtained according to the target input text.
Specifically, the query statement may be input on a query interface of the application; for example, it may be entered in the input box of a session interface of the application to consult customer service. The input may be made by voice, by text, or in other ways. If the input is made by voice, the voice may be recognized to obtain the query statement. After the initial input text is obtained, the text processing method provided by the embodiments of the present invention is executed to obtain the target input text. The query response data is the answer statement corresponding to the target input text, and may be set in advance. For example, a product introduction text describing the Licaitong product may be set; after the target input text is obtained, the corresponding product introduction text is obtained as the query response data. The query request may be triggered by receiving a user operation after the target input text is obtained, or may be triggered automatically by the server.
In one embodiment, the query request may be triggered by receiving a user operation after the target input text is obtained. For example, as shown in Fig. 9, when the user enters "it is careful product that inner wealth is logical" in the input box, the terminal or the server may execute the text processing method provided by the embodiments of the present invention to obtain the target input text "What product is Licaitong". After obtaining the target input text, the terminal displays it above the input box. If a selection operation by the user on "What product is Licaitong" is received, a query request may be sent to the server; the server receives the query request, obtains the corresponding query response data, and returns it to the terminal, and the terminal displays the query response data.
In one embodiment, the query request may be triggered automatically by the server. For example, after receiving the initial input text, the terminal sends the initial input text to the server; the server executes the text processing method provided by the embodiments of the present invention and, after obtaining the target input text, triggers a query request and obtains the corresponding query response data according to the target input text.
In one embodiment, as shown in Fig. 10, the text processing method may further include:
Step S1002: detect the target type corresponding to the target input text.
Specifically, the type corresponding to the target input text is obtained from candidate types, which may be set as needed. For example, the candidate types may include a normal type and an abnormal type, or an advertisement type and a non-advertisement type. After the target input text is obtained, it may be detected whether the words of the target input text include a preset word; if so, the type corresponding to the preset word is taken as the target type. Alternatively, the target input text may be input into a pre-trained artificial intelligence machine learning model for type discrimination to obtain the corresponding target type. For example, assume the initial input text is "bone valence prediction is accurate, benefit is high, please add prestige 123456789", and the corresponding target input text is "stock price prediction is accurate, returns are high, please add WeChat 123456789". If the target type were detected from the initial input text, the initial input text might be judged to be of the non-advertisement type, whereas if detection is performed on the target input text, the detected target type is accurate, namely the advertisement type.
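The preset-word variant of type detection can be sketched as follows, showing why detection on the corrected text succeeds where detection on the raw text fails. This is a minimal illustration, assuming both texts are already segmented into words; the function name `detect_type`, the preset word list, and the word segmentation are hypothetical.

```python
def detect_type(words, preset_words_by_type, default_type="non-advertisement"):
    """Return the first type whose preset words appear in the text."""
    for type_name, preset_words in preset_words_by_type.items():
        if any(word in preset_words for word in words):
            return type_name
    return default_type

# Hypothetical preset words for the advertisement type.
presets = {"advertisement": {"stock price", "WeChat"}}

initial_words = ["bone valence", "prediction", "accurate", "please", "add",
                 "prestige", "123456789"]
target_words = ["stock price", "prediction", "accurate", "please", "add",
                "WeChat", "123456789"]

print(detect_type(initial_words, presets))
print(detect_type(target_words, presets))
```

Text obfuscated with homophones contains none of the preset words, so it is only after error correction that the advertisement type is detected.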
Step S1004: when the target type corresponding to the target input text is a preset type, filter the initial input text.
Specifically, filtering may mean hiding the initial input text on the display interface on which it would be shown, or deleting the initial input text from the application, and may be configured as needed.
As shown in Fig. 11, the text processing method provided by the embodiments of the present invention is described below, taking error correction of the initial input text as an example.
1. The terminal receives the initial input text entered by the user through the customer service session interface of the application and sends the initial input text to the server.
2. The server performs error position detection on the initial input text to obtain an error position set. Because a sentence that contains no error in the general domain may be erroneous in the target domain corresponding to the application, all positions may be treated as error positions.
3. The server obtains the candidate input words corresponding to the input word at each error position according to a word association dictionary, obtaining a candidate input word set. The domain terms of the word association dictionary may be obtained by a domain term identification module. The domain term identification module may perform text recognition at preset intervals to obtain new domain terms, obtain the mapping characters corresponding to the new domain terms, and store the new domain terms in the dictionary in association with the corresponding mapping characters. The domain term identification module thus supports online updating of words, so that the domain terms of the word association dictionary are updated as the content of the application grows.
4. After obtaining the candidate input word set, the server constructs word relation chains according to the composition relationship of the words of the initial input text. An n-gram model is used to select the best result from the word relation chains; for example, the target word relation chain with the greatest calculated connection strength is taken as the best result. The text corresponding to this target word relation chain is the final error correction result, and the words of the target word relation chain make up the target input text. The occurrence frequency of each word in the n-gram model may be updated as the text in the application changes, so as to update the n-gram model online.
6. The server queries the corresponding answer statement according to the target input text and returns the answer statement to the terminal.
7. The terminal displays the answer statement on the customer service session interface.
As shown in Fig. 12, in one embodiment, a text recognition apparatus is provided. The text recognition apparatus may be integrated in the above server 120 and terminal 110, and may specifically include a target candidate word obtaining module 1202, a set obtaining module 1204, an importance calculation module 1206, a degree of correlation obtaining module 1208, and a domain term obtaining module 1210.
The target candidate word obtaining module 1202 is configured to obtain a text to be recognized and obtain target candidate words according to the characters in the text to be recognized;
the set obtaining module 1204 is configured to obtain a general domain text set and a target text set of the target domain corresponding to the text to be recognized;
the importance calculation module 1206 is configured to calculate the target importance of the target candidate word in the target text set and its reference importance in the general domain text set;
the degree of correlation obtaining module 1208 is configured to calculate the target degree of correlation between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word;
the domain term obtaining module 1210 is configured to take the target candidate word as a domain term of the target domain according to the target degree of correlation.
In one embodiment, as shown in Fig. 13, the text recognition apparatus further includes:
a mapping character determining module 1302, configured to determine the mapping characters corresponding to the domain term according to a mapping relationship, the mapping relationship including at least one of a similar-shape mapping and a similar-sound mapping;
an association relationship establishing module 1304, configured to establish the association relationship between the domain term and the mapping characters.
In one embodiment, the target candidate word obtaining module 1202 is configured to: generate an initial candidate word set according to the adjacency of the characters in the text to be recognized; calculate the word association degree and the word independence degree, in the target text set, of each initial candidate word in the initial candidate word set; calculate the word generation degree of each initial candidate word according to the word association degree and the word independence degree; and select target candidate words from the initial candidate word set according to the word generation degree of each initial candidate word.
In one embodiment, the text recognition apparatus further includes: a word forming module, configured to, when the word independence degree corresponding to an initial candidate word is less than a first threshold, form a new initial candidate word from the initial candidate word and the characters adjacent to it in the text to be recognized; and an adding module, configured to add the new initial candidate word to the initial candidate word set.
In one embodiment, the degree of correlation obtaining module 1208 is configured to: calculate the initial degree of correlation between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; determine the corresponding correlation confidence according to the number of occurrences of the target candidate word in the target text set; and obtain the target degree of correlation according to the initial degree of correlation and the correlation confidence.
As shown in Fig. 14, in one embodiment, a text processing apparatus is provided. The text processing apparatus may be integrated in the above server 120 and terminal 110, and may specifically include an initial input text obtaining module 1402, an association relationship obtaining module 1404, a target domain word obtaining module 1406, and a target input text obtaining module 1408.
The initial input text obtaining module 1402 is configured to obtain the initial input text input through the application;
the association relationship obtaining module 1404 is configured to obtain the association relationship corresponding to the target domain of the application, the association relationship being the association relationship between domain terms and mapping characters, and the domain terms being identified according to the text to be recognized of the application, the general domain text set, and the target text set corresponding to the target domain;
the target domain word obtaining module 1406 is configured to determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship;
the target input text obtaining module 1408 is configured to adjust the initial input text according to the target domain word to obtain the target input text.
In one embodiment, the target input text obtaining module 1408 is configured to: obtain each candidate input word corresponding to the initial input text; construct a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words, and the target domain words; calculate the transition probability of transferring from the preceding word to the current word in each word relation chain; obtain the connection strength of each word relation chain according to its transition probabilities; and select a target word relation chain from the word relation chain set according to the connection strengths of the word relation chains, taking the text corresponding to the target word relation chain as the target input text.
In one embodiment, the initial input text obtaining module is configured to: obtain a query statement input through the application, and take the query statement as the initial input text.
The text processing apparatus further includes: a query request module, configured to obtain a query request, the query request including the target input text corresponding to the query statement; and a query response data obtaining module, configured to obtain the query response data obtained according to the target input text.
In one embodiment, the text processing apparatus further includes: a target type obtaining module, configured to detect the target type corresponding to the target input text; and a filtering module, configured to filter the initial input text when the target type corresponding to the target input text is a preset type.
Fig. 15 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 in Fig. 1. As shown in Fig. 15, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement at least one of the text recognition method and the text processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform at least one of the text recognition method and the text processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball, or trackpad provided on the housing of the computer device, or an external keyboard, trackpad, or mouse.
Fig. 16 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in Fig. 1. As shown in Fig. 16, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement at least one of the text recognition method and the text processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform at least one of the text recognition method and the text processing method.
Those skilled in the art will understand that the structures shown in Figs. 15 and 16 are merely block diagrams of the parts of the structure relevant to the solution of the present application and do not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figures, may combine certain components, or may have a different arrangement of components.
In one embodiment, the text recognition apparatus provided by the present application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Figs. 15 and 16. The memory of the computer device may store the program modules that make up the text recognition apparatus, for example the target candidate word obtaining module 1202, the set obtaining module 1204, the importance calculation module 1206, the degree of correlation obtaining module 1208, and the domain term obtaining module 1210 shown in Fig. 12. The computer program formed by these program modules causes the processor to perform the steps of the text recognition method of the embodiments of the present application described in this specification.
For example, the computer device shown in Fig. 16 may, through the target candidate word obtaining module 1202 of the text recognition apparatus shown in Fig. 12, obtain a text to be recognized and obtain target candidate words according to the characters in the text to be recognized; through the set obtaining module 1204, obtain a general domain text set and a target text set of the target domain corresponding to the text to be recognized; through the importance calculation module 1206, calculate the target importance of the target candidate word in the target text set and its reference importance in the general domain text set; through the degree of correlation obtaining module 1208, calculate the target degree of correlation between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and through the domain term obtaining module 1210, take the target candidate word as a domain term of the target domain according to the target degree of correlation.
In one embodiment, the text processing apparatus provided by the present application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Figs. 15 and 16. The memory of the computer device may store the program modules that make up the text processing apparatus, for example the initial input text obtaining module 1402, the association relationship obtaining module 1404, the target domain word obtaining module 1406, and the target input text obtaining module 1408 shown in Fig. 14. The computer program formed by these program modules causes the processor to perform the steps of the text processing method of the embodiments of the present application described in this specification.
For example, the computer device shown in Fig. 16 may, through the initial input text obtaining module 1402 of the text processing apparatus shown in Fig. 14, obtain the initial input text input through the application; through the association relationship obtaining module 1404, obtain the association relationship corresponding to the target domain of the application, the association relationship being the association relationship between domain terms and mapping characters, and the domain terms being identified according to the text to be recognized of the application, the general domain text set, and the target text set corresponding to the target domain; through the target domain word obtaining module 1406, determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship; and through the target input text obtaining module 1408, adjust the initial input text according to the target domain word to obtain the target input text.
In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps: obtaining a text to be recognized, and obtaining target candidate words according to the characters in the text to be recognized; obtaining a general domain text set and a target text set of the target domain corresponding to the text to be recognized; calculating the target importance of the target candidate word in the target text set and its reference importance in the general domain text set; calculating the target degree of correlation between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and taking the target candidate word as a domain term of the target domain according to the target degree of correlation.
In one embodiment, after the processor takes the target candidate word as a domain term of the target domain according to the target degree of correlation, the computer program further causes the processor to perform the following steps: determining the mapping characters corresponding to the domain term according to a mapping relationship, the mapping relationship including at least one of a similar-shape mapping and a similar-sound mapping; and establishing the association relationship between the domain term and the mapping characters.
In one embodiment, obtaining the target candidate word according to the characters in the text to be identified, as performed by the processor, includes: generating an initial candidate word set according to the adjacency relationship of the characters in the text to be identified; calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set; calculating a word generation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the target candidate word from the initial candidate word set according to the word generation degree of each initial candidate word.
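The candidate screening above can be sketched as follows. The measures are assumptions made for illustration, since this paragraph does not define them: the word association degree is taken to be the minimum pointwise mutual information over all two-way splits of the candidate, the word independence degree is the smaller of the left-neighbor and right-neighbor entropies, and the word generation degree is their product. All function names and the threshold are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(texts, max_len=3):
    """Count every run of 1..max_len adjacent characters (initial candidates)."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def association_degree(word, counts, total):
    """Minimum pointwise mutual information over all two-way splits (assumed)."""
    p_word = counts[word] / total
    return min(
        math.log(p_word / ((counts[word[:k]] / total) * (counts[word[k:]] / total)))
        for k in range(1, len(word))
    )


def entropy(counter):
    """Shannon entropy (natural log) of a counter; 0.0 for an empty counter."""
    n = sum(counter.values())
    return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0


def independence_degree(word, texts):
    """Smaller of the left- and right-neighbor entropies (assumed measure)."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[text[start - 1]] += 1
            if end < len(text):
                right[text[end]] += 1
            start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


def target_candidates(texts, max_len=3, threshold=1.0):
    """Screen initial candidates by word generation degree (assumed: product)."""
    counts = ngram_counts(texts, max_len)
    total = sum(len(text) for text in texts)
    initial = [w for w in counts if len(w) >= 2]
    return [
        w for w in initial
        if association_degree(w, counts, total) * independence_degree(w, texts) > threshold
    ]
```

A candidate that always appears next to the same neighbor has zero neighbor entropy, so under the product assumption it is screened out no matter how strongly its characters co-occur.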
In one embodiment, calculating the word association degree of each initial candidate word of the initial candidate word set in the target text set, as performed by the processor, includes: determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set; determining a word initial association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and calculating a word target association degree according to the association confidence and the word initial association degree corresponding to the initial candidate word.
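As a rough illustration of why a confidence term helps, the sketch below scales an initial association score by a factor that grows with the number of occurrences. The specific forms are assumptions, not the embodiment's formulas: the confidence is `n / (n + pseudo_count)`, the initial association is a log ratio of the word's probability to the product of its parts' probabilities, and the two are multiplied. The names and the pseudo-count are hypothetical.

```python
import math


def association_confidence(occurrences, pseudo_count=5.0):
    """Grows from 0 toward 1 as the candidate is seen more often (assumed form)."""
    return occurrences / (occurrences + pseudo_count)


def initial_association(p_word, p_parts):
    """Log ratio of the word's probability to the product of its parts'
    probabilities (assumed: pointwise mutual information)."""
    return math.log(p_word / math.prod(p_parts))


def target_association(occurrences, p_word, p_parts, pseudo_count=5.0):
    """Word target association degree as confidence times initial association."""
    return (association_confidence(occurrences, pseudo_count)
            * initial_association(p_word, p_parts))
```

Under these assumptions, two candidates with the same probabilities but different occurrence counts receive different scores, so a string seen only once cannot outrank a well-attested word on a lucky co-occurrence alone.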
In one embodiment, the computer program further causes the processor to perform the following steps: when the word independence degree corresponding to an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and adding the new initial candidate word to the initial candidate word set.
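The extension step can be sketched as below. The sketch assumes the word independence degrees have already been computed and are passed in as a dictionary; the function name and that interface are hypothetical, and extending by both the left and the right neighbor is a choice made for the example.

```python
def extend_candidates(candidates, text, independence, first_threshold):
    """Add longer candidates for words whose independence degree is too low.

    `independence` maps each candidate to its word independence degree; how
    that degree is computed is outside this sketch.
    """
    extended = set(candidates)
    for word in candidates:
        if independence[word] >= first_threshold:
            continue  # independent enough, keep as is
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                extended.add(text[start - 1] + word)  # absorb left neighbor
            if end < len(text):
                extended.add(word + text[end])        # absorb right neighbor
            start = text.find(word, start + 1)
    return extended
```

A low independence degree suggests the candidate is a fragment of a longer word, so growing it by one adjacent character gives the longer word a chance to enter the candidate set and be scored.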
In one embodiment, calculating the target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word, as performed by the processor, includes: calculating an initial degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; determining a corresponding degree-of-correlation confidence according to the number of occurrences of the target candidate word in the target text set; and obtaining the target degree of correlation according to the initial degree of correlation and the degree-of-correlation confidence.
In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps: obtaining an initial input text entered through an application; obtaining an association relationship corresponding to the target domain of the application, where the association relationship is an association between a domain word and a mapping character, and the domain word is identified from a text to be identified, a general-domain text set, and a target text set corresponding to the target domain; determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text.
In one embodiment, adjusting the initial input text according to the target domain word to obtain the target input text, as performed by the processor, includes: obtaining each candidate input word corresponding to the initial input text; constructing a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words, and the target domain word; calculating, in each word relation chain, the transition probability of transferring from the preceding word to the current word; obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and screening a target word relation chain from the word relation chain set according to the connection strength of each word relation chain, and using the text corresponding to the target word relation chain as the target input text.
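The chain selection above can be sketched as follows. This is a minimal illustration under assumptions: each position of the input has a list of alternative words (the original word, candidate input words, and any associated target domain word), transition probabilities come from a small hand-written bigram table, and the connection strength is taken to be the sum of log transition probabilities. The function names, the `<s>` start marker, and the probability floor are hypothetical.

```python
import math
from itertools import product


def connection_strength(chain, transition, floor=1e-6):
    """Sum of log transition probabilities along the chain (assumed aggregation).

    Unseen transitions fall back to a small floor probability.
    """
    words = ["<s>"] + list(chain)
    return sum(
        math.log(transition.get((prev, cur), floor))
        for prev, cur in zip(words, words[1:])
    )


def best_chain(slots, transition):
    """Enumerate every word relation chain and return the strongest one.

    `slots[i]` lists the words that may fill position i of the input.
    """
    chains = product(*slots)
    return max(chains, key=lambda chain: connection_strength(chain, transition))


# Example: the user typed "压瑟出装"; "亚瑟" is the associated target domain word.
slots = [["压瑟", "亚瑟"], ["出装"]]
transition = {
    ("<s>", "亚瑟"): 0.2,
    ("亚瑟", "出装"): 0.3,
    ("<s>", "压瑟"): 0.001,
}
target_input_text = "".join(best_chain(slots, transition))
```

Enumerating every chain is exponential in the number of positions; it keeps the sketch short, and a dynamic-programming search such as Viterbi would be the usual replacement for longer inputs.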
In one embodiment, obtaining the initial input text entered through the application, as performed by the processor, includes: obtaining a query statement entered through the application, and using the query statement as the initial input text. The computer program further causes the processor to perform the following steps: obtaining a query request, where the query request includes the target input text corresponding to the query statement; and obtaining query response data obtained according to the target input text.
In one embodiment, the computer program further causes the processor to perform the following steps: detecting the target type corresponding to the target input text; and filtering the initial input text when the target type corresponding to the target input text is a preset type.
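A small sketch of the filtering step is given below. It assumes, purely for illustration, that the target type is detected by looking up domain words in the target input text and that filtering means dropping the message; the type names, the lookup table, and the function names are hypothetical.

```python
# Hypothetical set of types that trigger filtering.
PRESET_TYPES = {"advertisement"}

# Hypothetical lookup from domain word to text type.
TYPE_BY_DOMAIN_WORD = {"代练": "advertisement"}


def detect_type(target_input_text):
    """Return the type of the first known domain word found, else "normal"."""
    for word, text_type in TYPE_BY_DOMAIN_WORD.items():
        if word in target_input_text:
            return text_type
    return "normal"


def filter_input(initial_input_text, target_input_text):
    """Return None when the text is filtered out, else the initial input text."""
    if detect_type(target_input_text) in PRESET_TYPES:
        return None
    return initial_input_text
```

The type is detected on the target input text rather than the initial input text, so an input that disguises "代练" as the near-sound "带练" is still caught once it has been adjusted.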
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining the text to be identified of the application, and obtaining a target candidate word according to the characters in the text to be identified; obtaining the general-domain text set and the target text set; calculating a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set; calculating a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target degree of correlation.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target degree of correlation, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a near-shape mapping and a near-sound mapping; and establishing an association relationship between the domain word and the corresponding mapping character.
In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining a text to be identified, and obtaining a target candidate word according to the characters in the text to be identified; obtaining a general-domain text set and a target text set of the target domain corresponding to the text to be identified; calculating a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set; calculating a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target degree of correlation.
In one embodiment, after the processor takes the target candidate word as a domain word of the target domain according to the target degree of correlation, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a near-shape mapping and a near-sound mapping; and establishing an association relationship between the domain word and the mapping character.
In one embodiment, obtaining the target candidate word according to the characters in the text to be identified, as performed by the processor, includes: generating an initial candidate word set according to the adjacency relationship of the characters in the text to be identified; calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set; calculating a word generation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the target candidate word from the initial candidate word set according to the word generation degree of each initial candidate word.
In one embodiment, calculating the word association degree of each initial candidate word of the initial candidate word set in the target text set, as performed by the processor, includes: determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set; determining a word initial association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and calculating a word target association degree according to the association confidence and the word initial association degree corresponding to the initial candidate word.
In one embodiment, the computer program further causes the processor to perform the following steps: when the word independence degree corresponding to an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and adding the new initial candidate word to the initial candidate word set.
In one embodiment, calculating the target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word, as performed by the processor, includes: calculating an initial degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; determining a corresponding degree-of-correlation confidence according to the number of occurrences of the target candidate word in the target text set; and obtaining the target degree of correlation according to the initial degree of correlation and the degree-of-correlation confidence.
In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining an initial input text entered through an application; obtaining an association relationship corresponding to the target domain of the application, where the association relationship is an association between a domain word and a mapping character, and the domain word is identified from a text to be identified, a general-domain text set, and a target text set corresponding to the target domain; determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text.
In one embodiment, adjusting the initial input text according to the target domain word to obtain the target input text, as performed by the processor, includes: obtaining each candidate input word corresponding to the initial input text; constructing a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words, and the target domain word; calculating, in each word relation chain, the transition probability of transferring from the preceding word to the current word; obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and screening a target word relation chain from the word relation chain set according to the connection strength of each word relation chain, and using the text corresponding to the target word relation chain as the target input text.
In one embodiment, obtaining the initial input text entered through the application, as performed by the processor, includes: obtaining a query statement entered through the application, and using the query statement as the initial input text. The computer program further causes the processor to perform the following steps: obtaining a query request, where the query request includes the target input text corresponding to the query statement; and obtaining query response data obtained according to the target input text.
In one embodiment, the computer program further causes the processor to perform the following steps: detecting the target type corresponding to the target input text; and filtering the initial input text when the target type corresponding to the target input text is a preset type.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining the text to be identified of the application, and obtaining a target candidate word according to the characters in the text to be identified; obtaining the general-domain text set and the target text set; calculating a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set; calculating a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target degree of correlation.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target degree of correlation, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a near-shape mapping and a near-sound mapping; and establishing an association relationship between the domain word and the corresponding mapping character.
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict limit on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A text recognition method, comprising:
obtaining a text to be identified, and obtaining a target candidate word according to the characters in the text to be identified;
obtaining a general-domain text set and a target text set of the target domain corresponding to the text to be identified;
calculating a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set;
calculating a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and
taking the target candidate word as a domain word of the target domain according to the target degree of correlation.
2. The method according to claim 1, wherein after taking the target candidate word as a domain word of the target domain according to the target degree of correlation, the method further comprises:
determining a mapping character corresponding to the domain word according to a mapping relationship, the mapping relationship including at least one of a near-shape mapping and a near-sound mapping; and
establishing an association relationship between the domain word and the mapping character.
3. The method according to claim 1, wherein obtaining the target candidate word according to the characters in the text to be identified comprises:
generating an initial candidate word set according to the adjacency relationship of the characters in the text to be identified;
calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set;
calculating a word generation degree of each initial candidate word according to the word association degree and the word independence degree; and
screening the target candidate word from the initial candidate word set according to the word generation degree of each initial candidate word.
4. The method according to claim 3, wherein calculating the word association degree of each initial candidate word of the initial candidate word set in the target text set comprises:
determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set;
determining a word initial association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and
calculating a word target association degree according to the association confidence and the word initial association degree corresponding to the initial candidate word.
5. The method according to claim 3, further comprising:
when the word independence degree corresponding to the initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and
adding the new initial candidate word to the initial candidate word set.
6. A text processing method, comprising:
obtaining an initial input text;
obtaining an association relationship corresponding to the target domain corresponding to the initial input text, the association relationship being an association between a domain word and a mapping character, and the domain word being identified from a text to be identified corresponding to the target domain, a general-domain text set, and a target text set corresponding to the target domain;
determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and
adjusting the initial input text according to the target domain word to obtain a target input text.
7. The method according to claim 6, wherein adjusting the initial input text according to the target domain word to obtain the target input text comprises:
obtaining each candidate input word corresponding to the initial input text;
constructing a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words, and the target domain word;
calculating, in each word relation chain, the transition probability of transferring from the preceding word to the current word;
obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and
screening a target word relation chain from the word relation chain set according to the connection strength of the word relation chain, and using the text corresponding to the target word relation chain as the target input text.
8. The method according to claim 6, wherein obtaining the initial input text comprises:
obtaining a query statement entered in an application, and using the query statement as the initial input text;
and wherein the method further comprises:
obtaining a query request, the query request including the target input text corresponding to the query statement; and
obtaining query response data obtained according to the target input text.
9. The method according to claim 6, further comprising:
obtaining the text to be identified, and obtaining a target candidate word according to the characters in the text to be identified;
obtaining the general-domain text set and the target text set;
calculating a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set;
calculating a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and
taking the target candidate word as a domain word of the target domain according to the target degree of correlation.
10. A text recognition apparatus, comprising:
a target candidate word obtaining module, configured to obtain a text to be identified and obtain a target candidate word according to the characters in the text to be identified;
a set obtaining module, configured to obtain a general-domain text set and a target text set of the target domain corresponding to the text to be identified;
an importance degree calculating module, configured to calculate a target importance degree of the target candidate word in the target text set and a reference importance degree of the target candidate word in the general-domain text set;
a degree-of-correlation obtaining module, configured to calculate a target degree of correlation between the target candidate word and the target domain according to the target importance degree and the reference importance degree corresponding to the target candidate word; and
a domain word obtaining module, configured to take the target candidate word as a domain word of the target domain according to the target degree of correlation.
11. The apparatus according to claim 10, further comprising:
a mapping character determining module, configured to determine a mapping character corresponding to the domain word according to a mapping relationship, the mapping relationship including at least one of a near-shape mapping and a near-sound mapping; and
an association relationship establishing module, configured to establish an association relationship between the domain word and the mapping character.
12. The apparatus according to claim 10, wherein the target candidate word obtaining module is configured to:
generate an initial candidate word set according to the adjacency relationship of the characters in the text to be identified;
calculate a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set;
calculate a word generation degree of each initial candidate word according to the word association degree and the word independence degree; and
screen the target candidate word from the initial candidate word set according to the word generation degree of each initial candidate word.
13. A text processing apparatus, comprising:
an initial input text obtaining module, configured to obtain an initial input text;
an association relationship obtaining module, configured to obtain an association relationship corresponding to the target domain corresponding to the initial input text, the association relationship being an association between a domain word and a mapping character, and the domain word being identified from a text to be identified corresponding to the target domain, a general-domain text set, and a target text set corresponding to the target domain;
a target domain word obtaining module, configured to determine a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and
a target input text obtaining module, configured to adjust the initial input text according to the target domain word to obtain a target input text.
14. A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of at least one of the text recognition method according to any one of claims 1 to 5 and the text processing method according to any one of claims 6 to 9.
15. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of at least one of the text recognition method according to any one of claims 1 to 5 and the text processing method according to any one of claims 6 to 9.
CN201811168737.9A 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium Active CN110162681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811168737.9A CN110162681B (en) 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811168737.9A CN110162681B (en) 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110162681A true CN110162681A (en) 2019-08-23
CN110162681B CN110162681B (en) 2023-04-18

Family

ID=67645117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811168737.9A Active CN110162681B (en) 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110162681B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11238051A (en) * 1998-02-23 1999-08-31 Toshiba Corp Chinese input conversion processor, chinese input conversion processing method and recording medium stored with chinese input conversion processing program
US20040210434A1 (en) * 1999-11-05 2004-10-21 Microsoft Corporation System and iterative method for lexicon, segmentation and language model joint optimization
US20040086179A1 (en) * 2002-11-04 2004-05-06 Yue Ma Post-processing system and method for correcting machine recognized text
CN101572083A (en) * 2008-04-30 2009-11-04 富士通株式会社 Method and device for making up words by using prosodic words
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN106681981A (en) * 2015-11-09 2017-05-17 北京国双科技有限公司 Chinese part-of-speech tagging method and device
CN106708893A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Error correction method and device for search query term
CN107102746A (en) * 2016-02-19 2017-08-29 北京搜狗科技发展有限公司 Candidate word generation method and device, and device for candidate word generation
JP2017151804A (en) * 2016-02-25 2017-08-31 国立研究開発法人情報通信研究機構 Automatic translation feature weight optimization device and computer program for the same
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN110765996B (en) * 2019-10-21 2022-07-29 北京百度网讯科技有限公司 Text information processing method and device
CN111552806A (en) * 2020-04-16 2020-08-18 重庆大学 Method for unsupervised construction of entity set in building field
CN111710328A (en) * 2020-06-16 2020-09-25 北京爱医声科技有限公司 Method, device and medium for selecting training samples of voice recognition model
CN111710328B (en) * 2020-06-16 2024-01-12 北京爱医声科技有限公司 Training sample selection method, device and medium for speech recognition model
CN112101020A (en) * 2020-08-27 2020-12-18 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training key phrase identification model
CN112101020B (en) * 2020-08-27 2023-08-04 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for training key phrase identification model
CN113743409A (en) * 2020-08-28 2021-12-03 北京沃东天骏信息技术有限公司 Text recognition method and device
CN112016305A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN112016305B (en) * 2020-09-09 2023-03-28 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN113744736A (en) * 2021-09-08 2021-12-03 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113744736B (en) * 2021-09-08 2023-12-08 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110162681B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110162681A (en) Text identification, text handling method, device, computer equipment and storage medium
CN111061856B (en) Knowledge perception-based news recommendation method
JP2021089739A (en) Question answering method and language model training method, apparatus, device, and storage medium
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN110121705A (en) System and method for applying pragmatics principles to interaction with visual analytics
CN107220386A (en) Information-pushing method and device
CN110162519A (en) Data clearing method
CN109684627A (en) A kind of file classification method and device
CN107704512A (en) Financial product recommendation method based on social data, electronic device and medium
CN109087205A (en) Public opinion index prediction method and device, computer equipment and readable storage medium
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN112035595A (en) Construction method and device of audit rule engine in medical field and computer equipment
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN108932320A (en) Article search method, apparatus and electronic equipment
CN106610932A (en) Corpus processing method and device and corpus analyzing method and device
CN113987182A (en) Fraud entity identification method, device and related equipment based on security intelligence
CN112328869A (en) User loan willingness prediction method and device and computer system
CN115204971A (en) Product recommendation method and device, electronic equipment and computer-readable storage medium
CN113220900B (en) Modeling Method of Entity Disambiguation Model and Entity Disambiguation Prediction Method
CN114265835A (en) Data analysis method and device based on graph mining and related equipment
CN110389963A (en) Big-data-based channel effect recognition method, device, equipment and storage medium
CN112749238A (en) Search ranking method and device, electronic equipment and computer-readable storage medium
CN106575418A (en) Suggested keywords
CN110008282A (en) Transaction data synchronization interconnection method, device, computer equipment and storage medium
CN113779994B (en) Element extraction method, element extraction device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant