CN110162681A - Text identification, text handling method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110162681A
- Authority
- CN
- China
- Prior art keywords
- target
- text
- word
- degree
- domain
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a text recognition method, a text processing method, an apparatus, a computer device and a storage medium. The text processing method includes: obtaining an initial input text; obtaining the association relation corresponding to the target domain of the initial input text, the association relation being a relation between domain terms and mapping characters, where the domain terms are identified from the text to be identified corresponding to the target domain, a general-domain text set, and the target text set corresponding to the target domain; determining the target domain term corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target domain term to obtain a target input text. The target input text obtained by adjusting for a specific domain in this way is highly accurate.
Description
Technical field
The present invention relates to the field of the Internet, and in particular to a text recognition method, a text processing method, an apparatus, a computer device and a storage medium.
Background
With the rapid development of the Internet, the problem of information overload has become increasingly prominent, and more and more words appear on the network. In many scenarios the information a user inputs needs to be adjusted to the information the user actually intends to input, for example when showing candidate words for input pinyin or when correcting words entered by the user.
Currently, when the intended information has to be determined from what the user inputs, words similar to the input word, or words with similar pinyin, are usually screened from a dictionary. The screening therefore returns a large number of words, many of which have little relevance to what the user actually means, so accuracy is low.
Summary of the invention
In view of the above problem, it is necessary to provide a text recognition method, a text processing method, an apparatus, a computer device and a storage medium. Because the domain terms of a target domain can be identified from the text to be identified, a general-domain text set, and the text set of the target domain corresponding to the text to be identified, the identified domain terms are highly relevant to the target domain, and the accuracy of text recognition and text processing is high.
A text recognition method, the method comprising: obtaining a text to be identified, and obtaining a target candidate word from the characters in the text to be identified; obtaining a general-domain text set and the target text set of the target domain corresponding to the text to be identified; calculating the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set; calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance of the target candidate word; and taking the target candidate word as a domain term of the target domain according to the target relevance.
In one embodiment, calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance of the target candidate word includes: calculating an initial relevance between the target candidate word and the target domain from the target importance and the reference importance; determining a corresponding relevance confidence from the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance from the initial relevance and the relevance confidence.
In one embodiment, the text processing method further includes: detecting the target type corresponding to the target input text; and, when the target type corresponding to the target input text is a preset type, filtering the initial input text.
A text recognition apparatus, the apparatus comprising: a target candidate word obtaining module, configured to obtain a text to be identified and obtain a target candidate word from the characters in the text to be identified; a set obtaining module, configured to obtain a general-domain text set and the target text set of the target domain corresponding to the text to be identified; an importance calculation module, configured to calculate the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set; a relevance obtaining module, configured to calculate the target relevance between the target candidate word and the target domain from the target importance and the reference importance of the target candidate word; and a domain term obtaining module, configured to take the target candidate word as a domain term of the target domain according to the target relevance.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above text recognition method.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above text recognition method.
With the above text recognition method, apparatus, computer device and storage medium, when word recognition is required, a text to be identified is obtained and a target candidate word is obtained from its characters; a general-domain text set and the target text set of the target domain corresponding to the text to be identified are obtained; the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set are calculated; the target relevance between the target candidate word and the target domain is calculated from the target importance and the reference importance; and the target candidate word is taken as a domain term of the target domain according to the target relevance. Because the target candidate word is obtained from the text to be identified, and comparing its importance in the text set of the target domain with its importance in the general-domain text set reflects how strongly it is related to the target domain, accurate domain terms for the target domain corresponding to the text to be identified can be obtained.
A text processing method, the method comprising: obtaining an initial input text; obtaining the association relation corresponding to the target domain of the initial input text, the association relation being a relation between domain terms and mapping characters, where the domain terms are identified from the text to be identified corresponding to the target domain, a general-domain text set, and the target text set corresponding to the target domain; determining the target domain term corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target domain term to obtain a target input text.
A text processing apparatus, the apparatus comprising: an initial input text obtaining module, configured to obtain an initial input text; an association relation obtaining module, configured to obtain the association relation corresponding to the target domain of the initial input text, the association relation being a relation between domain terms and mapping characters, where the domain terms are identified from the text to be identified corresponding to the target domain, a general-domain text set, and the target text set corresponding to the target domain; a target domain term obtaining module, configured to determine the target domain term corresponding to the initial input text according to the initial input text and the association relation; and a target input text obtaining module, configured to adjust the initial input text according to the target domain term to obtain a target input text.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above text processing method.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above text processing method.
With the above text processing method, apparatus, computer device and storage medium, the domain term corresponding to text input in an application can be determined from the relation between domain terms and mapping characters for the application's target domain, and the initial input text is adjusted according to that domain term to obtain the target input text. Because the domain terms are identified from the application's text to be identified, general-domain texts, and texts of the target domain, they are words relevant to the target domain, so the target input text obtained by adjusting for the specific domain is highly accurate.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the text processing method and text recognition method provided in one embodiment;
Fig. 2 is a flow chart of the text recognition method in one embodiment;
Fig. 3A is a flow chart of the text recognition method in one embodiment;
Fig. 3B is a flow chart of establishing the association relation between domain terms and mapping characters in one embodiment;
Fig. 4 is a flow chart of obtaining target candidate words from the characters in the text to be identified in one embodiment;
Fig. 5 is a flow chart of the text processing method in one embodiment;
Fig. 6 is a flow chart of adjusting the initial input text according to the target domain term to obtain the target input text in one embodiment;
Fig. 7 is a schematic diagram of obtaining word relation chains in one embodiment;
Fig. 8 is a schematic diagram of obtaining the target input text according to the transition probabilities of word relation chains in one embodiment;
Fig. 9 is a schematic diagram of displaying the target input text corresponding to the initial input text in one embodiment;
Fig. 10 is a flow chart of the text processing method in one embodiment;
Fig. 11 is a schematic diagram of performing error correction on the initial input text to obtain the target input text in one embodiment;
Fig. 12 is a structural block diagram of the text recognition apparatus in one embodiment;
Fig. 13 is a structural block diagram of the text recognition apparatus in one embodiment;
Fig. 14 is a structural block diagram of the text processing apparatus in one embodiment;
Fig. 15 is a block diagram of the internal structure of the computer device in one embodiment;
Fig. 16 is a block diagram of the internal structure of the computer device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the present invention and are not intended to limit it.
It will be understood that the terms "first", "second" and the like used in this application may describe various elements, but unless otherwise stated these elements are not limited by the terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first threshold could be called a second threshold and, similarly, a second threshold could be called a first threshold.
Fig. 1 is a diagram of the application environment of the text processing method and text recognition method provided in one embodiment. As shown in Fig. 1, the application environment includes a terminal 110 and a server 120.
In one embodiment, when domain terms need to be obtained, the user can select a previously acquired target text set and general-domain text set on the terminal 110 and send a text recognition instruction to the server 120 through the terminal 110, the instruction carrying the target text set and the general-domain text set. After receiving the text recognition instruction, the server 120 obtains the text to be identified and executes the text recognition method provided in the embodiments of the present invention to obtain the domain terms of the target domain. The server 120 stores the domain terms of the target domain in a dictionary.
A domain term is a word specific to a particular domain: it appears frequently in certain domains and rarely in other, unrelated domains. For example, a domain term may be a word whose frequency of occurrence in the specific domain is greater than a first preset frequency and whose frequency of occurrence in the general domain is less than a second preset frequency. The first and second preset frequencies can be set as needed, with the first preset frequency greater than the second. Words such as "thread" and "compiler" are domain terms of the computer field: they generally appear in professional articles about computing and seldom appear in unrelated fields such as medicine. Words such as "LiCaiTong" and "current income" are domain terms of the financial management field and generally appear in articles shown in finance applications. The domain terms of a professional domain can keep changing. For example, "LiCaiTong" is the name of a financial management application. Before the "LiCaiTong" application came to market, this word was not a domain term of the financial management field; as the application came into wider use after its launch, "LiCaiTong" appeared more and more often in articles in that field and became one of its domain terms.
In one embodiment, when a user makes a query, for example to ask the intelligent customer service in a shopping application about a problem, the user enters a query sentence on the terminal 110. The terminal 110 can send the query sentence to the server 120, which takes the query sentence as the initial input text, executes the text processing method provided in the embodiments of the present invention to obtain the target input text, obtains the query response data corresponding to the target input text, and returns the query response data to the terminal 110. The terminal 110 displays the query response data.
It will be understood that the above application environment is only an example and does not limit the text processing method provided in the embodiments of the present invention. In some embodiments there may be other application environments. For example, the text processing method may also be executed by the terminal. The server 120 may also take a post in a forum as the initial input text, execute the text processing method to obtain the target input text, and determine from the target input text whether the post is an advertisement; when the post is determined to be an advertisement, the advertising information is filtered. The server 120 may also store the target text set and the general-domain text set in advance, obtain new text to be identified at preset intervals, and execute the text recognition method.
The server 120 may be an independent physical server, a server cluster composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage and CDN. The terminal 110 may be a smartphone, a tablet computer, a laptop, a desktop computer or the like, but is not limited to these. The terminal 110 and the server 120 can be connected through a network.
As shown in Fig. 2, in one embodiment a text recognition method is proposed. This embodiment is described mainly with the method applied to the terminal 110 or the server 120 in Fig. 1 above. The method may specifically include the following steps:
Step S202: obtain a text to be identified, and obtain target candidate words from the characters in the text to be identified.
Specifically, the text to be identified is the text from which domain terms need to be identified, and it may be obtained according to a recognition instruction. The recognition instruction may be triggered by a real-time user operation or by a preset trigger condition; for example, the preset trigger condition may be to trigger the recognition instruction at preset intervals in order to obtain the text to be identified. The recognition instruction may carry one or more of the text to be identified and an identifier corresponding to the text to be identified. For example, the instruction may carry the storage location of the text to be identified, and the text is then obtained from that location. The target candidate word is obtained from the characters in the text to be identified and may be a word composed of adjacent characters in that text. The number of characters in a target candidate word can be set as needed, for example 2 or 3.
In one embodiment, the text to be identified may be obtained from the content of the pages of the application corresponding to the target domain. The content of the pages may be one or more of: news and information published in the application, conversation sentences generated during customer service consultation in the application, and messages posted in the application's forum. Text in an application, such as financial news published in a financial management application, query sentences entered by users, and answers given by customer service, is generally strongly related to the application's domain. A target candidate word composed of characters from such text is therefore likely to be a domain term, which improves the efficiency of text recognition.
In one embodiment, every candidate word composed of adjacent characters in the application's text may be taken as a target candidate word, or the characters of the text to be identified may be screened further to obtain the target candidate words. For example, words already determined to be domain terms of the domain may be filtered out, or modal particles may first be removed from the text to be identified and the candidate words then formed from the remaining adjacent characters may be taken as target candidate words.
In one embodiment, target candidate words may be generated according to the adjacency of characters in the text to be identified, for example by combining adjacent characters into target candidate words. The number of characters in a target candidate word can be determined as needed, for example 2 or 3. As a concrete example, if the text to be identified is "abcdefg", then "ab", "bc" and so on may be taken as target candidate words, and "abc", "bcd" and so on may also be taken as target candidate words. If "ab" has already been confirmed as a domain term of the domain, "ab" need not be taken as a target candidate word.
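The "abcdefg" example above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `candidate_words` and its parameters are hypothetical, and the candidate lengths 2 and 3 are the example values mentioned in the text.

```python
def candidate_words(text, sizes=(2, 3), known_domain_terms=frozenset()):
    """Build candidate words from runs of adjacent characters.

    sizes: candidate lengths in characters (2 and 3 are the example
        values from the description).
    known_domain_terms: words already confirmed as domain terms,
        which are skipped.
    """
    seen = set()
    candidates = []
    for n in sizes:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if gram in known_domain_terms or gram in seen:
                continue  # skip confirmed domain terms and duplicates
            seen.add(gram)
            candidates.append(gram)
    return candidates


print(candidate_words("abcdefg", known_domain_terms={"ab"}))
# ['bc', 'cd', 'de', 'ef', 'fg', 'abc', 'bcd', 'cde', 'def', 'efg']
```

Because Chinese text has no spaces between words, sliding a fixed-size window over the characters is a simple way to enumerate possible words without running a word segmenter first.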
In one embodiment, target candidate words may also be selected from the text to be identified according to at least one of word association degree and word independence degree. For example, candidate words whose word association degree is higher than a preset value and whose word independence degree is higher than a preset value may be selected as target candidate words.
Step S204: obtain a general-domain text set and the target text set of the target domain corresponding to the text to be identified, the general-domain text set containing general-domain texts.
Specifically, the domain to which the text to be identified belongs is the target domain. It may be determined from the source of the text to be identified: for example, if the text to be identified is a financial article in a financial management application, its target domain is the financial management domain. The target domain may also be obtained from input domain information: when domain terms of a certain domain need to be identified, the text to be identified and its corresponding domain can be received as user input. The target text set is the text set of the target domain, and the target domain is the domain of the text to be identified. The target domain may be the social domain, the financial domain, the medical domain and so on, as required. The text set of the target domain may be imported into the server, collected by the server from the application corresponding to the target domain, or formed from target texts obtained from other data sources. The general-domain text set is a text set composed of general-domain texts. A general-domain text is a text that is not strongly domain-specific: in contrast to highly specialized texts, a general text has no single domain of application and is broadly applicable. General-domain texts may be, for example, news texts; news can be crawled from the network as general-domain text, or general-domain texts can be selected manually and stored in the server. It will be understood that the general-domain text set and the target text set may be stored in the server that executes the text recognition method or in other servers. One or both of the sets may also be obtained in real time; for example, when text recognition is required, the texts published in the application within the last 10 days may be obtained to form the target text set. The number of texts in each set can be set as needed; for example, the general-domain text set may contain 10,000 texts and the target text set 100 texts.
In one embodiment, because the general-domain text set is composed of general-domain texts, whose data sources are relatively abundant, while the target text set consists of texts of the target domain, which are comparatively scarce, the general-domain text set contains more texts than the target text set. For example, the ratio of the number of texts in the general-domain text set to that in the target text set may be 10:1. When counting texts, one article can be counted as one text. For question-and-answer sentences, each query sentence can be counted as one text, or a complete query session can be counted as one text, as required.
Step S206: calculate the target importance of the target candidate word in the target text set and its reference importance in the general-domain text set.
Specifically, importance indicates how important a word is in a text set: the greater the importance, the more important the candidate word is in that set. The target importance is the importance of the target candidate word in the target text set, and the reference importance is its importance in the general-domain text set. Importance may be determined from the number of occurrences of the target candidate word: the more occurrences, the greater the importance. For example, the number of occurrences itself may be used as the importance, or the frequency of occurrence derived from that number may be used.
In one embodiment, the importance may be determined from the frequency of occurrence of the target candidate word in a text set, expressed by the formulas P_g(w) = C_g(w) / C_g(ALL) and P_f(w) = C_f(w) / C_f(ALL). Here w denotes the target candidate word, P denotes frequency, g denotes the general-domain text set, f denotes the target text set, and C denotes a count. Thus P_g(w) is the frequency of occurrence of the target candidate word w in the general-domain text set g, P_f(w) is its frequency of occurrence in the target text set f, C_g(w) is the number of occurrences of w in g, and C_f(w) is the number of occurrences of w in f. C_g(ALL) is the number of words in the general-domain text set g, and C_f(ALL) is the number of words in the target text set f. The number of words may be the number of characters in the text set or the number of words obtained after word segmentation.
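The frequency-based importance can be sketched in Python as below. This is a hedged illustration under stated assumptions: each text set is a list of strings, C(ALL) is taken to be the total number of characters (one of the two options the description allows), and the helper names `occurrence_count` and `frequency` are hypothetical.

```python
def occurrence_count(word, texts):
    """C(w): number of times `word` appears in a list of texts,
    counting overlapping matches."""
    total = 0
    for text in texts:
        for i in range(len(text) - len(word) + 1):
            if text[i:i + len(word)] == word:
                total += 1
    return total


def frequency(word, texts):
    """P(w) = C(w) / C(ALL), with C(ALL) taken as the total character
    count (an assumption; a segmented word count is equally valid)."""
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        return 0.0
    return occurrence_count(word, texts) / total_chars


target_set = ["abcab", "xab"]      # toy stand-in for target-domain texts
general_set = ["abxxxxxxxx"]       # toy stand-in for general-domain texts
print(frequency("ab", target_set))   # 0.375  (3 occurrences / 8 characters)
print(frequency("ab", general_set))  # 0.1    (1 occurrence / 10 characters)
```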
Step S208: calculate the target relevance between the target candidate word and the target domain from the target importance and the reference importance of the target candidate word.
Specifically, the target relevance indicates how closely the target candidate word is related to the target domain: the greater the relevance, the more closely related the word is to the target domain. The target importance is positively correlated with the target relevance, and the reference importance is negatively correlated with it. That is, when the target importance increases the target relevance increases, and when the reference importance increases the target relevance decreases. In one embodiment, the target relevance may be the quotient of the target importance divided by the reference importance.
In one embodiment, calculating the target relevance between the target candidate word and the target domain from the target importance and the reference importance includes: calculating an initial relevance between the target candidate word and the target domain from the target importance and the reference importance; determining a corresponding relevance confidence from the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance from the initial relevance and the relevance confidence.
Specifically, the relevance confidence indicates how reliable the initial relevance is, and it is positively correlated with the number of occurrences of the target candidate word in the target text set. When the relevance is calculated, a target candidate word that occurs only a few times in both the target text set and the general-domain text set may end up with a spuriously high relevance. A confidence coefficient can therefore be derived from the number of occurrences of the target candidate word in the target text set and used to adjust the relevance, so that the resulting target relevance is more accurate. In one embodiment, the target relevance is calculated as X(w) = (P_f(w) / P_g(w)) * log_2(C_f(w)), where X(w) is the target relevance, the quotient P_f(w) / P_g(w) is the initial relevance, and log_2(C_f(w)) is the relevance confidence. Because the number of occurrences of the target candidate word in the target text set is taken into account when determining the relevance confidence, a more accurate target relevance can be obtained.
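The relevance formula can be sketched in Python directly from raw counts. The function name `target_relevance` is hypothetical. The description does not say what to do when the candidate word never appears in the general-domain text set, so this sketch assumes a count of 0 is replaced by 1 to avoid division by zero; that smoothing step is an assumption, not part of the source.

```python
import math


def target_relevance(cf_w, cf_all, cg_w, cg_all):
    """X(w) = (P_f(w) / P_g(w)) * log2(C_f(w)).

    cf_w, cf_all: occurrences of w and total word count in the target set.
    cg_w, cg_all: the same counts for the general-domain set.
    """
    if cf_w <= 0:
        return 0.0                    # word absent from the target set
    p_f = cf_w / cf_all               # target importance
    p_g = max(cg_w, 1) / cg_all       # reference importance (assumed floor of 1)
    initial_relevance = p_f / p_g
    confidence = math.log2(cf_w)      # relevance confidence
    return initial_relevance * confidence


# 8 hits in 1,000 domain words versus 2 hits in 100,000 general words:
# initial relevance 400, confidence log2(8) = 3.
print(round(target_relevance(8, 1000, 2, 100000), 6))   # 1200.0
# A word seen only once in the target set gets confidence log2(1) = 0.
print(target_relevance(1, 1000, 0, 100000))             # 0.0
```

The second call shows the intended effect of the confidence term: a word that is rare in both sets may have a large quotient, but a single occurrence in the target set is not enough evidence, so its score collapses to zero.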
Step S210: take the target candidate word as a domain term of the target domain according to the target relevance.
Specifically, if the target relevance is greater than, or greater than or equal to, a second threshold, the target candidate word can be taken as a domain term of the target domain; if the target relevance is less than the second threshold, the target candidate word can be determined not to be a domain term of the target domain. Alternatively, when the target relevance is less than the preset threshold, other methods can be combined to decide whether the target candidate word is a domain term of the target domain. For example, target candidate words whose target relevance is less than the second threshold but greater than a third threshold are shown on the display interface of the terminal, and whether each is a domain term of the target domain is decided from the user's selection operation on it: if the selection operation confirms that the target candidate word is a domain term of the target domain, the target candidate word can be taken as a domain term of the target domain.
In one embodiment, if there are multiple target candidate words, they can be sorted by target relevance in descending order, and the top P target candidate words taken as domain words of the target domain. Alternatively, among the top m target candidate words, those whose target relevance is greater than a fourth threshold are taken as domain words of the target domain. Here P and m are integers greater than 1, and the specific values of P, m, the second threshold, the third threshold and the fourth threshold can be set as needed.
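The ranking and threshold rules above can be combined in one small helper. The following is a minimal sketch; the function name, the parameter names and the sample scores are all hypothetical.

```python
def select_domain_words(relevance, top_n=None, threshold=None):
    """Pick domain words from a {candidate: target relevance} mapping.

    top_n: keep only the top_n highest-ranked candidates (None = keep all).
    threshold: keep only candidates whose relevance exceeds it (None = no cut).
    """
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if threshold is not None:
        ranked = [(word, score) for word, score in ranked if score > threshold]
    return [word for word, _ in ranked]


scores = {"w1": 9.0, "w2": 4.0, "w3": 1.5, "w4": 0.2}
top_two = select_domain_words(scores, top_n=2)
top_three_above_two = select_domain_words(scores, top_n=3, threshold=2.0)
```

Passing only `top_n` corresponds to the "top P" rule, and passing both arguments corresponds to the "top m, above the fourth threshold" rule.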
With the above text recognition method, apparatus, computer device and storage medium, a text to be recognized is obtained, and target candidate words are obtained from the characters in the text to be recognized; a general-domain text set and the target text set of the target domain corresponding to the text to be recognized are obtained; the target importance of each target candidate word in the target text set and its reference importance in the general-domain text set are calculated; the target relevance between the target candidate word and the target domain is calculated from the target importance and the reference importance; and the target candidate word is taken as a domain word of the target domain according to the target relevance. Because the target candidate words are obtained from the text to be recognized, and comparing a target candidate word's importance in the text set of the target domain with its importance in the general-domain text set reflects whether the word is related to the target domain, domain words related to the target domain of the text to be recognized can be obtained with high accuracy.
After domain words are obtained, they can be stored in a lexicon corresponding to the target domain. Domain words stored in the lexicon can be used to judge whether a text belongs to the target domain, and can also be used to adjust words in an application, for example to correct erroneous words entered by a user into the corresponding domain words according to the domain of the application. In one embodiment, after the user enters the pinyin characters corresponding to a domain word in the application, the domain word is offered as a candidate word for those pinyin characters, improving the efficiency of entering words in the application.
For example, suppose a user needs to communicate with an artificial intelligence customer service agent in a wealth management application, and the sentence entered is "How do I do the wealth management in Licaitong" with the domain words "Licaitong" and "wealth management" mistyped. The sentence can be corrected to "How do I do the wealth management in Licaitong", the reply sentence corresponding to the corrected sentence is then obtained, and the reply sentence is output.
In one embodiment, as shown in Fig. 3A, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the method further includes:
Step S302, determine the mapping characters corresponding to the domain word according to mapping relationships, where the mapping relationships include at least one of a near-shape mapping and a near-sound mapping.
Specifically, a near-shape mapping is a mapping relationship between characters whose glyph structures are similar, and a near-sound mapping is a mapping relationship between characters whose phonetic notations are similar. The rules for deciding whether glyph structures are similar and whether phonetic notations are similar can be set as needed. For near-shape mapping, whether two characters are similar can be determined from a near-shape character dictionary. For phonetic notation, two notations can be treated as near-sound if one or more of the following holds: they are identical, they differ by one phonetic symbol, or they differ by two phonetic symbols. A mapping character can be one or both of a mapped character and the corresponding phonetic symbols. Because the mapping relationships are preset, the mapping characters corresponding to a domain word can be obtained from the mapping relationships once the domain word is obtained.
As a practical example, suppose "理财通" (Licaitong) is obtained as a domain word of the wealth management domain. According to the near-shape mapping relationship, the near-shape character of "理" is "里", so the mapping characters of "理财通" may include "里财通". According to the near-sound mapping relationship, the near-sound character of "通" (tong) is "同" and the corresponding near-sound phonetic string is "ton", so the mapping characters of "理财通" may also include "理财同" and "licaiton".
Step S304, establish the association relationship between the domain word and the mapping characters.
Specifically, after the mapping characters are obtained, the domain word and its mapping characters can be stored in association, thereby establishing the association relationship between them. For example, an association lexicon can be built that stores the association relationships between domain words and mapping characters.
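Steps S302 and S304 can be sketched as building a lookup table from mapping characters back to the domain word. This is only an illustration under stated assumptions: the function name is hypothetical, the near-shape and near-sound tables are tiny hand-written samples, and only single-character substitution is handled (pinyin variants such as "licaiton" would need a separate phonetic step).

```python
def build_association(domain_words, near_shape, near_sound):
    """Map each single-character variant of a domain word back to that word.

    near_shape / near_sound: {character: [similar characters]} tables,
    assumed to be prepared in advance.
    """
    association = {}
    for word in domain_words:
        for index, char in enumerate(word):
            variants = list(near_shape.get(char, [])) + list(near_sound.get(char, []))
            for variant in variants:
                mapped = word[:index] + variant + word[index + 1:]
                association[mapped] = word
    return association


# Illustrative sample tables only.
near_shape = {"理": ["里"]}
near_sound = {"通": ["同"]}
table = build_association(["理财通"], near_shape, near_sound)
```

Storing the variants as keys means a later lookup of an entered string directly yields the domain word it should be corrected to.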
In one embodiment, the association relationship between domain words and mapping characters can be implemented through a lexicon: the association relationships can be stored in an error correction mapping lexicon, which includes a near-sound mapping table and a near-shape mapping table. Fig. 3B is a flowchart of establishing the association relationship between domain words and mapping characters in one embodiment. After a set of new domain words is obtained by a new word discovery module, a preset near-shape character dictionary can be used to obtain the near-shape characters corresponding to each domain word, and the mapping between near-shape characters and domain words is established to obtain a near-shape mapping table, as shown in Table 1. A phonetic notation module can be used to obtain the near-sound phonetic characters corresponding to each new domain word, and the mapping between near-sound characters and domain words is established to obtain a near-sound mapping table, as shown in Table 2. Tables 1 and 2 are stored in the error correction mapping lexicon.
In one embodiment, association relationships between non-domain words and mapping characters can also be stored in the above lexicon. For example, assuming "what" is a non-domain word, the mapping table can be as shown in Table 3.
Table 3
Pinyin | Word |
licaitong | Licaitong |
shouyi | Income |
licai | Wealth management |
shenme | What |
In one embodiment, the above association relationships can be updated dynamically online. For example, the text recognition method can be executed at preset intervals to obtain domain words, obtain the mapping characters corresponding to the domain words, and establish the association relationships between them. In this way, new domain words can be obtained continually.
In one embodiment, as shown in Fig. 4, obtaining target candidate words according to the characters in the text to be recognized includes:
Step S402, generate an initial candidate word set according to the adjacency relationships of the characters in the text to be recognized.
Specifically, the number of characters in an initial candidate word can be determined as needed, for example 2 or 3. The characters of an initial candidate word are adjacent in the text to be recognized. When the text to be recognized is obtained, any combination of adjacent characters in it can be taken as an initial candidate word. The number of initial candidate words in the initial candidate word set can be determined as needed. For example, all combinations of two adjacent characters in the text to be recognized can be taken as initial candidate words, or the initial candidate word set can be obtained after removing words already determined to be words of the target domain and invalid words such as modal particles and auxiliary words. In one embodiment, the character combinations generated from the adjacency relationships of the characters can be compared with the words in a dictionary and/or lexicon, and only those not present in the dictionary and/or lexicon are taken as initial candidate words. This reduces the number of initial candidate words, and the words obtained are new words.
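Candidate generation as described in step S402 amounts to sliding a fixed-size window over the text and discarding combinations that are already known. A minimal sketch follows; the function name, the `sizes` default and the `known_words` filter are assumptions made for illustration.

```python
def initial_candidates(text, sizes=(2, 3), known_words=frozenset()):
    """Return the set of adjacent-character combinations of the given sizes.

    known_words: words already in a dictionary/lexicon (or stop words),
    which are excluded so that only potential new words remain.
    """
    candidates = set()
    for size in sizes:
        for start in range(len(text) - size + 1):
            gram = text[start:start + size]
            if gram not in known_words:
                candidates.add(gram)
    return candidates


pairs = initial_candidates("abcd", sizes=(2,), known_words={"bc"})
```

Filtering against `known_words` is what keeps the candidate set small, as the paragraph notes.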
Step S404, calculate the word association degree and the word independence degree, in the target text set, of each initial candidate word in the initial candidate word set.
Specifically, the word association degree indicates how tightly the characters that make up a word are bound together. An initial candidate word with a high association degree has a high probability of appearing in the application. The word independence degree indicates how likely the word is to stand alone as a word; a high word independence degree means the initial candidate word is likely to be a complete word. Both the word association degree and the word independence degree are obtained from the target text set.
In one embodiment, the word association degree can be expressed by the PMI (Pointwise Mutual Information) of the initial candidate word. PMI measures the correlation between two random variables. The probability of occurrence of the initial candidate word and of each of its characters can be obtained from the target text set, and the PMI is obtained from these probabilities. For example, for an initial candidate word consisting of "xy", the PMI can be calculated with formula (1), where p(xy) is the probability that the initial candidate word "xy" occurs, p(x) and p(y) are the probabilities that "x" and "y" occur, P(y|x) = C(xy)/C(x), P(xy) = P(x) * P(y|x), P(x) = C(x)/C(ALL), and P(y) = C(y)/C(ALL). C(xy), C(x) and C(y) are the numbers of times "xy", "x" and "y" occur in the target text set. P(y|x) is the probability that, in the target text set, the character following "x" is "y" given that "x" occurs.
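Formula (1) itself is not reproduced in this text, so the sketch below assumes the standard definition PMI = log2(P(xy) / (P(x) * P(y))) and builds the probabilities exactly as listed above. The function name is hypothetical, and C(ALL) is assumed to be the total number of characters in the target text set.

```python
import math
from collections import Counter


def pmi(texts, x, y):
    """Pointwise mutual information of the two-character candidate x + y."""
    chars, pairs = Counter(), Counter()
    for text in texts:
        chars.update(text)
        pairs.update(text[i:i + 2] for i in range(len(text) - 1))
    if pairs[x + y] == 0:
        return float("-inf")            # the candidate never occurs
    total = sum(chars.values())         # assumed meaning of C(ALL)
    p_x = chars[x] / total              # P(x) = C(x) / C(ALL)
    p_y = chars[y] / total              # P(y) = C(y) / C(ALL)
    p_y_given_x = pairs[x + y] / chars[x]   # P(y|x) = C(xy) / C(x)
    p_xy = p_x * p_y_given_x            # P(xy) = P(x) * P(y|x)
    return math.log2(p_xy / (p_x * p_y))


sample_texts = ["abab", "cd"]
score = pmi(sample_texts, "a", "b")
```

In the sample, "a" is always followed by "b", so the pair scores above zero, whereas a pair that never occurs is given negative infinity.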
In one embodiment, the PMI can be normalized and the resulting normalized PMI used as the word association degree. In one embodiment, the normalized PMI is calculated as N_PMI = PMI/H(x) or N_PMI = PMI/H(y), where N_PMI denotes the normalized PMI, H(x) = P(x) * log2 P(x) and H(y) = P(y) * log2 P(y). Either PMI/H(x) or PMI/H(y) can be taken as the normalized PMI, for example the smaller of the two; accordingly, when the normalized PMI is calculated, the smaller of H(x) and H(y) can be obtained and the PMI divided by that smaller value.
In one embodiment, the word independence degree can be determined from the entropy of the initial candidate word, which can be at least one of the left entropy and the right entropy. Entropy expresses an amount of information: the left entropy expresses the information content of the context preceding the initial candidate word, and the right entropy that of the context following it. The left and right entropy reflect how varied the contexts of the initial candidate word are. A high left entropy means the word combines with many different preceding characters, and a high right entropy means it combines with many different following characters. A word with varied collocations has a relatively high degree of freedom and is therefore likely to stand alone as a word. A low entropy means the collocations are limited and the candidate has to be combined with fixed characters to be used, so it is less likely to stand alone. The left and right entropy can be calculated with formulas (2) and (3), where EL(W) is the left entropy and ER(W) the right entropy of the initial candidate word W. A is the set of characters located to the left of the initial candidate word in the target text set, and a is a character in A. p(aW|a) is the probability that W occurs in the target text set given that a occurs, and P(Wb|W) is the probability that b occurs in the target text set given that W occurs.
In one embodiment, the word independence degree can be determined from the sum of the left entropy and the right entropy of the initial candidate word, E = EL(W) + ER(W), and E can be used as the word independence degree of the initial candidate word.
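Formulas (2) and (3) are not reproduced in this text. The sketch below therefore assumes the common branching-entropy definition, in which each side's entropy is the Shannon entropy of the distribution of characters found immediately next to W; this may differ in detail from the patent's formulas. All function names are hypothetical.

```python
import math
from collections import Counter


def side_entropy(texts, word, side):
    """Shannon entropy of the characters directly left or right of `word`."""
    neighbours = Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            index = start - 1 if side == "left" else start + len(word)
            if 0 <= index < len(text):
                neighbours[text[index]] += 1
            start = text.find(word, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0                      # no context observed on this side
    return -sum((n / total) * math.log2(n / total) for n in neighbours.values())


def word_independence(texts, word):
    """E = EL(W) + ER(W), the sum of left and right entropy."""
    return side_entropy(texts, word, "left") + side_entropy(texts, word, "right")


contexts = ["xabz", "yabz"]
left = side_entropy(contexts, "ab", "left")     # two different left neighbours
right = side_entropy(contexts, "ab", "right")   # always followed by "z"
```

In the sample, "ab" has varied left contexts but a single fixed right context, so only the left side contributes to the independence degree.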
Step S406, calculate the word generation degree of each initial candidate word from the word association degree and the word independence degree.
Specifically, the word generation degree measures how likely the initial candidate word is to be a newly formed word. The word association degree is positively correlated with the word generation degree, and the word independence degree is positively correlated with the word generation degree. In one embodiment, a word generation degree mapping value can be obtained from the word independence degree, and the word generation degree obtained from the word association degree and that mapping value.
In one embodiment, calculating the word generation degree of each initial candidate word from the word association degree and the word independence degree includes: determining an association confidence from the number of occurrences of the initial candidate word in the target text set; determining the initial word association degree of the initial candidate word from its probability of occurrence in the target text set; and calculating the target word association degree from the association confidence and the initial word association degree of the initial candidate word.
Specifically, the association confidence reflects how trustworthy the calculated word association degree is. The initial word association degree can be obtained with the PMI calculation described above. When the word association degree is calculated, a small total number of words in the text can make the probability of occurrence of an initial candidate word high and thus inflate the calculated word association degree, so the association confidence can be determined from the number of occurrences of the word. After the initial association degree is calculated, it is adjusted according to the association confidence to obtain the target word association degree. For example, the target word association degree can be the product of the initial word association degree and the association confidence.
In one embodiment, the word generation degree can be calculated with formula (4), where U(W) denotes the word generation degree of the initial candidate word W, N_PMI(W) denotes the word association degree of W, and C(W) is the number of occurrences of W in the target text set. h(W) denotes the penalty value corresponding to the word independence degree, determined by the range in which the word independence degree falls: a range of larger values has a smaller penalty value than a range of smaller values. For example, the penalty value can be set to 3 when the word independence degree is less than 1, and to 0 when the word independence degree is greater than or equal to 1.
U(W) = N_PMI(W) * log(C(W)) - h(W)    (4)
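Formula (4) can be sketched as follows. The base of the logarithm is not stated, so the natural logarithm is assumed here; the penalty rule uses the example values from the text (3 below an independence of 1, otherwise 0). Function names are hypothetical.

```python
import math


def penalty(independence):
    """h(W): example penalty from the text, larger for low independence."""
    return 3.0 if independence < 1 else 0.0


def word_generation_degree(n_pmi, count, independence):
    """U(W) = N_PMI(W) * log(C(W)) - h(W), with log assumed to be natural."""
    if count <= 0:
        raise ValueError("count must be positive")
    return n_pmi * math.log(count) - penalty(independence)


penalised = word_generation_degree(2.0, 1, 0.5)      # log(1) = 0, penalty 3
unpenalised = word_generation_degree(2.0, 1, 1.5)    # log(1) = 0, penalty 0
```

A candidate seen only once contributes nothing through the association term, so its score is driven entirely by the penalty.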
Step S408, screen the initial candidate word set according to the word generation degree of each initial candidate word to obtain the target candidate words.
In one embodiment, initial candidate words whose word generation degree is greater than a fifth threshold can be taken as target candidate words. Alternatively, the initial candidate words can be sorted by word generation degree in descending order and the top d taken as target candidate words. Alternatively, among the top e initial candidate words, those whose word independence degree is greater than a sixth threshold are taken as target candidate words. The specific values of d, e, the fifth threshold and the sixth threshold can be set as needed.
In one embodiment, the text to be recognized is obtained from the target text set. Screening by word association degree and word independence degree yields target candidate words that are new words appearing in the target text set, and each new word is then judged as to whether it is a domain word, so new domain words can be obtained from the target text set.
In one embodiment, the text recognition method further includes the following steps: when the word independence degree of an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to it in the text to be recognized; and adding the new initial candidate word to the initial candidate word set.
Specifically, a small word independence degree means that the initial candidate word is unlikely to stand alone as a word and may only become an independent word when combined with other characters. A first threshold can therefore be set and the word independence degree of the initial candidate word compared with it. If the word independence degree is less than the first threshold, a character adjacent to the initial candidate word in the text to be recognized is obtained and combined with the initial candidate word to form a new initial candidate word, which is added to the initial candidate word set. The word association degree and word independence degree of the new initial candidate word are then calculated, its word generation degree is calculated from them, and the target candidate words are screened from the initial candidate word set. In this embodiment of the present invention, because adjacent characters continue to be obtained to form new initial candidate words when the word independence degree is less than the first threshold, more accurate and more numerous domain words can be obtained.
In one embodiment, when the word independence degree and/or word association degree of the new initial candidate word is calculated, the initial candidate word as it was before the adjacent character was added can be treated as a whole. For example, suppose the initial candidate word before the adjacent character is added is "ab", its word independence degree is 1 and the first threshold is 2. Because the word independence degree of "ab" is less than the first threshold, the character "c" adjacent to "ab" can be appended to form the new initial candidate word "abc". When the word association degree of "abc" is calculated, "ab" is treated as a whole, that is, as a single character. Thus, when the PMI of "abc" is calculated, "ab" can be used as "x" in formula (1) and "c" as "y" in formula (1).
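The extension rule can be sketched as follows. The text does not say on which side the adjacent character is taken, so this illustration assumes the character to the right, matching the "ab" + "c" example; the function name and the `independence` callback are hypothetical.

```python
def extend_candidates(text, candidates, independence, first_threshold):
    """Form new candidates by appending the right-hand neighbour character.

    independence: callable returning the word independence degree of a word.
    Only candidates scoring below first_threshold are extended.
    """
    extended = set()
    for candidate in candidates:
        if independence(candidate) >= first_threshold:
            continue
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if end < len(text):
                extended.add(candidate + text[end])
            start = text.find(candidate, start + 1)
    return extended


# "ab" has independence 1, below the first threshold 2, so it is extended.
new_words = extend_candidates("abcabd", {"ab"}, lambda word: 1, 2)
```

The returned set would then be added to the initial candidate word set and scored like any other candidate.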
As shown in Fig. 5, in one embodiment a text processing method is proposed. This embodiment is mainly described with the method applied to the terminal 110 or the server 120 of Fig. 1 above. The method can specifically include the following steps:
Step S502, obtain an initial input text.
Specifically, the initial input text is text on which text processing needs to be performed, so that the words in the text are checked and a correct target input text is obtained. The initial input text can be text already published in an application, or text in an input state, such as text entered through an input box of the application. It will be appreciated that when a specific web page is entered through a browser, the web page can be regarded as a page of an application, namely the web version of the application. For example, if a user posts a comment in a forum of an application, the comment can be used as the initial input text. If the query sentence "How do I do the wealth management in Licaitong" is entered, with its domain words mistyped, in the customer service session interface of the "Licaitong" web page, then that sentence is used as the initial input text. Here "Licaitong" is the name of a wealth management application.
Step S504, obtain the association relationship corresponding to the target domain of the initial input text, where the association relationship is the association relationship between domain words and mapping characters, and the domain words are recognized from the text to be recognized corresponding to the target domain, the general-domain text set and the target text set corresponding to the target domain.
Specifically, the target domain of the initial input text can be determined from the source of the initial input text, for example the domain of the application to which the initial input text belongs. For example, if the initial input text is obtained in a medical APP (application), the target domain is the medical domain. The target domain of the initial input text can also be obtained from entered domain information: when text of a certain domain needs to be processed, the text to be processed and its domain can be received as user input.
The association relationship corresponding to the target domain is preset, that is, the association relationships between the domain words of the target domain and mapping characters are established in advance. When a text needs to be adjusted, the association relationships between the domain words of the target domain and the mapping characters can be obtained, so as to find the domain words corresponding to the initial input text and correct the initial input text. The association relationship between a domain word and a mapping character can be at least one of a near-shape association and a near-sound association. The domain words are recognized from the text to be recognized, the general-domain text set and the target text set of the target domain: target candidate words can be obtained from the adjacency relationships of characters in the application's text to be recognized, and whether a target candidate word is a domain word of the target domain is determined from its importance in the general-domain text set and in the target text set. The domain word recognition method can follow the text recognition method of the above embodiments and is not repeated here.
Step S506, determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship.
Specifically, the characters in the initial input text can be obtained and matched against the mapping characters, and the domain word corresponding to a matched mapping character is taken as a target domain word. In one embodiment, when the initial input text is obtained, it can first be segmented to obtain a word sequence; each word in the word sequence is matched against the mapping characters, and the domain word corresponding to the matched mapping character is taken as the target domain word.
Step S508, adjust the initial input text according to the target domain word to obtain the target input text.
Specifically, after the target domain word is obtained, it can replace the corresponding characters in the initial input text to obtain the target input text. When there are multiple target domain words and/or other non-domain words are also involved, the target domain words can be screened before the initial input text is adjusted. The screening can use, for example, an n-gram language model, such as a 2-gram or 3-gram model.
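Steps S506 and S508 can be sketched as a single-pass substitution that looks up mapping characters and replaces them with their domain words. This is only an illustration: the function name is hypothetical, the association table is a hand-written sample, and the n-gram screening mentioned above is not included.

```python
import re


def adjust_text(text, association):
    """Replace every mapping character string with its associated domain word.

    association: {mapping characters: domain word}. Longer keys are tried
    first so that a long variant is not split by a shorter one.
    """
    if not association:
        return text
    keys = sorted(association, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: association[match.group(0)], text)


# Illustrative sample association only.
sample_association = {"里财通": "理财通", "里财": "理财"}
corrected = adjust_text("里财通中的里财", sample_association)
```

Using one compiled alternation rather than repeated `str.replace` calls avoids a replacement being matched again by a later key.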
With the above text processing method, the domain words corresponding to text entered in an application can be determined from the association relationships between the domain words of the application's target domain and mapping characters, and the initial input text is adjusted according to the domain words to obtain the target input text. Because the domain words are recognized from the text to be recognized, the general-domain text and the text of the target domain, they are words related to the target domain, so the target input text obtained by adjusting for the specific domain is highly accurate, and a target input text suited to the target domain of the application is obtained with strong domain specificity and adaptability. Further, because the domain words of each target domain are kept separate, the number of words in the error correction association vocabulary can be reduced, improving the efficiency of word processing.
In one embodiment, after the target input text is obtained, the target text set can also be updated with it, the target input text being added as a text in the target text set.
In one embodiment, as shown in Fig. 6, step S508, adjusting the initial input text according to the target domain word to obtain the target input text, includes:
Step S602, obtain the candidate input words corresponding to the initial input text.
Specifically, the initial input text may include one or more words. The initial input text can be segmented to obtain a word sequence, and the candidate input words corresponding to each word are obtained as the candidate input words of the initial input text. Word association relationships are preset, and the corresponding candidate input words are obtained from them. For example, "what" (pinyin "shenme") can be set as the associated word of a mistyped homophone of it, so that if the initial input text includes that homophone, the corresponding candidate input word "what" can be obtained.
For example, if the initial input text needs to be corrected, a spelling error correction association lexicon can be set up, in which the association relationships between input words and domain words are stored, as shown in Table 1 and Table 2. When the initial input text consists of phonetic symbols such as pinyin, the corresponding domain word can be obtained from the pinyin. In one embodiment, when the initial input text includes words, the words can also be converted into phonetic symbols and the domain word corresponding to the phonetic symbols then obtained. For example, suppose the initial input text includes a misspelled string whose pinyin is "licaiton"; the pinyin "licaiton" is obtained, and according to Table 2 the target domain word corresponding to "licaiton" is "Licaitong".
In one embodiment, the initial input text can be segmented and, after the word sequence is obtained, the near-shape words corresponding to each word in the sequence are obtained as candidate words of the initial input text. Alternatively, the pinyin sequence corresponding to each word sequence can be obtained, and the candidate words corresponding to each pinyin in the sequence obtained as candidate words of the initial input text. It will be appreciated that both the near-shape words and the candidate words corresponding to the pinyin can be used as candidate words of the initial input text.
In one embodiment, if the initial input text is to be corrected, error position detection can be performed on it and candidate input words obtained for the words at the error positions. Error position detection can be performed with an artificial intelligence machine learning model. In one embodiment, error position detection may include: calculating the transition probability between each pair of adjacent words in the initial input text, and taking positions where the transition probability is lower than a preset value as error positions. The transition probability between adjacent words can be calculated as P(G|F) = C(FG)/C(F), where "F" and "G" are adjacent words in the initial input text with "F" before "G", C(FG) is the number of occurrences of "FG" in a preset text set, and C(F) is the number of occurrences of "F" in the preset text set. The preset text set is a text set corresponding to the target domain, for example the target text set.
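The transition-probability check can be sketched as follows. The function name, the default preset value and the sample corpus are hypothetical, and texts are assumed to be already segmented into word lists.

```python
from collections import Counter


def error_positions(words, corpus, preset=0.1):
    """Return indexes i where P(words[i] | words[i-1]) is below `preset`.

    corpus: segmented sentences from the preset (target-domain) text set.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    positions = []
    for index in range(1, len(words)):
        previous, current = words[index - 1], words[index]
        # P(G|F) = C(FG) / C(F); an unseen F gives probability 0.
        count_f = unigrams[previous]
        probability = bigrams[(previous, current)] / count_f if count_f else 0.0
        if probability < preset:
            positions.append(index)
    return positions


sample_corpus = [["my", "income", "rose"], ["my", "income", "fell"]]
suspect = error_positions(["my", "handicraft", "rose"], sample_corpus)
clean = error_positions(["my", "income", "rose"], sample_corpus)
```

A word that never follows its predecessor in the domain corpus is flagged, and so is the word after it, because the unseen word has no outgoing transitions either.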
In one embodiment, error position detection is less effective in the application scenario of a target domain than in the general domain, because text that is perfectly acceptable in the general domain may still need correction in the target domain. For example, "my handicraft" is unproblematic in general-domain text, but in the financial domain the correct text should be "my income"; likewise "skeleton" is an unproblematic word in the general domain, but in the financial domain the correct word should be "stock price" (in each pair the two expressions are homophones in Chinese). Therefore, when error detection is performed on the initial input text of an application of the target domain, every position in the initial input text can be treated as an error position.
Step S604, construct a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words and the target domain words.
Specifically, the word relation chain set includes one or more word relation chains. A word relation chain is a chain formed by connecting words in sequence. The composition relationship of words refers to the order and connection relationship between words in a text. The composition relationship between words in the initial input text is fixed. For example, if the initial input text is "today is Friday", the words after segmentation are "today", "is" and "Friday", connected in that order. After the candidate input words and target domain words are obtained, they are likewise connected in sequence according to the composition relationship of the words of the initial input text to obtain the corresponding word relation chains. Because the initial input text can have one or more segmentations, and each segmented word can in turn correspond to one or more candidate input words, there can be one or more word relation chains.
As shown in Fig. 7, the method of obtaining word relation chains is described below, taking as an example homophone correction of an initial input text that is a homophone misspelling of "Licaitong is what product" (that is, "What product is Licaitong?", where "Licaitong" is a domain word with the pinyin "li cai tong"). First, the pinyin of each character of the initial input text is obtained, giving the pinyin sequence "li, cai, tong, shi, shen, me, chan, pin". A pinyin segmentation algorithm then segments this sequence, giving one pinyin sequence composed of "li, cai, tong", "shi", "shen, me", "chan, pin" and another composed of "li, cai", "tong, shi", "shen, me", "chan, pin". Candidate input words are then obtained from a lookup table that maps pinyin to candidate words; the table may include the pinyin of domain words and the pinyin of non-domain words. For example, among the candidate words above, "Licaitong" and "financing" (pinyin "li cai") are domain words, and the other candidate words are non-domain words. After the candidate input words are obtained, word relation chains are constructed according to the composition relationship of the words of the initial input text. In Fig. 7, the word relation chains may include "financing → simultaneously → what → product", "financing → colleague → what → product", "Licaitong → is → what → product" and "Licaitong → when → what → product", four word relation chains in total.
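The construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the lookup table `PINYIN_TO_CANDIDATES`, the function `build_chains`, and the English glosses used as candidate words ("Licaitong" stands for the domain word whose pinyin is "li cai tong") are all hypothetical names chosen for the example.

```python
from itertools import product

# Hypothetical pinyin-to-candidate-word table (illustrative entries only).
PINYIN_TO_CANDIDATES = {
    "li cai tong": ["Licaitong"],
    "li cai": ["financing"],
    "tong shi": ["simultaneously", "colleague"],
    "shi": ["is", "when"],
    "shen me": ["what"],
    "chan pin": ["product"],
}


def build_chains(segmentations, table):
    """Return every word relation chain for the given pinyin segmentations."""
    chains = []
    for segmentation in segmentations:
        # Candidate words for each pinyin segment, kept in the original order.
        options = [table.get(segment, []) for segment in segmentation]
        # The Cartesian product picks one candidate per segment; a segment
        # with no candidates yields no chain for that segmentation.
        chains.extend(list(chain) for chain in product(*options))
    return chains


# The two segmentations of the pinyin sequence given in the text.
segmentations = [
    ["li cai tong", "shi", "shen me", "chan pin"],
    ["li cai", "tong shi", "shen me", "chan pin"],
]
chains = build_chains(segmentations, PINYIN_TO_CANDIDATES)
for chain in chains:
    print(" → ".join(chain))
```

With this table the sketch produces the four chains listed for Fig. 7; a larger table or more segmentations would simply produce more chains.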
It will be appreciated that the homophone correction above is only an example. In practical applications, similar-shape character correction may also be performed on the initial input text, or homophone correction and similar-shape character correction may be performed on it at the same time.
Step S606: calculate the transition probability of transferring from a forward word to the current word in a word relation chain.
Specifically, a forward word is a word located before the current word in the word relation chain. The forward words may be all of the preceding words or a preset number of preceding words, such as 1 or 2 words, depending on the language model used. The transition probability is the probability that the current word occurs given that specific forward words have occurred. If the word relation chain is regarded as a hidden Markov state chain, the transition probability is the probability of moving from the preceding state to the current state. The transition probability can be written as p(J | I), the probability that the current word J occurs given the forward word I. A word relation chain contains multiple words; each word of the chain is taken in turn as the current word and the corresponding transition probability is calculated. For example, if a 2-gram model is used, the transition probability from the preceding 1 word to the current word is calculated; if a 3-gram model is used, the transition probability from the preceding 2 words to the current word is calculated.
When calculating the transition probability, the forward word and the current word may be combined as a whole. A first count is obtained, namely the number of times this combination occurs as a whole in the text set corresponding to the target domain, together with a second count, namely the number of times the forward word occurs in that text set, and the transition probability is obtained from the first count and the second count. For example, p(J | I) = Count(IJ) / Count(I), where Count(IJ) is the number of times IJ occurs in the text set corresponding to the target domain and Count(I) is the number of times I occurs in that text set. The text set corresponding to the target domain may be the same as or different from the target text set; that is, the target domain may have multiple corresponding text sets. For example, text set A, namely the target text set, may be used for text recognition, and text set B may be used for text processing.
Step S608: obtain the connection strength of a word relation chain according to the transition probabilities corresponding to the word relation chain.
Specifically, the connection strength of a word relation chain indicates how likely the words in the chain are to combine into a sentence: the greater the connection strength, the more likely the chain forms a sentence. The connection strength can be obtained by combining the transition probabilities. For example, assuming a word relation chain is "A → B → C → D", its connection strength is P(ABCD), which can be calculated as shown in formula (5), where P(A) is the probability that A is the first word of a sentence, P(B | A) is the probability of transferring from forward word A to current word B, P(C | B) is the probability of transferring from forward word B to current word C, P(D | C) is the probability of transferring from forward word C to current word D, and the final factor P(D) is the probability that D is the last word of a sentence.
P(ABCD) = P(A) * P(B | A) * P(C | B) * P(D | C) * P(D)    (5)
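Formula (5) can be sketched as a short function. The three probability tables and the function name `connection_strength` are hypothetical; the start and end factors are kept in separate tables because formula (5) uses P(A) for "first word of a sentence" and the final P(D) for "last word of a sentence". "Connection strength" here is the quantity the text also calls bonding strength.

```python
def connection_strength(chain, start_prob, end_prob, trans_prob):
    """Multiply the start, transition and end probabilities along a chain."""
    if not chain:
        return 0.0
    # P(first word starts a sentence)
    strength = start_prob.get(chain[0], 0.0)
    # P(cur | prev) for each adjacent pair of words in the chain
    for prev, cur in zip(chain, chain[1:]):
        strength *= trans_prob.get((prev, cur), 0.0)
    # P(last word ends a sentence)
    return strength * end_prob.get(chain[-1], 0.0)


# Illustrative probabilities only; every factor is 0.5.
start_prob = {"A": 0.5}
end_prob = {"D": 0.5}
trans_prob = {("A", "B"): 0.5, ("B", "C"): 0.5, ("C", "D"): 0.5}
strength = connection_strength(["A", "B", "C", "D"], start_prob, end_prob, trans_prob)
```

Any missing table entry is treated as probability 0, which is a simplification; a production model would more likely apply smoothing.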
In one embodiment, the probabilities are calculated from word occurrence counts. For example, P(A) may be equal to C(A) / C(ALL), or to C(A') / C(ALL), where C(A) is the number of times A occurs in the text set corresponding to the target domain, C(A') is the number of times A is the first word of a sentence in that text set, and C(ALL) is the number of words in that text set. The word occurrence counts may be stored in advance and updated online as the text set corresponding to the target domain changes, so that the connection strength of a relation chain is updated accordingly. For example, new text from the application corresponding to the target domain may be obtained at preset intervals, and the occurrence count of each word in the new text and the total number of words calculated, so as to update the word transition probabilities in the n-gram model. In this way, after a new application is put into use, the adjustment of initial input text in the application becomes more accurate as text accumulates in the application.
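One way to picture the online update is a small class that keeps the stored counts and folds in each new batch of text. The class name `OnlineBigramModel`, its methods, and the sample sentences are assumptions made for this sketch; scheduling the updates at preset intervals is left to the caller.

```python
from collections import Counter


class OnlineBigramModel:
    """Keeps word counts that can be updated as new application text arrives."""

    def __init__(self):
        self.unigrams = Counter()         # C(A): occurrences of each word
        self.bigrams = Counter()          # occurrences of adjacent word pairs
        self.sentence_starts = Counter()  # C(A'): times a word starts a sentence
        self.total_words = 0              # C(ALL): total number of words

    def update(self, sentences):
        """Fold a batch of newly collected, segmented sentences into the counts."""
        for words in sentences:
            if not words:
                continue
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            self.sentence_starts[words[0]] += 1
            self.total_words += len(words)

    def start_probability(self, word):
        """C(A') / C(ALL), one of the two options given for P(A)."""
        if self.total_words == 0:
            return 0.0
        return self.sentence_starts[word] / self.total_words

    def transition_probability(self, prev, cur):
        """Count(prev cur) / Count(prev), recomputed from the current counts."""
        if self.unigrams[prev] == 0:
            return 0.0
        return self.bigrams[(prev, cur)] / self.unigrams[prev]


model = OnlineBigramModel()
model.update([["Licaitong", "is", "safe"]])
before = model.transition_probability("Licaitong", "is")
model.update([["Licaitong", "product"]])  # a later batch of new text
after = model.transition_probability("Licaitong", "is")
```

Because probabilities are derived from counts on demand, each update immediately changes the connection strengths computed afterwards.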
Step S610: select a target word relation chain from the word relation chain set according to the connection strengths of the word relation chains, and take the text corresponding to the target word relation chain as the target input text.
Specifically, after the connection strengths of the word relation chains are obtained, the chain with the greatest connection strength may be selected as the target word relation chain, and its text taken as the target input text. Multiple word relation chains may of course also be selected as target word relation chains, for example the top z chains when sorted by connection strength in descending order, where z is an integer greater than 1 whose value can be set as needed, for example 3.
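Selecting the top z chains by connection (bonding) strength is a simple top-k selection. The sketch below assumes the strengths have already been computed and are supplied as (chain, strength) pairs; the function name and the sample strengths are hypothetical.

```python
import heapq


def select_target_chains(scored_chains, z=1):
    """Return the z chains with the greatest connection strength, best first."""
    best = heapq.nlargest(z, scored_chains, key=lambda item: item[1])
    return [chain for chain, _ in best]


# Illustrative (chain, strength) pairs; the strengths are made up.
scored_chains = [
    (("financing", "colleague", "what", "product"), 0.007),
    (("Licaitong", "is", "what", "product"), 0.162),
    (("Licaitong", "when", "what", "product"), 0.005),
]
top_one = select_target_chains(scored_chains)
top_two = select_target_chains(scored_chains, z=2)
```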
In one embodiment, when calculating connection strengths, the connection strength of every word relation chain in the word relation chain set may be calculated, or only that of some of the chains, for example by using the Viterbi algorithm.
In one embodiment, if a 2-gram model is used, the target word relation chain may be obtained with the Viterbi algorithm. The Viterbi algorithm assumes that, when moving from state i to state i+1, the shortest path from the starting point S to each node of state i has already been found. Then, to calculate the shortest path from S to some node X(i+1) of state i+1, it is only necessary to consider the shortest paths from S to each of the k nodes of the previous state i, together with the distance from each of these k nodes to X(i+1). In embodiments of the present invention, when the Viterbi algorithm is used, the words of the word relation chains serve as the states and the transition probabilities serve as the weights of the paths, and the goal is the maximum connection strength, calculated from the transition probabilities of the word relation chains. Accordingly, when calculating connection strengths, the maximum connection strength from the chain starting point to each previous node is calculated, and then the transition probability from each previous node to the current node. The maximum connection strength of each previous node is multiplied by the corresponding transition probability to give a connection strength from the chain starting point to the current node, and the largest of these is taken as the maximum connection strength of the current node. If there is a next node after the current node, the next node becomes the current node and the calculation is repeated until the last node of the word relation chain is reached.
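The recurrence just described can be sketched as a small Viterbi pass over a word lattice. This is an illustration under assumptions, not the patent's implementation: the lattice is given as non-empty layers of candidate words, the probabilities are invented, pairs absent from `trans_prob` are treated as impossible (probability 0), and "Licaitong" stands for the domain word whose pinyin is "li cai tong".

```python
def viterbi(layers, start_prob, trans_prob):
    """Return (max strength, best path) through a lattice of word layers."""
    # best[word] = (strength, path) of the strongest path ending at that word
    best = {word: (start_prob.get(word, 0.0), [word]) for word in layers[0]}
    for layer in layers[1:]:
        current = {}
        for word in layer:
            # Extend the best path of every previous node by one transition
            # and keep only the strongest extension for this node.
            extensions = [
                (strength * trans_prob.get((prev, word), 0.0), path + [word])
                for prev, (strength, path) in best.items()
            ]
            current[word] = max(extensions, key=lambda item: item[0])
        best = current
    return max(best.values(), key=lambda item: item[0])


# Lattice built from the candidate words of the Fig. 7 example.
layers = [
    ["financing", "Licaitong"],
    ["simultaneously", "colleague", "is", "when"],
    ["what"],
    ["product"],
]
# Hypothetical probabilities chosen only to make the example concrete.
start_prob = {"financing": 0.4, "Licaitong": 0.6}
trans_prob = {
    ("financing", "simultaneously"): 0.1,
    ("financing", "colleague"): 0.1,
    ("Licaitong", "is"): 0.5,
    ("Licaitong", "when"): 0.1,
    ("simultaneously", "what"): 0.2,
    ("colleague", "what"): 0.2,
    ("is", "what"): 0.6,
    ("when", "what"): 0.1,
    ("what", "product"): 0.9,
}
strength, path = viterbi(layers, start_prob, trans_prob)
```

Each node keeps only its single strongest incoming path, so the work grows with the number of edges between adjacent layers rather than with the number of complete chains.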
Take the word relation chains in Fig. 7 and the use of the Viterbi algorithm to obtain the target relation chain as an example. In Fig. 8, S0 and S1 denote the probabilities corresponding to "financing" and "Licaitong" (the domain word with the pinyin "li cai tong") respectively. The letter above each line "-" in a word relation chain denotes the transition probability from the word before the line to the word after it; for example, W1 denotes the transition probability from "financing" to "colleague". The maximum connection strengths from the chain starting point to the second nodes "simultaneously", "colleague", "is" and "when" are therefore S0*W0, S0*W1, S1*W4 and S1*W5 respectively. Accordingly, when calculating the maximum connection strength from the starting point to the third node "what", S0*W0 is multiplied by W2, S0*W1 by W3, S1*W4 by W6, and S1*W5 by W7. Assuming the maximum connection strength obtained is S1*W4*W6, the best path to the "what" node is "Licaitong → is". Since the last node of every word relation chain is "product", the maximum connection strength is S1*W4*W6*W8. The target word relation chain is "Licaitong → is → what → product", so the target input text is "Licaitong is what product".
In one embodiment, the target word relation chain may also be calculated with a language model of order 3 or higher. In that case, to reduce the number of connection strength calculations, when calculating the connection strength of some node X(i+1) of state i+1 from the starting point S, the top g connection strengths of the previous state, namely state i, may be obtained, and the connection strength for state i+1 calculated from these g connection strengths and the transition probabilities from state i to state i+1. The value of g can be determined as needed.
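Keeping only the top g connection strengths of the previous state resembles a beam search, and can be sketched as follows. This is one possible reading of the pruning described above, not a confirmed implementation: the trigram table, the scorer, the function names, and the probabilities are all hypothetical, and the layers are assumed to be non-empty.

```python
import heapq


def make_trigram_scorer(table):
    """Build a scorer giving P(word | up to two preceding words) from a table."""
    def score(path, word):
        context = tuple(path[-2:])  # the last two words, fewer at the start
        return table.get((context, word), 0.0)
    return score


def beam_search(layers, score, g=2):
    """Keep only the g strongest partial chains after each layer."""
    beams = [(1.0, [])]
    for layer in layers:
        expanded = [
            (strength * score(path, word), path + [word])
            for strength, path in beams
            for word in layer
        ]
        beams = heapq.nlargest(g, expanded, key=lambda item: item[0])
    return beams[0]


# Hypothetical trigram probabilities keyed by (context, word).
trigram_table = {
    ((), "financing"): 0.4,
    ((), "Licaitong"): 0.6,
    (("financing",), "colleague"): 0.1,
    (("financing",), "simultaneously"): 0.1,
    (("Licaitong",), "is"): 0.5,
    (("Licaitong",), "when"): 0.1,
    (("Licaitong", "is"), "what"): 0.6,
    (("is", "what"), "product"): 0.9,
}
layers = [
    ["financing", "Licaitong"],
    ["simultaneously", "colleague", "is", "when"],
    ["what"],
    ["product"],
]
best_strength, best_path = beam_search(layers, make_trigram_scorer(trigram_table), g=2)
```

A larger g explores more partial chains at higher cost; because of the pruning, the result is not guaranteed to be the global maximum.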
In one embodiment, the step of obtaining the initial input text may include: obtaining a query statement input in an application, and taking the query statement as the initial input text. The text processing method may further include the following steps: obtaining a query request, the query request including the target input text corresponding to the query statement; and obtaining query response data obtained according to the target input text.
Specifically, the query statement may be input on a query interface of the application; for example, it may be input in the input box of a session interface of the application in order to consult customer service. Input may be by voice, text, and so on. If input is by voice, the voice may be recognized to obtain the query statement. After the initial input text is obtained, the text processing method provided by the embodiments of the present invention is executed to obtain the target input text. The query response data is the answer statement corresponding to the target input text, and may be set in advance. For example, a product introduction text introducing the "Licaitong" (pinyin "li cai tong") product may be set; after the target input text is obtained, the corresponding product introduction text is obtained as the query response data. After the target input text is obtained, the query request may be triggered by receiving a user operation, or may be triggered automatically by the server.
In one embodiment, the query request may be triggered by receiving a user operation after the target input text is obtained. For example, as shown in Fig. 9, when the user enters in the input box a homophone misspelling of "Licaitong is what product" (pinyin "li cai tong shi shen me chan pin"), the terminal or the server may execute the text processing method provided by the embodiments of the present invention to obtain the target input text "Licaitong is what product". After obtaining the target input text, the terminal displays it above the input box. If a selection operation by the user on "Licaitong is what product" is received, a query request may be sent to the server; the server receives the query request, obtains the corresponding query response data, and returns it to the terminal, which displays the query response data.
In one embodiment, the query request may be triggered automatically by the server. For example, after receiving the initial input text, the terminal sends it to the server; the server executes the text processing method provided by the embodiments of the present invention, triggers the query request after obtaining the target input text, and obtains the corresponding query response data according to the target input text.
In one embodiment, as shown in Fig. 10, the text processing method may further include:
Step S1002: detect the target type corresponding to the target input text.
Specifically, the type corresponding to the target input text is obtained from candidate types. The candidate types can be set as needed; for example, they may include a normal type and an abnormal type, or an advertisement type and a non-advertisement type. After the target input text is obtained, it may be detected whether the words in the target input text include a preset word; if so, the type corresponding to the preset word is taken as the target type. Alternatively, the target input text may be input into a pre-trained type discrimination machine learning model to obtain the corresponding target type. For example, assume the initial input text is "bone price prediction is accurate, benefit is high, please add prestige 123456789", and the corresponding target input text is "stock price prediction is accurate, income is high, please add WeChat 123456789". If the target type were detected from the initial input text, the initial input text might be judged to be of the non-advertisement type; if detection is performed on the target input text, the detected target type is accurate, namely the advertisement type.
Step S1004: when the type corresponding to the target input text is a preset type, filter the initial input text.
Specifically, filtering may consist of blocking the initial input text on the display interface where it would be shown, or of deleting the initial input text from the application, and so on; this can be set as needed.
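Steps S1002 and S1004 can be sketched with the preset-word variant of type detection. The preset word list, the type names, and the function names are assumptions made for illustration; the trained-model variant mentioned in the text is not shown.

```python
# Hypothetical preset words per type; they must be lowercase for matching.
PRESET_WORDS = {"advertisement": ["stock price", "wechat"]}


def detect_type(text, preset_words, default="non-advertisement"):
    """Return the first type whose preset words appear in the text."""
    lowered = text.lower()
    for type_name, words in preset_words.items():
        if any(word in lowered for word in words):
            return type_name
    return default


def should_filter(target_text, preset_words, filtered_types=("advertisement",)):
    """Decide from the corrected text whether the initial input is filtered."""
    return detect_type(target_text, preset_words) in filtered_types


initial_text = "bone price prediction is accurate, please add prestige 123456789"
target_text = "stock price prediction is accurate, please add WeChat 123456789"

type_from_initial = detect_type(initial_text, PRESET_WORDS)
type_from_target = detect_type(target_text, PRESET_WORDS)
filter_initial_input = should_filter(target_text, PRESET_WORDS)
```

Detection on the uncorrected text misses the advertisement because the misspelled words do not match any preset word, which is why the sketch passes the corrected target input text to `should_filter`.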
As shown in Fig. 11, the text processing method provided by the embodiments of the present invention is described below, taking error correction of the initial input text as an example.
1. The terminal receives the initial input text entered by the user through the customer service session interface of the application, and sends the initial input text to the server.
2. The server performs error position detection on the initial input text to obtain an error position set. Because a sentence that contains no error in the general domain may be erroneous in the target domain corresponding to the application, all positions may be taken as error positions.
3. The server obtains, from the word association dictionary, the candidate input words corresponding to the input word at each error position, giving a candidate input word set. The domain words of the word association dictionary may be obtained by a domain word recognition module, which may perform text recognition at preset intervals to obtain new domain words, obtain the mapping characters corresponding to the new domain words, and store the new domain words in the dictionary in association with their mapping characters. The domain word recognition module therefore supports online updating of words, and the word association dictionary updates its domain words as the content of the application grows.
4. After obtaining the candidate input word set, the server constructs word relation chains according to the composition relationship of the words of the initial input text.
5. The n-gram model is used to select the best result from the word relation chains; for example, the target word relation chain with the greatest calculated connection strength is taken as the best result. The text corresponding to this chain is the final correction result, and the words of the target word relation chain make up the target input text. The occurrence frequency of each word in the n-gram model may be updated as the text in the application changes, so as to update the n-gram model online.
6. The server looks up the corresponding answer statement according to the target input text, and returns the answer statement to the terminal.
7. The terminal displays the answer statement on the customer service session interface.
As shown in Fig. 12, in one embodiment, a text recognition apparatus is provided. The text recognition apparatus may be integrated in the server 120 and the terminal 110 described above, and may specifically include a target candidate word obtaining module 1202, a set obtaining module 1204, an importance calculation module 1206, a relevance obtaining module 1208 and a domain word obtaining module 1210.
The target candidate word obtaining module 1202 is configured to obtain text to be recognized, and to obtain target candidate words according to the characters in the text to be recognized;
The set obtaining module 1204 is configured to obtain a general-domain text set and the target text set of the target domain corresponding to the text to be recognized;
The importance calculation module 1206 is configured to calculate the target importance of a target candidate word in the target text set and its reference importance in the general-domain text set;
The relevance obtaining module 1208 is configured to calculate the target relevance between the target candidate word and the target domain according to the target importance and reference importance corresponding to the target candidate word;
The domain word obtaining module 1210 is configured to take the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, as shown in Fig. 13, the text recognition apparatus further includes:
A mapping character determining module 1302, configured to determine the mapping characters corresponding to a domain word according to a mapping relationship, the mapping relationship including at least one of a similar-shape mapping and a similar-sound mapping;
An association relationship establishing module 1304, configured to establish the association relationship between the domain word and the mapping characters.
In one embodiment, the target candidate word obtaining module 1202 is configured to: generate an initial candidate word set according to the adjacency of characters in the text to be recognized; calculate the word association degree and word independence degree, in the target text set, of each initial candidate word in the initial candidate word set; calculate the word generation degree of each initial candidate word according to its word association degree and word independence degree; and select target candidate words from the initial candidate word set according to the word generation degree of each initial candidate word.
In one embodiment, the text recognition apparatus further includes: a word forming module, configured to, when the word independence degree corresponding to an initial candidate word is less than a first threshold, form a new initial candidate word from the initial candidate word and the characters adjacent to it in the text to be recognized; and an adding module, configured to add the new initial candidate word to the initial candidate word set.
In one embodiment, the relevance obtaining module 1208 is configured to: calculate the initial relevance between the target candidate word and the target domain according to the target importance and reference importance corresponding to the target candidate word; determine the corresponding relevance confidence according to the number of occurrences of the target candidate word in the target text set; and obtain the target relevance according to the initial relevance and the relevance confidence.
As shown in Fig. 14, in one embodiment, a text processing apparatus is provided. The text processing apparatus may be integrated in the server 120 and the terminal 110 described above, and may specifically include an initial input text obtaining module 1402, an association relationship obtaining module 1404, a target domain word obtaining module 1406 and a target input text obtaining module 1408.
The initial input text obtaining module 1402 is configured to obtain the initial input text entered through an application;
The association relationship obtaining module 1404 is configured to obtain the association relationship corresponding to the target domain of the application, the association relationship being the association relationship between domain words and mapping characters, where the domain words are identified according to the text to be recognized of the application, the general-domain text set, and the target text set corresponding to the target domain;
The target domain word obtaining module 1406 is configured to determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship;
The target input text obtaining module 1408 is configured to adjust the initial input text according to the target domain word to obtain the target input text.
In one embodiment, the target input text obtaining module 1408 is configured to: obtain the candidate input words corresponding to the initial input text; construct a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words, and the target domain words; calculate the transition probability of transferring from a forward word to the current word in a word relation chain; obtain the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and select a target word relation chain from the word relation chain set according to the connection strengths of the word relation chains, taking the text corresponding to the target word relation chain as the target input text.
In one embodiment, the initial input text obtaining module is configured to: obtain a query statement entered through the application, and take the query statement as the initial input text;
The text processing apparatus further includes: a query request module, configured to obtain a query request, the query request including the target input text corresponding to the query statement; and a query response data obtaining module, configured to obtain the query response data obtained according to the target input text.
In one embodiment, the text processing apparatus further includes: a target type obtaining module, configured to detect the target type corresponding to the target input text; and a filtering module, configured to filter the initial input text when the target type corresponding to the target input text is a preset type.
Fig. 15 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 in Fig. 1. As shown in Fig. 15, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement at least one of the text recognition method and the text processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute at least one of the text recognition method and the text processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a key, trackball or touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse.
Fig. 16 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in Fig. 1. As shown in Fig. 16, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement at least one of the text recognition method and the text processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute at least one of the text recognition method and the text processing method.
Those skilled in the art will understand that the structures shown in Figs. 15 and 16 are only block diagrams of the parts of the structure relevant to the solution of the present application and do not limit the computer devices to which the solution is applied; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
In one embodiment, the text recognition apparatus provided by the present application may be implemented in the form of a computer program, which can run on a computer device as shown in Figs. 15 and 16. The memory of the computer device may store the program modules that make up the text recognition apparatus, for example the target candidate word obtaining module 1202, the set obtaining module 1204, the importance calculation module 1206, the relevance obtaining module 1208 and the domain word obtaining module 1210 shown in Fig. 12. The computer program composed of the program modules causes the processor to execute the steps of the text recognition method of the embodiments of the present application described in this specification.
For example, the computer device shown in Fig. 16 may, through the target candidate word obtaining module 1202 of the text recognition apparatus shown in Fig. 12, obtain text to be recognized and obtain target candidate words according to the characters in the text to be recognized; through the set obtaining module 1204, obtain a general-domain text set and the target text set of the target domain corresponding to the text to be recognized; through the importance calculation module 1206, calculate the target importance of a target candidate word in the target text set and its reference importance in the general-domain text set; through the relevance obtaining module 1208, calculate the target relevance between the target candidate word and the target domain according to the target importance and reference importance corresponding to the target candidate word; and through the domain word obtaining module 1210, take the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, the text processing apparatus provided by the present application may be implemented in the form of a computer program, which can run on a computer device as shown in Figs. 15 and 16. The memory of the computer device may store the program modules that make up the text processing apparatus, for example the initial input text obtaining module 1402, the association relationship obtaining module 1404, the target domain word obtaining module 1406 and the target input text obtaining module 1408 shown in Fig. 14. The computer program composed of the program modules causes the processor to execute the steps of the text processing method of the embodiments of the present application described in this specification.
For example, the computer device shown in Fig. 16 may, through the initial input text obtaining module 1402 of the text processing apparatus shown in Fig. 14, obtain the initial input text entered through an application; through the association relationship obtaining module 1404, obtain the association relationship corresponding to the target domain of the application, the association relationship being the association relationship between domain words and mapping characters, where the domain words are identified according to the text to be recognized of the application, the general-domain text set, and the target text set corresponding to the target domain; through the target domain word obtaining module 1406, determine the target domain word corresponding to the initial input text according to the initial input text and the association relationship; and through the target input text obtaining module 1408, adjust the initial input text according to the target domain word to obtain the target input text.
In one embodiment it is proposed that a kind of computer equipment, computer equipment include memory, processor and storage
On a memory and the computer program that can run on a processor, processor perform the steps of when executing computer program
Text to be identified is obtained, target candidate word is obtained according to the character in text to be identified;Obtain general field text collection and
The target text set of the corresponding target domain of text to be identified;Calculate target weight of the target candidate word in target text set
Spend and general field text collection reference different degree;According to the corresponding target different degree of target candidate word and reference
The target degree of correlation of target candidate word and target domain is calculated in different degree;According to the target degree of correlation using target candidate word as
The domain term of target domain.
In one embodiment, after the processor takes the target candidate word as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a shape-similarity mapping and a sound-similarity mapping; and establishing an association relationship between the domain word and the mapping character.
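The association step can be sketched as follows, under stated assumptions: `SHAPE_NEAR` and `SOUND_NEAR` are tiny hypothetical lookup tables standing in for real shape-similarity and sound-similarity mappings, and the association is stored as an index from each single-character variant back to its domain word. This storage layout is one possible choice, not the one prescribed here.

```python
# Hypothetical lookup tables; real mappings would cover far more characters.
SHAPE_NEAR = {"王": ["玉", "主"]}
SOUND_NEAR = {"者": ["这"], "耀": ["药", "要"]}


def mapping_characters(ch):
    """Characters mapped to `ch` by shape similarity or sound similarity."""
    return SHAPE_NEAR.get(ch, []) + SOUND_NEAR.get(ch, [])


def build_association(domain_words):
    """Index every single-character variant of a domain word back to that word."""
    association = {}
    for word in domain_words:
        for i, ch in enumerate(word):
            for alt in mapping_characters(ch):
                variant = word[:i] + alt + word[i + 1:]
                association.setdefault(variant, set()).add(word)
    return association


association = build_association(["王者荣耀"])
```

With such an index, a mistyped string found in the input can be looked up directly to recover the domain word it most likely stands for.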
In one embodiment, the obtaining, performed by the processor, of the target candidate word according to the characters in the text to be identified includes: generating an initial candidate word set according to the adjacency of characters in the text to be identified; calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set; calculating a word formation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the initial candidate word set according to the word formation degree of each initial candidate word to obtain the target candidate word.
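The screening described above can be illustrated with the following hedged sketch. It assumes that the association degree is the lowest pointwise mutual information over the two-way splits of a candidate, that the independence degree is the smaller of the left-neighbour and right-neighbour entropies, and that the formation degree is their product. These measures, the function names, and the threshold are assumptions for illustration. Latin letters stand in for characters.

```python
import math
from collections import Counter


def occurrences(word, text):
    """Start indices of every (possibly overlapping) occurrence of word in text."""
    return [i for i in range(len(text) - len(word) + 1) if text.startswith(word, i)]


def initial_candidates(text, max_len=3):
    """Adjacent-character sequences of length 2..max_len."""
    return {text[i:i + n] for n in range(2, max_len + 1)
            for i in range(len(text) - n + 1)}


def association_degree(word, text):
    """Assumed measure: lowest pointwise mutual information over two-way splits."""
    def prob(s):
        return len(occurrences(s, text)) / len(text)
    return min(math.log(prob(word) / (prob(word[:k]) * prob(word[k:])))
               for k in range(1, len(word)))


def entropy(counter):
    """Shannon entropy (natural log) of a Counter; zero for an empty Counter."""
    total = sum(counter.values())
    return sum(-(c / total) * math.log(c / total) for c in counter.values())


def independence_degree(word, text):
    """Assumed measure: smaller of the left- and right-neighbour entropies."""
    left, right = Counter(), Counter()
    for i in occurrences(word, text):
        if i > 0:
            left[text[i - 1]] += 1
        end = i + len(word)
        if end < len(text):
            right[text[end]] += 1
    return min(entropy(left), entropy(right))


def formation_degree(word, text):
    """Assumed combination: product of association and independence."""
    return association_degree(word, text) * independence_degree(word, text)


def target_candidates(text, threshold=0.5, max_len=3):
    """Initial candidates whose formation degree exceeds the threshold."""
    return {w for w in initial_candidates(text, max_len)
            if formation_degree(w, text) > threshold}


sample = "cabdeabfgabh"
kept = target_candidates(sample)
```

In the sample string only "ab" recurs with varied neighbours on both sides, so it is the only candidate that survives the screening.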
In one embodiment, the calculating, performed by the processor, of the word association degree of each initial candidate word of the initial candidate word set in the target text set includes: determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set; determining an initial word association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and calculating a target word association degree according to the association confidence and the initial word association degree corresponding to the initial candidate word.
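A small sketch of how a confidence term might temper the association degree, assuming a saturating confidence `count / (count + k)` and a multiplicative combination. Both choices, and the constant `k`, are hypothetical and serve only to show why a candidate seen once should count for less than one seen many times.

```python
def association_confidence(count, k=5.0):
    """Assumed form: rises from 0 towards 1 as the occurrence count grows."""
    return count / (count + k)


def target_association(count, initial_association, k=5.0):
    """Assumed combination: confidence-weighted initial association degree."""
    return association_confidence(count, k) * initial_association


rare = target_association(1, 2.0)       # seen once in the target text set
frequent = target_association(50, 2.0)  # seen fifty times, same initial degree
```

Under these assumptions two candidates with the same initial association degree are ranked by how much evidence supports them.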
In one embodiment, the computer program further causes the processor to perform the following steps: when the word independence degree corresponding to an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and adding the new initial candidate word to the initial candidate word set.
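The extension step might look like the sketch below. The independence measure is passed in as a callable because its exact definition is not restated here; the function names and the choice to extend on both the left and the right side are illustrative assumptions.

```python
def extend_candidate(word, text):
    """New candidates formed by joining `word` with each adjacent character."""
    extended = set()
    for i in range(len(text) - len(word) + 1):
        if text.startswith(word, i):
            if i > 0:
                extended.add(text[i - 1] + word)
            end = i + len(word)
            if end < len(text):
                extended.add(word + text[end])
    return extended


def grow_candidates(candidates, text, independence, first_threshold):
    """Add extended forms of every candidate whose independence is too low."""
    grown = set(candidates)
    for word in candidates:
        if independence(word, text) < first_threshold:
            grown |= extend_candidate(word, text)
    return grown


# A constant independence function stands in for a real measure.
low = grow_candidates({"ab"}, "xabyab", lambda w, t: 0.0, 0.5)
high = grow_candidates({"ab"}, "xabyab", lambda w, t: 1.0, 0.5)
```

A low independence degree suggests the candidate is a fragment of a longer word, which is why it is grown by its neighbours rather than discarded outright.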
In one embodiment, the calculating, performed by the processor, of the target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word includes: calculating an initial relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; determining a corresponding relevance confidence according to the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance according to the initial relevance and the relevance confidence.
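One possible reading of this computation, offered only as a hedged sketch: the symbols, the log-ratio form of the initial relevance, and the saturating form of the confidence are assumptions for illustration and are not taken from this passage.

```latex
\begin{aligned}
R_0(w) &= \log\frac{I_T(w) + \epsilon}{I_G(w) + \epsilon}, \\
c(w)   &= \frac{n_T(w)}{n_T(w) + k}, \\
R(w)   &= c(w)\, R_0(w)
\end{aligned}
```

Here $I_T(w)$ is the target importance of candidate $w$, $I_G(w)$ is its reference importance, $n_T(w)$ is its number of occurrences in the target text set, and $\epsilon > 0$ and $k > 0$ are hypothetical smoothing constants. Under this reading, a candidate that appears only a few times has its relevance scaled down even when its importance ratio is high.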
In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps: obtaining an initial input text entered through an application; obtaining an association relationship corresponding to the target domain of the application, where the association relationship is an association between a domain word and a mapping character, and the domain word is identified according to a text to be identified, a general-domain text set, and a target text set corresponding to the target domain; determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text.
In one embodiment, the adjusting, performed by the processor, of the initial input text according to the target domain word to obtain the target input text includes: obtaining each candidate input word corresponding to the initial input text; constructing a set of word relation chains according to the composition relationship among the words of the initial input text, the candidate input words, and the target domain word; calculating, for each word relation chain, the transition probability of moving from the preceding word to the current word; obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and screening the set of word relation chains according to the connection strength of each word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as the target input text.
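The chain selection above can be sketched as follows, assuming that the alternatives are aligned by position (`slots`), that the transition probabilities come from a precomputed bigram table, and that the connection strength is the product of the transition probabilities along a chain. The table values, the `floor` probability for unseen transitions, and all names are hypothetical.

```python
from itertools import product


def connection_strength(chain, transition, floor=1e-6):
    """Assumed definition: product of forward-word to current-word probabilities."""
    strength = 1.0
    for prev, cur in zip(chain, chain[1:]):
        strength *= transition.get((prev, cur), floor)
    return strength


def best_chain(slots, transition):
    """Enumerate every word relation chain and keep the strongest one."""
    chains = product(*slots)
    return max(chains, key=lambda c: connection_strength(c, transition))


# Each slot lists the original word plus candidate and target domain words.
slots = [["我想玩"], ["王者荣药", "王者荣耀"], ["游戏"]]
transition = {
    ("我想玩", "王者荣耀"): 0.4,
    ("王者荣耀", "游戏"): 0.5,
    ("我想玩", "王者荣药"): 0.001,
}
target_input_text = "".join(best_chain(slots, transition))
```

Exhaustive enumeration is used here for clarity; for long inputs a dynamic-programming search over the same lattice would avoid the combinatorial growth.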
In one embodiment, the obtaining, performed by the processor, of the initial input text entered through the application includes: obtaining a query statement entered through the application, and taking the query statement as the initial input text. The computer program further causes the processor to perform the following steps: obtaining a query request that includes the target input text corresponding to the query statement; and obtaining query response data according to the target input text.
In one embodiment, the computer program further causes the processor to perform the following steps: detecting the target type corresponding to the target input text; and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining the text to be identified of the application, and obtaining a target candidate word according to the characters in the text to be identified; obtaining the general-domain text set and the target text set; calculating a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set; calculating a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a shape-similarity mapping and a sound-similarity mapping; and establishing an association relationship between the domain word and the corresponding mapping character.
In one embodiment, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium and, when executed by a processor, causes the processor to perform the following steps: obtaining a text to be identified, and obtaining a target candidate word according to the characters in the text to be identified; obtaining a general-domain text set and a target text set of the target domain corresponding to the text to be identified; calculating a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set; calculating a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, after the processor takes the target candidate word as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a shape-similarity mapping and a sound-similarity mapping; and establishing an association relationship between the domain word and the mapping character.
In one embodiment, the obtaining, performed by the processor, of the target candidate word according to the characters in the text to be identified includes: generating an initial candidate word set according to the adjacency of characters in the text to be identified; calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set; calculating a word formation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the initial candidate word set according to the word formation degree of each initial candidate word to obtain the target candidate word.
In one embodiment, the calculating, performed by the processor, of the word association degree of each initial candidate word of the initial candidate word set in the target text set includes: determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set; determining an initial word association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and calculating a target word association degree according to the association confidence and the initial word association degree corresponding to the initial candidate word.
In one embodiment, the computer program further causes the processor to perform the following steps: when the word independence degree corresponding to an initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and adding the new initial candidate word to the initial candidate word set.
In one embodiment, the calculating, performed by the processor, of the target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word includes: calculating an initial relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; determining a corresponding relevance confidence according to the number of occurrences of the target candidate word in the target text set; and obtaining the target relevance according to the initial relevance and the relevance confidence.
In one embodiment, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium and, when executed by a processor, causes the processor to perform the following steps: obtaining an initial input text entered through an application; obtaining an association relationship corresponding to the target domain of the application, where the association relationship is an association between a domain word and a mapping character, and the domain word is identified according to a text to be identified, a general-domain text set, and a target text set corresponding to the target domain; determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and adjusting the initial input text according to the target domain word to obtain a target input text.
In one embodiment, the adjusting, performed by the processor, of the initial input text according to the target domain word to obtain the target input text includes: obtaining each candidate input word corresponding to the initial input text; constructing a set of word relation chains according to the composition relationship among the words of the initial input text, the candidate input words, and the target domain word; calculating, for each word relation chain, the transition probability of moving from the preceding word to the current word; obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and screening the set of word relation chains according to the connection strength of each word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as the target input text.
In one embodiment, the obtaining, performed by the processor, of the initial input text entered through the application includes: obtaining a query statement entered through the application, and taking the query statement as the initial input text. The computer program further causes the processor to perform the following steps: obtaining a query request that includes the target input text corresponding to the query statement; and obtaining query response data according to the target input text.
In one embodiment, the computer program further causes the processor to perform the following steps: detecting the target type corresponding to the target input text; and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining the text to be identified of the application, and obtaining a target candidate word according to the characters in the text to be identified; obtaining the general-domain text set and the target text set; calculating a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set; calculating a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and taking the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the following steps: determining a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a shape-similarity mapping and a sound-similarity mapping; and establishing an association relationship between the domain word and the corresponding mapping character.
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and although they are described specifically and in detail, they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (15)
1. A text recognition method, comprising:
obtaining a text to be identified, and obtaining a target candidate word according to the characters in the text to be identified;
obtaining a general-domain text set and a target text set of the target domain corresponding to the text to be identified;
calculating a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set;
calculating a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and
taking the target candidate word as a domain word of the target domain according to the target relevance.
2. The method according to claim 1, characterized in that, after the taking of the target candidate word as a domain word of the target domain according to the target relevance, the method further comprises:
determining a mapping character corresponding to the domain word according to a mapping relationship, the mapping relationship comprising at least one of a shape-similarity mapping and a sound-similarity mapping; and
establishing an association relationship between the domain word and the mapping character.
3. The method according to claim 1, characterized in that the obtaining of the target candidate word according to the characters in the text to be identified comprises:
generating an initial candidate word set according to the adjacency of characters in the text to be identified;
calculating a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set;
calculating a word formation degree of each initial candidate word according to the word association degree and the word independence degree; and
screening the initial candidate word set according to the word formation degree of each initial candidate word to obtain the target candidate word.
4. The method according to claim 3, characterized in that the calculating of the word association degree of each initial candidate word of the initial candidate word set in the target text set comprises:
determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set;
determining an initial word association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set; and
calculating a target word association degree according to the association confidence and the initial word association degree corresponding to the initial candidate word.
5. The method according to claim 3, characterized in that the method further comprises:
when the word independence degree corresponding to the initial candidate word is less than a first threshold, forming a new initial candidate word from the initial candidate word and a character adjacent to the initial candidate word in the text to be identified; and
adding the new initial candidate word to the initial candidate word set.
6. A text processing method, comprising:
obtaining an initial input text;
obtaining an association relationship corresponding to the target domain corresponding to the initial input text, the association relationship being an association between a domain word and a mapping character, and the domain word being identified according to a text to be identified corresponding to the target domain, a general-domain text set, and a target text set corresponding to the target domain;
determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and
adjusting the initial input text according to the target domain word to obtain a target input text.
7. The method according to claim 6, characterized in that the adjusting of the initial input text according to the target domain word to obtain the target input text comprises:
obtaining each candidate input word corresponding to the initial input text;
constructing a set of word relation chains according to the composition relationship among the words of the initial input text, the candidate input words, and the target domain word;
calculating, for each word relation chain, the transition probability of moving from the preceding word to the current word;
obtaining the connection strength of the word relation chain according to the transition probabilities corresponding to the word relation chain; and
screening the set of word relation chains according to the connection strength of each word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as the target input text.
8. The method according to claim 6, characterized in that the obtaining of the initial input text comprises:
obtaining a query statement entered in an application, and taking the query statement as the initial input text;
and the method further comprises:
obtaining a query request, the query request including the target input text corresponding to the query statement; and
obtaining query response data according to the target input text.
9. The method according to claim 6, characterized in that the method further comprises:
obtaining the text to be identified, and obtaining a target candidate word according to the characters in the text to be identified;
obtaining the general-domain text set and the target text set;
calculating a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set;
calculating a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and
taking the target candidate word as a domain word of the target domain according to the target relevance.
10. A text recognition apparatus, comprising:
a target candidate word obtaining module, configured to obtain a text to be identified and to obtain a target candidate word according to the characters in the text to be identified;
a set obtaining module, configured to obtain a general-domain text set and a target text set of the target domain corresponding to the text to be identified;
an importance calculating module, configured to calculate a target importance of the target candidate word in the target text set and a reference importance of the target candidate word in the general-domain text set;
a relevance obtaining module, configured to calculate a target relevance between the target candidate word and the target domain according to the target importance and the reference importance corresponding to the target candidate word; and
a domain word obtaining module, configured to take the target candidate word as a domain word of the target domain according to the target relevance.
11. The apparatus according to claim 10, characterized in that the apparatus further comprises:
a mapping character determining module, configured to determine a mapping character corresponding to the domain word according to a mapping relationship, the mapping relationship comprising at least one of a shape-similarity mapping and a sound-similarity mapping; and
an association relationship establishing module, configured to establish an association relationship between the domain word and the mapping character.
12. The apparatus according to claim 10, characterized in that the target candidate word obtaining module is configured to:
generate an initial candidate word set according to the adjacency of characters in the text to be identified;
calculate a word association degree and a word independence degree of each initial candidate word of the initial candidate word set in the target text set;
calculate a word formation degree of each initial candidate word according to the word association degree and the word independence degree; and
screen the initial candidate word set according to the word formation degree of each initial candidate word to obtain the target candidate word.
13. A text processing apparatus, comprising:
an initial input text obtaining module, configured to obtain an initial input text;
an association relationship obtaining module, configured to obtain an association relationship corresponding to the target domain corresponding to the initial input text, the association relationship being an association between a domain word and a mapping character, and the domain word being identified according to a text to be identified corresponding to the target domain, a general-domain text set, and a target text set corresponding to the target domain;
a target domain word obtaining module, configured to determine a target domain word corresponding to the initial input text according to the initial input text and the association relationship; and
a target input text obtaining module, configured to adjust the initial input text according to the target domain word to obtain a target input text.
14. A computer device, characterized by comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of at least one of the text recognition method according to any one of claims 1 to 5 and the text processing method according to any one of claims 6 to 9.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, causes the processor to perform the steps of at least one of the text recognition method according to any one of claims 1 to 5 and the text processing method according to any one of claims 6 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811168737.9A CN110162681B (en) | 2018-10-08 | 2018-10-08 | Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811168737.9A CN110162681B (en) | 2018-10-08 | 2018-10-08 | Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110162681A (en) | 2019-08-23 |
CN110162681B (en) | 2023-04-18 |
Family
ID=67645117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811168737.9A Active CN110162681B (en) | 2018-10-08 | 2018-10-08 | Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162681B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN111552806A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Method for unsupervised construction of entity set in building field |
CN111710328A (en) * | 2020-06-16 | 2020-09-25 | 北京爱医声科技有限公司 | Method, device and medium for selecting training samples of voice recognition model |
CN112016305A (en) * | 2020-09-09 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN112101020A (en) * | 2020-08-27 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for training key phrase identification model |
CN113744736A (en) * | 2021-09-08 | 2021-12-03 | 北京声智科技有限公司 | Command word recognition method and device, electronic equipment and storage medium |
CN113743409A (en) * | 2020-08-28 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Text recognition method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11238051A (en) * | 1998-02-23 | 1999-08-31 | Toshiba Corp | Chinese input conversion processor, chinese input conversion processing method and recording medium stored with chinese input conversion processing program |
US20040086179A1 (en) * | 2002-11-04 | 2004-05-06 | Yue Ma | Post-processing system and method for correcting machine recognized text |
US20040210434A1 (en) * | 1999-11-05 | 2004-10-21 | Microsoft Corporation | System and iterative method for lexicon, segmentation and language model joint optimization |
CN101572083A (en) * | 2008-04-30 | 2009-11-04 | 富士通株式会社 | Method and device for making up words by using prosodic words |
CN106445906A (en) * | 2015-08-06 | 2017-02-22 | 北京国双科技有限公司 | Generation method and apparatus for medium-and-long phrase in domain lexicon |
CN106681981A (en) * | 2015-11-09 | 2017-05-17 | 北京国双科技有限公司 | Chinese part-of-speech tagging method and device |
CN106708893A (en) * | 2015-11-17 | 2017-05-24 | 华为技术有限公司 | Error correction method and device for search query term |
CN107102746A (en) * | 2016-02-19 | 2017-08-29 | 北京搜狗科技发展有限公司 | Candidate word generation method, device and the device generated for candidate word |
JP2017151804A (en) * | 2016-02-25 | 2017-08-31 | 国立研究開発法人情報通信研究機構 | Automatic translation feature weight optimization device and computer program for the same |
CN108170674A (en) * | 2017-12-27 | 2018-06-15 | 东软集团股份有限公司 | Part-of-speech tagging method and apparatus, program product and storage medium |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11238051A (en) * | 1998-02-23 | 1999-08-31 | Toshiba Corp | Chinese input conversion processor, chinese input conversion processing method and recording medium stored with chinese input conversion processing program |
US20040210434A1 (en) * | 1999-11-05 | 2004-10-21 | Microsoft Corporation | System and iterative method for lexicon, segmentation and language model joint optimization |
US20040086179A1 (en) * | 2002-11-04 | 2004-05-06 | Yue Ma | Post-processing system and method for correcting machine recognized text |
CN101572083A (en) * | 2008-04-30 | 2009-11-04 | 富士通株式会社 | Method and device for making up words by using prosodic words |
CN106445906A (en) * | 2015-08-06 | 2017-02-22 | 北京国双科技有限公司 | Generation method and apparatus for medium-and-long phrase in domain lexicon |
CN106681981A (en) * | 2015-11-09 | 2017-05-17 | 北京国双科技有限公司 | Chinese part-of-speech tagging method and device |
CN106708893A (en) * | 2015-11-17 | 2017-05-24 | 华为技术有限公司 | Error correction method and device for search query term |
CN107102746A (en) * | 2016-02-19 | 2017-08-29 | 北京搜狗科技发展有限公司 | Candidate word generation method and device, and device for candidate word generation |
JP2017151804A (en) * | 2016-02-25 | 2017-08-31 | 国立研究開発法人情報通信研究機構 | Automatic translation feature weight optimization device and computer program for the same |
CN108170674A (en) * | 2017-12-27 | 2018-06-15 | 东软集团股份有限公司 | Part-of-speech tagging method and apparatus, program product and storage medium |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110765996B (en) * | 2019-10-21 | 2022-07-29 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN111552806A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Method for unsupervised construction of entity set in building field |
CN111710328A (en) * | 2020-06-16 | 2020-09-25 | 北京爱医声科技有限公司 | Method, device and medium for selecting training samples of voice recognition model |
CN111710328B (en) * | 2020-06-16 | 2024-01-12 | 北京爱医声科技有限公司 | Training sample selection method, device and medium for speech recognition model |
CN112101020A (en) * | 2020-08-27 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for training key phrase identification model |
CN112101020B (en) * | 2020-08-27 | 2023-08-04 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for training key phrase identification model |
CN113743409A (en) * | 2020-08-28 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Text recognition method and device |
CN112016305A (en) * | 2020-09-09 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN112016305B (en) * | 2020-09-09 | 2023-03-28 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN113744736A (en) * | 2021-09-08 | 2021-12-03 | 北京声智科技有限公司 | Command word recognition method and device, electronic equipment and storage medium |
CN113744736B (en) * | 2021-09-08 | 2023-12-08 | 北京声智科技有限公司 | Command word recognition method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110162681B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110162681A (en) | | Text identification, text handling method, device, computer equipment and storage medium |
CN111061856B (en) | | Knowledge perception-based news recommendation method |
JP2021089739A (en) | | Question answering method and language model training method, apparatus, device, and storage medium |
CN109858010B (en) | | Method and device for recognizing new words in field, computer equipment and storage medium |
CN110121705A (en) | | Systems and methods for applying pragmatics principles to interaction with visual analytics |
CN107220386A (en) | | Information pushing method and device |
CN110162519A (en) | | Data clearing method |
CN109684627A (en) | | File classification method and device |
CN107704512A (en) | | Financial product recommendation method based on social data, electronic device and medium |
CN109087205A (en) | | Public opinion index prediction method and device, computer equipment and readable storage medium |
CN112328909B (en) | | Information recommendation method and device, computer equipment and medium |
CN112035595A (en) | | Construction method and device of audit rule engine in medical field and computer equipment |
CN112926308B (en) | | Method, device, equipment, storage medium and program product for matching text |
CN108932320A (en) | | Article search method, apparatus and electronic equipment |
CN106610932A (en) | | Corpus processing method and device and corpus analyzing method and device |
CN113987182A (en) | | Fraud entity identification method, device and related equipment based on security intelligence |
CN112328869A (en) | | User loan willingness prediction method and device and computer system |
CN115204971A (en) | | Product recommendation method and device, electronic equipment and computer-readable storage medium |
CN113220900B (en) | | Modeling method of entity disambiguation model and entity disambiguation prediction method |
CN114265835A (en) | | Data analysis method and device based on graph mining and related equipment |
CN110389963A (en) | | Big-data-based channel effect recognition method, device, equipment and storage medium |
CN112749238A (en) | | Search ranking method and device, electronic equipment and computer-readable storage medium |
CN106575418A (en) | | Suggested keywords |
CN110008282A (en) | | Transaction data synchronization interconnection method, device, computer equipment and storage medium |
CN113779994B (en) | | Element extraction method, element extraction device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||