CN110162681B - Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium - Google Patents

Publication number
CN110162681B
Authority
CN
China
Prior art keywords
word
target
text
candidate
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811168737.9A
Other languages
Chinese (zh)
Other versions
CN110162681A (en)
Inventor
黄子轩
王军伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811168737.9A priority Critical patent/CN110162681B/en
Publication of CN110162681A publication Critical patent/CN110162681A/en
Application granted granted Critical
Publication of CN110162681B publication Critical patent/CN110162681B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text recognition method, a text processing method, corresponding devices, a computer device and a storage medium. The text processing method comprises the following steps: acquiring an initial input text; acquiring an association relation corresponding to a target field corresponding to the initial input text, wherein the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized in the target field, a general field text set and a target text set corresponding to the target field; determining a target field word corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target field word to obtain a target input text. Because the adjustment relies on field words specific to the target field, the resulting target input text has high accuracy.

Description

Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
Technical Field
The present invention relates to the field of the Internet, and in particular to methods and apparatuses for text recognition and text processing, a computer device, and a storage medium.
Background
With the rapid development of the internet, the problem of information overload is increasingly prominent. More and more words appear on the network, and in many scenarios the information input by a user needs to be adjusted to the information that the user actually intends to input, for example by displaying candidate words according to an input pinyin or by correcting words input by the user.
At present, when the information that actually needs to be input is determined from the information input by a user, words that are similar in shape or in pinyin to the words input by the user are generally screened from a word bank. As a result, a large number of words are screened out, their relevance to what the user actually intends to input is usually low, and the accuracy is low.
Disclosure of Invention
Therefore, it is necessary to provide a text recognition method, a text processing method, a text recognition device, a text processing device, a computer device, and a storage medium that address the above problems. The domain words of a target domain are recognized according to the text to be recognized, a general domain text set, and a text set of the target domain corresponding to the text to be recognized, so that the recognized domain words are strongly correlated with the target domain and the accuracy of text recognition and text processing is high.
A method of text recognition, the method comprising: acquiring a text to be recognized, and obtaining a target candidate word according to characters in the text to be recognized; acquiring a general field text set and a target text set of a target field corresponding to the text to be recognized; calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; calculating a target relevance between the target candidate word and the target field according to a target importance corresponding to the target candidate word and a reference importance; and taking the target candidate word as a field word of the target field according to the target relevance.
In an embodiment, the calculating, according to the target importance and the reference importance corresponding to the target candidate word, the target relevance between the target candidate word and the target field includes: calculating to obtain an initial correlation degree of the target candidate word and the target field according to a target importance degree corresponding to the target candidate word and a reference importance degree; determining a corresponding relevancy confidence degree according to the occurrence frequency of the target candidate words in the target text set; and obtaining the target relevance according to the initial relevance and the relevance confidence.
In one embodiment, the text processing method further includes: detecting a target type corresponding to the target input text; and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
A text recognition apparatus, the apparatus comprising: the target candidate word obtaining module is used for obtaining a text to be recognized and obtaining a target candidate word according to characters in the text to be recognized; the set acquisition module is used for acquiring a general field text set and a target text set of a target field corresponding to the text to be recognized; the importance calculation module is used for calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; the relevancy obtaining module is used for calculating and obtaining the target relevancy between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance; and the field word acquisition module is used for taking the target candidate word as the field word of the target field according to the target relevance.
A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the text recognition method described above.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned text recognition method.
With the above text recognition method, text recognition device, computer equipment and storage medium, when word recognition is needed, a text to be recognized is obtained and a target candidate word is obtained according to characters in the text to be recognized; a general field text set and a target text set of the target field corresponding to the text to be recognized are acquired; the target importance of the target candidate word in the target text set and its reference importance in the general field text set are calculated; a target relevance between the target candidate word and the target field is calculated according to the target importance and the reference importance; and the target candidate word is taken as a field word of the target field according to the target relevance. Because the target candidate word is obtained from the text to be recognized, and its importance in the text set of the target field is compared with its importance in the general field text set, the comparison reflects how closely the target candidate word is related to the target field, so that field words related to the target field of the text to be recognized can be obtained with high accuracy.
A method of text processing, the method comprising: acquiring an initial input text; acquiring an association relation corresponding to a target field corresponding to the initial input text, wherein the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized corresponding to the target field, a general field text set and a target text set corresponding to the target field; determining a target field word corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target field word to obtain a target input text.
A text processing apparatus, the apparatus comprising: an initial input text acquisition module, used for acquiring an initial input text; an association relation obtaining module, used for obtaining the association relation corresponding to a target field corresponding to the initial input text, wherein the association relation is the association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized corresponding to the target field, a general field text set and a target text set corresponding to the target field; a target field word acquisition module, used for determining a target field word corresponding to the initial input text according to the initial input text and the association relation; and a target input text obtaining module, used for adjusting the initial input text according to the target field word to obtain a target input text.
A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the text processing method described above.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned text processing method.
The above text processing method, text processing device, computer equipment and storage medium can determine the field words corresponding to a text input in an application according to the association relation between the field words of the application's target field and the mapping characters, and adjust the initial input text according to those field words to obtain the target input text. Because the field words are recognized from the text to be recognized, the general field text set and the text set of the target field, they are words related to the target field, so the target input text obtained by this field-specific adjustment has high accuracy.
Drawings
FIG. 1 is a diagram of an application environment of a text processing method and a text recognition method provided in an embodiment;
FIG. 2 is a flow diagram of a method of text recognition in one embodiment;
FIG. 3A is a flow diagram of a method for text recognition in one embodiment;
FIG. 3B is a flow diagram that illustrates the establishment of associations between domain words and mapped characters, under an embodiment;
FIG. 4 is a flow diagram illustrating the derivation of target candidate words from characters in a text to be recognized in one embodiment;
FIG. 5 is a flow diagram that illustrates a method for text processing in one embodiment;
FIG. 6 is a flowchart illustrating the process of adjusting an initial input text according to a target domain word to obtain a target input text according to an embodiment;
FIG. 7 is a diagram illustrating a term relationship chain obtained in one embodiment;
FIG. 8 is a diagram illustrating obtaining target input text based on transition probabilities of a word relationship chain in one embodiment;
FIG. 9 is a diagram that illustrates the display of target input text that corresponds to initial input text, in one embodiment;
FIG. 10 is a flow diagram that illustrates a method for text processing in one embodiment;
FIG. 11 is a diagram illustrating error correction performed on an initial input text to obtain a target input text, in accordance with an embodiment;
FIG. 12 is a block diagram showing the construction of a text recognition apparatus according to an embodiment;
FIG. 13 is a block diagram showing the structure of a text recognition apparatus according to an embodiment;
FIG. 14 is a block diagram showing a configuration of a text processing apparatus according to an embodiment;
FIG. 15 is a block diagram showing an internal configuration of a computer device according to one embodiment;
FIG. 16 is a block diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first threshold may be referred to as a second threshold, and similarly, a second threshold may be referred to as a first threshold, without departing from the scope of the present application.
FIG. 1 is a diagram of an application environment of the text processing method and the text recognition method provided in an embodiment. As shown in FIG. 1, the application environment includes a terminal 110 and a server 120.
In an embodiment, when domain words need to be acquired, a user may select a pre-acquired target text set and general domain text set on the terminal 110 and send a text recognition instruction to the server 120 through the terminal 110, where the recognition instruction carries the target text set and the general domain text set. After receiving the text recognition instruction, the server 120 acquires a text to be recognized and executes the text recognition method provided in the embodiment of the present invention to obtain the domain words of the target domain. The server 120 stores the domain words of the target domain in a thesaurus.
Domain words are words specific to a particular domain: they appear often in that domain but rarely in unrelated domains. For example, a domain word may be a word whose frequency of appearance in the specific domain is greater than a first preset frequency and whose frequency of appearance in the general domain is less than a second preset frequency, where both preset frequencies may be set as needed and the first preset frequency is greater than the second. Words such as "thread" and "compiler" are domain words in the computer domain: they generally appear in professional articles in the computer domain and rarely appear in unrelated domains such as the medical domain. Words such as "financing" and "benefit of this period" are domain words in the field of financial management and generally appear in articles displayed in financial applications. Domain words can also emerge over time. For example, the term "financing" was not a domain word in the financial field before the application "financing" was put on the market, but after the application was released the term came to be used more and more frequently in articles in the financial field and became a domain word of that field.
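The two-frequency criterion described above can be sketched as a simple predicate. This is only an illustration under stated assumptions: the function name `is_domain_word`, its parameter names, and the sample values are hypothetical and are not taken from the patent.

```python
def is_domain_word(freq_in_domain, freq_in_general, first_preset, second_preset):
    """Check the two-frequency criterion for a domain word.

    A word qualifies when it is frequent in the specific domain
    (above `first_preset`) and rare in the general domain
    (below `second_preset`). All names here are illustrative.
    """
    # The text requires the first preset frequency to exceed the second.
    if first_preset <= second_preset:
        raise ValueError("first_preset must be greater than second_preset")
    return freq_in_domain > first_preset and freq_in_general < second_preset


# Hypothetical frequencies: common in the domain, rare in general text.
print(is_domain_word(0.01, 0.0001, first_preset=0.005, second_preset=0.001))
```

The preset values are placeholders; the text only says they may be set as needed.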
In an embodiment, when a user needs to perform a query, for example to consult the intelligent customer service of a shopping application about a question, the user may input a query statement in the terminal 110, and the terminal 110 sends the query statement to the server 120. The server 120 uses the query statement as an initial input text, executes the text processing method provided by the embodiment of the present invention to obtain a target input text, obtains query response data corresponding to the target input text, and returns the query response data to the terminal 110, which displays it.
It should be understood that the above application environment is only an example and does not limit the text processing method provided by the embodiment of the present invention. In some embodiments, other application environments may also exist. For example, the terminal may also execute the text processing method provided in the embodiment of the present invention. As another example, the server 120 may obtain a post in a forum as the initial input text, execute the text processing method to obtain the target input text, determine from the target input text whether the post is an advertisement, and filter the advertisement information when the post is determined to be an advertisement. The server 120 may also pre-store the target text set and the general field text set, acquire a new text to be recognized at preset intervals, and execute the text recognition method.
The server 120 may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud servers, cloud databases, cloud storage, and a CDN. The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal 110 and the server 120 may be connected via a network.
As shown in fig. 2, in an embodiment, a text recognition method is provided, and this embodiment is mainly illustrated by applying the method to the terminal 110 or the server 120 in fig. 1. The method specifically comprises the following steps:
Step S202, a text to be recognized is obtained, and a target candidate word is obtained according to characters in the text to be recognized.
Specifically, the text to be recognized is a text from which field words are to be recognized, and it may be obtained according to a recognition instruction. The recognition instruction may be triggered by a real-time operation of a user or by a preset trigger condition; for example, the recognition instruction may be triggered at preset intervals to acquire a text to be recognized. The recognition instruction may carry one or more of the text to be recognized and an identifier corresponding to the text to be recognized. For example, the recognition instruction may carry a storage location corresponding to the text to be recognized, and the text to be recognized is obtained from that storage location. The target candidate words are obtained according to characters in the text to be recognized. A target candidate word may be a word composed of adjacent characters in the text to be recognized. The number of characters in a target candidate word may be set as needed, and may be 2 or 3, for example.
In one embodiment, the text to be recognized may be derived from an application corresponding to the target field and obtained according to the content in each page of the application. The content in each page of the application can be one or more of information data published in the application, conversation sentences generated by customer service consultation in the application, and information published in a communication forum corresponding to the application. Texts from the application, such as financial information published in a financial application, query sentences input by users in the financial application, and sentences answered by customer service, are generally strongly correlated with the field of the application. The probability that a target candidate word composed of characters in such a text is a field word is therefore high, which can improve the efficiency of text recognition.
In one embodiment, each candidate word composed of adjacent characters in the text to be recognized may be used as a target candidate word, or the target candidate words may be obtained by further screening the characters of the text to be recognized. For example, words that have already been determined to be domain words of the domain may be filtered out, or the candidate words composed of adjacent characters may be used as the target candidate words after the modal particles in the text to be recognized are removed.
In one embodiment, the target candidate word may be generated according to the adjacency of characters in the text to be recognized, for example by grouping adjacent characters into a target candidate word. The number of characters in a target candidate word may be determined as needed and may be, for example, 2 or 3. As a concrete example, assuming that the text to be recognized is "abcdefg", the target candidate words may be "ab", "bc", and the like, or "abc", "bcd", and the like. If "ab" has already been confirmed as a domain word of the domain, then "ab" need not be taken as a target candidate word.
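The grouping of adjacent characters into candidates can be sketched as a character n-gram generator. This is a minimal sketch, not the patented implementation: the function name `candidate_words` and the parameters `sizes` and `known_domain_words` are hypothetical names introduced for illustration.

```python
def candidate_words(text, sizes=(2, 3), known_domain_words=frozenset()):
    """Build target candidate words from adjacent characters of `text`.

    `sizes` lists the candidate lengths to generate (2 or 3 in the text).
    Words already confirmed as domain words are skipped, and each
    candidate is returned once, in order of first appearance.
    """
    candidates = []
    seen = set()
    for n in sizes:
        # Slide a window of n adjacent characters over the text.
        for i in range(len(text) - n + 1):
            word = text[i:i + n]
            if word in known_domain_words or word in seen:
                continue
            seen.add(word)
            candidates.append(word)
    return candidates


# "ab" is assumed to be an already confirmed domain word, so it is skipped.
print(candidate_words("abcdefg", sizes=(2,), known_domain_words={"ab"}))
# → ['bc', 'cd', 'de', 'ef', 'fg']
```

Deduplicating candidates is a design choice of this sketch; occurrence counts are computed separately against the text sets in the later steps.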
In one embodiment, the target candidate word may be further selected from the text to be recognized according to at least one of the word association degree and the word independence degree. For example, a candidate word whose word association degree is higher than a preset value and whose word independence degree is higher than a preset value may be selected as the target candidate word.
Step S204, a general field text set and a target text set of a target field corresponding to the text to be recognized are obtained, wherein the general field text set comprises a general field text.
Specifically, the target field is the field to which the text to be recognized belongs. It may be determined from the source of the text to be recognized; for example, if the text to be recognized is a financial article in a financial application, the target field is the financial field. The target field may also be obtained from input field information: when field words of a certain field are to be recognized, the text to be recognized input by the user and the field corresponding to that text can be received. The target text set is a text set of the target field. The target field can be a social field, a financial field, a medical field, or the like, as needed. The text set of the target field may be imported into the server, or obtained by the server from an application corresponding to the target field; target texts can also be obtained from other data sources to form the target text set. The general field text set is a text set formed by combining general field texts. A general field text is a text with weak field pertinence: unlike a highly specialized text, it is not tied to a single application field and can be applied universally. A general field text may be, for example, news information, which may be crawled from the internet, or general field texts may be manually selected and stored in the server. It is understood that the general field text set and the target text set may be stored in the server executing the text recognition method, or in other servers.
One or more of the general field text set and the target text set may also be obtained in real time. For example, when text recognition is required, texts published in the application within the last 10 days can be acquired to form the target text set. The number of texts in each text set may be set as needed; for example, the general field text set may contain 10,000 texts and the target text set may contain 100 texts.
In one embodiment, the number of texts in the general field text set is greater than the number of texts in the target text set, because the general field text set is composed of general field texts drawn from rich data sources, whereas texts in the target field are relatively scarce. For example, the ratio of the number of texts in the general field text set to the number of texts in the target text set may be 10. When the number of texts is counted, one article can be treated as one text; for question-and-answer sentences, each query sentence can be treated as one text, or a complete query session can be treated as one text, as needed.
Step S206, calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set.
Specifically, the importance indicates how important a word is in a text set: the greater the importance, the more important the candidate word is in that text set. The target importance is the importance of the target candidate word in the target text set, and the reference importance is the importance of the target candidate word in the general field text set. The importance can be determined from the number of occurrences of the target candidate word: the more occurrences, the greater the importance. For example, the number of occurrences itself may be used as the importance, or the frequency of occurrence derived from the number of occurrences may be used as the importance.
In one embodiment, the importance may be determined according to the frequency of occurrence of the target candidate word in the text set, formulated as follows: $P_g(w) = C_g(w) / C_g(\mathrm{ALL})$ and $P_f(w) = C_f(w) / C_f(\mathrm{ALL})$, where $w$ represents a target candidate word, $P$ represents frequency, $g$ represents the general field text set, $f$ represents the target text set, and $C$ represents the number of occurrences. Therefore, $P_g(w)$ represents the frequency of occurrence of the target candidate word $w$ in the general field text set $g$, $P_f(w)$ represents the frequency of occurrence of $w$ in the target text set $f$, $C_g(w)$ represents the number of occurrences of $w$ in $g$, and $C_f(w)$ represents the number of occurrences of $w$ in $f$. $C_g(\mathrm{ALL})$ represents the number of words in the general field text set $g$, and $C_f(\mathrm{ALL})$ represents the number of words in the target text set $f$. The number of words in a text set can be either the number of characters or the number of words obtained after word segmentation.
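The frequency-based importance can be sketched as follows. The sketch makes two assumptions that the text leaves open: the total C(ALL) is taken as the number of characters, which is one of the two options the text allows, and occurrences are counted with Python's `str.count`, which counts non-overlapping matches. The function names and sample texts are hypothetical.

```python
def count_occurrences(word, texts):
    """C(w): occurrences of `word` across all texts.

    str.count counts non-overlapping matches, a simplification
    chosen for this sketch.
    """
    return sum(text.count(word) for text in texts)


def importance(word, texts):
    """P(w) = C(w) / C(ALL), with C(ALL) taken as the character count."""
    total = sum(len(text) for text in texts)
    if total == 0:
        return 0.0  # an empty text set carries no evidence
    return count_occurrences(word, texts) / total


# Hypothetical text sets for illustration only.
target_texts = ["the thread holds the lock", "each thread runs"]
general_texts = ["the weather is fine today", "a thread of wool"]

print(importance("thread", target_texts))   # target importance Pf(w)
print(importance("thread", general_texts))  # reference importance Pg(w)
```

Switching to a segmented word count would only change how `total` is computed.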
Step S208, calculating the target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance.
Specifically, the target relevance degree indicates the relevance degree of the target candidate word and the target field, and the greater the relevance degree, the more relevant the target candidate word and the target field. The target importance and the target relevance form a positive correlation relationship, and the reference importance and the target relevance form a negative correlation relationship. That is, the target importance increases, the target correlation becomes larger, and the reference importance increases, the target correlation becomes smaller. In one embodiment, the target relevance may be a quotient of the target importance divided by the reference importance.
In one embodiment, the calculating the target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance includes: calculating according to the target importance corresponding to the target candidate word and the reference importance to obtain the initial correlation degree of the target candidate word and the target field; determining a corresponding correlation degree confidence coefficient according to the occurrence times of the target candidate words in the target text set; and obtaining the target correlation degree according to the initial correlation degree and the correlation degree confidence degree.
Specifically, the relevance confidence represents how reliable the initial relevance is, and it is positively correlated with the number of occurrences of the target candidate word in the target text set. When a target candidate word occurs only a few times in both the target text set and the general field text set, the quotient of the two frequencies can still be large, so the computed relevance may be high even though it rests on little evidence. A relevance confidence is therefore derived from the number of occurrences of the target candidate word in the target text set and used to adjust the relevance, which makes the resulting target relevance more accurate. In one embodiment, the target relevance can be formulated as $X(w) = \frac{P_f(w)}{P_g(w)} \times \log_2 C_f(w)$, where $X(w)$ represents the target relevance, the quotient $P_f(w) / P_g(w)$ represents the initial relevance, and $\log_2 C_f(w)$ represents the relevance confidence.
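The relevance formula can be sketched directly from raw counts. The text does not say what happens when a candidate never occurs in the general field text set or occurs at most once in the target text set, so the handling of those edge cases below is an assumption of this sketch, as are the function and parameter names.

```python
import math


def target_relevance(cf, cf_all, cg, cg_all):
    """X(w) = (Pf(w) / Pg(w)) * log2(Cf(w)).

    cf, cg: occurrences of the word in the target and general sets.
    cf_all, cg_all: total word (or character) counts of the two sets.
    """
    if cf <= 0:
        return 0.0  # assumption: a word absent from the target set scores 0
    confidence = math.log2(cf)  # relevance confidence
    if confidence == 0.0:
        return 0.0  # a single occurrence carries zero confidence
    if cg == 0:
        return math.inf  # assumption: Pg(w) = 0 is treated as unbounded
    initial = (cf / cf_all) / (cg / cg_all)  # initial relevance Pf/Pg
    return initial * confidence


# 8 occurrences in 1,000 target words against 4 in 100,000 general words:
# initial relevance is about 200 and confidence is log2(8) = 3.
print(target_relevance(8, 1_000, 4, 100_000))
```

A smoothed denominator would be another way to handle words missing from the general set; it is not used here because the text does not describe one.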
Step S210, taking the target candidate word as a domain word of the target domain according to the target relevance.
Specifically, if the target relevance is greater than or equal to a second threshold, the target candidate word may be taken as a domain word of the target domain; if the target relevance is less than the second threshold, the target candidate word may be determined not to be a domain word of the target domain. Alternatively, when the target relevance is less than a preset threshold, other methods may be combined to determine whether the target candidate word is a domain word of the target domain. For example, a target candidate word whose target relevance is less than the second threshold and greater than a third threshold is displayed on a display interface of the terminal, and whether it is a domain word of the target domain is determined according to the user's selection operation on it. If the selection operation confirms that the target candidate word is a domain word of the target domain, the target candidate word can be taken as a domain word of the target domain.
In one embodiment, if there are multiple target candidate words, they may be sorted in descending order of target relevance, and the top P target candidate words are taken as domain words of the target domain. Alternatively, among the top m target candidate words, those whose target relevance is greater than a fourth threshold are taken as domain words of the target domain. P and m are integers greater than 1, and the specific values of P, m, the second threshold, the third threshold and the fourth threshold may be set as needed.
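The ranking-based selection can be sketched as follows. The function name, parameter names and the dictionary of relevance scores are hypothetical; `top_n` plays the role of P or m, and `min_relevance` plays the role of the fourth threshold.

```python
def select_domain_words(relevance, top_n=None, min_relevance=None):
    """Pick domain words from a {candidate word: target relevance} mapping."""
    # Sort candidates from the largest target relevance to the smallest.
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if min_relevance is not None:
        # Keep only candidates strictly above the threshold.
        ranked = [(w, x) for w, x in ranked if x > min_relevance]
    return [w for w, _ in ranked]
```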
With the text recognition method, the text recognition device, the computer equipment and the storage medium described above, a text to be recognized is obtained, and a target candidate word is obtained according to the characters in the text to be recognized; a general field text set and a target text set of the target field corresponding to the text to be recognized are acquired; the target importance of the target candidate word in the target text set and the reference importance of the target candidate word in the general field text set are calculated; the target relevance between the target candidate word and the target field is calculated from the target importance and the reference importance; and the target candidate word is taken as a domain word of the target field according to the target relevance. Because the target candidate word is obtained from the text to be recognized, and its importance in the text set of the target field is compared with its importance in the general field text set, the comparison reflects whether the target candidate word is related to the target field. Domain words related to the target field corresponding to the text to be recognized can therefore be obtained with high accuracy.
After a domain word is obtained, it can be stored in a word bank corresponding to the target domain. The domain words stored in the word bank can be used to judge whether a text belongs to the target domain and to adjust words in an application, for example by correcting words that a user has entered incorrectly into the corresponding domain words according to the domain of the application. In one embodiment, after the user inputs the pinyin characters corresponding to a domain word in the application, the domain word is offered as a candidate word for those pinyin characters, which improves the efficiency of word input in the application.
For example, suppose a user is communicating with the artificial intelligence customer service in a finance application and inputs the sentence "how i want to ignore in the same way", in which the domain word is entered incorrectly. The sentence can be corrected to "how i want to manage finance in the finance application", the answer sentence corresponding to the corrected sentence is then obtained, and the answer sentence is output.
In one embodiment, as shown in fig. 3A, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the method further includes:
Step S302, determining mapping characters corresponding to the domain word according to a mapping relation, wherein the mapping relation comprises at least one of a shape-proximity mapping and a sound-proximity mapping.
Specifically, the shape-proximity mapping is a mapping between characters with similar glyph structures, and the sound-proximity mapping is a mapping between characters with similar phonetic notations. The rules for deciding whether glyph structures are similar and whether phonetic notations are similar can be set as required. For the shape-proximity mapping, whether two characters are similar can be determined from a dictionary of shape-near characters. For the sound-proximity mapping, it can be set that two phonetic notations are similar if they are identical or differ by only one phonetic symbol. A mapping character may be a mapped word, the corresponding phonetic notation, or both. The mapping relation is preset, so once a domain word is obtained, its mapping characters can be obtained according to the mapping relation.
As a practical example, suppose "financing expert" is obtained as a domain word of the financing domain. If the shape-proximity mapping gives "li" as the shape-near character of the character "reason" in that word, the mapping characters corresponding to "financing expert" may include "li-financing expert". If the sound-proximity mapping gives "the same" as the sound-near character of "tong", with "ton" as the corresponding sound-near phonetic notation, the mapping characters corresponding to "financing expert" may also include "financing tong" and "licaiton".
Step S304, establishing an association relationship between the domain word and the mapping characters.
Specifically, after the mapping characters are obtained, the domain words and the corresponding mapping characters can be stored in an associated manner, and an association relationship between the domain words and the mapping characters is established. For example, an association word bank may be established, and the association relationship between the domain word and the mapping character may be stored in the word bank.
In one embodiment, the association relationship between the domain words and the mapping characters may be stored in an error correction mapping word bank, which includes a sound-near mapping table and a shape-near mapping table. FIG. 3B is a flowchart illustrating the process of establishing an association relationship between a domain word and a mapping character in one embodiment. After a set of new domain words is obtained by the new word discovery module, a preset dictionary of shape-near characters can be used to obtain the shape-near words corresponding to the domain words, and a mapping between the shape-near words and the domain words is established to obtain the shape-near mapping table, shown in Table 1. A phonetic notation module can be used to obtain the sound-near phonetic notation characters corresponding to the new domain words, and a mapping between those characters and the domain words is established to obtain the sound-near mapping table, shown in Table 2. Table 1 and Table 2 are stored in the error correction mapping word bank.
[Tables 1 and 2 appear as an image in the original publication: Figure BDA0001821849050000121]
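Building the two mapping tables can be sketched as follows. This is an illustrative sketch only: Latin letters stand in for individual characters, the shape-near dictionary and the phonetic lookup are hypothetical inputs, and substituting one character at a time is an assumption about how shape-near words are generated.

```python
def build_mapping_tables(domain_words, shape_near, phonetic):
    """Return (shape_table, sound_table), each mapping a mapping character
    string to the set of domain words it is associated with."""
    shape_table, sound_table = {}, {}
    for word in domain_words:
        # Sound-near table: phonetic string -> domain words.
        for key in phonetic.get(word, ()):
            sound_table.setdefault(key, set()).add(word)
        # Shape-near table: substitute one character at a time (assumption).
        for i, ch in enumerate(word):
            for near in shape_near.get(ch, ()):
                variant = word[:i] + near + word[i + 1:]
                shape_table.setdefault(variant, set()).add(word)
    return shape_table, sound_table
```

Storing sets of domain words as values allows one mapping character to be associated with several domain words, which a later screening step can then choose between.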
In one embodiment, the word stock may further store an association relationship between the non-domain word and the mapping character. For example, assuming "what" is a non-domain word, the mapping table may be as shown in Table 3.
TABLE 3
Phonetic notation    Word
licaitong            financing expert
shouyi               income
licai                financing
shenme               what
In an embodiment, the association relationship may be dynamically updated online, for example, a text recognition method may be performed at preset time intervals to obtain a domain word, obtain a mapping character corresponding to the domain word, and establish an association relationship between the domain word and the mapping character. In this way, new domain words can be continuously obtained.
In one embodiment, as shown in fig. 4, obtaining the target candidate word according to the characters in the text to be recognized includes:
Step S402, generating an initial candidate word set according to the adjacency relation of characters in the text to be recognized.
Specifically, the number of characters in an initial candidate word may be determined as needed, for example 2 or 3. The characters of an initial candidate word are adjacent characters in the text to be recognized. When the text to be recognized is obtained, any character combination formed by adjacent characters in the text can be used as an initial candidate word. The number of initial candidate words in the initial candidate word set may also be determined as needed. For example, all combinations of two adjacent characters in the text to be recognized may be used as initial candidate words, or the set may be obtained after removing words already determined to be domain words of the target field and invalid words such as modal particles and auxiliary verbs. In one embodiment, the character combinations generated from the adjacency of characters in the text to be recognized may be compared with the words in a dictionary and/or word bank, and only the combinations that do not exist there are used as initial candidate words. This reduces the number of initial candidate words and ensures that the words obtained are new words.
Step S404, calculating the word association degree and the word independence degree, in the target text set, of each initial candidate word in the initial candidate word set.
Specifically, the word association degree indicates how closely the characters constituting a word are bound together. An initial candidate word with a high association degree has a high probability of occurring in the application. The word independence degree indicates how likely the candidate is to stand alone as a word: the higher the word independence degree, the more likely the initial candidate word is a complete word. Both the word association degree and the word independence degree are obtained from the target text set.
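Generating initial candidate words from adjacent characters can be sketched as follows. The function and parameter names are hypothetical, and `known_words` stands for whatever dictionary or word bank is used to filter out existing words.

```python
def initial_candidates(text, sizes=(2, 3), known_words=frozenset()):
    """Collect every run of adjacent characters of the given sizes,
    skipping combinations that already exist in known_words."""
    candidates = set()
    for n in sizes:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if gram not in known_words:
                candidates.add(gram)
    return candidates
```

For the text "abc" with "ab" already in the word bank, the sketch yields the candidates "bc" and "abc".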
In one embodiment, the word association degree may be represented by the pointwise mutual information (PMI) of the initial candidate word. PMI measures the correlation between two random variables. The probability of the initial candidate word and the probability of each of its characters can be obtained from the target text set, and the PMI is computed from these probabilities. For example, for an initial candidate word "xy", the PMI may be calculated by formula (1), where P(xy) is the probability of occurrence of the initial candidate word "xy", and P(x) and P(y) are the probabilities of occurrence of "x" and "y" respectively: P(y|x) = C(xy)/C(x), P(xy) = P(x) × P(y|x), P(x) = C(x)/C(ALL), and P(y) = C(y)/C(ALL). C(xy), C(x) and C(y) are the numbers of times "xy", "x" and "y" appear in the target text set. P(y|x) is the probability that the next character is "y" given that "x" has occurred in the target text set.
[Formula (1) appears as an image in the original publication: Figure BDA0001821849050000131]
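The probabilities defined above can be combined into a PMI value as sketched below. Because formula (1) is not reproduced in this text, the sketch assumes the standard definition PMI = log2(P(xy) / (P(x) × P(y))); the base-2 logarithm and the function name are assumptions.

```python
import math


def pmi(c_xy, c_x, c_y, c_all):
    """PMI of "xy" from occurrence counts in the target text set."""
    p_x = c_x / c_all             # P(x) = C(x) / C(ALL)
    p_y = c_y / c_all             # P(y) = C(y) / C(ALL)
    p_y_given_x = c_xy / c_x      # P(y|x) = C(xy) / C(x)
    p_xy = p_x * p_y_given_x      # P(xy) = P(x) * P(y|x)
    # Assumed standard PMI definition with a base-2 logarithm.
    return math.log2(p_xy / (p_x * p_y))
```

When "x" and "y" co-occur exactly as often as independence would predict, the ratio is 1 and the PMI is 0; the more often they occur together, the larger the PMI.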
In one embodiment, the PMI may be normalized, and the normalized PMI may be used as the word association degree. In one embodiment, the normalized PMI is calculated as N_PMI = PMI/H(x) or N_PMI = PMI/H(y), where N_PMI is the normalized PMI, H(x) = -P(x) × log2 P(x), and H(y) = -P(y) × log2 P(y). Either PMI/H(x) or PMI/H(y) may be taken as the normalized PMI; for example, the smaller of the two values may be taken.
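The normalization can be sketched as follows, taking the smaller of PMI/H(x) and PMI/H(y). The sketch assumes H(x) = -P(x) × log2 P(x) and H(y) = -P(y) × log2 P(y), where the leading minus sign is an assumption that keeps H positive for probabilities strictly between 0 and 1; the function name is hypothetical.

```python
import math


def normalized_pmi(pmi_value, p_x, p_y):
    """Normalize a PMI value by H(x) and H(y) and keep the smaller result.

    Requires 0 < p_x < 1 and 0 < p_y < 1 so that H(x) and H(y) are non-zero.
    """
    h_x = -p_x * math.log2(p_x)   # assumed form of H(x)
    h_y = -p_y * math.log2(p_y)   # assumed form of H(y)
    return min(pmi_value / h_x, pmi_value / h_y)
```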
In one embodiment, the word independence degree can be determined from the entropy of the initial candidate word, which may be at least one of a left entropy and a right entropy. Entropy represents an amount of information: the left entropy represents the amount of information in the text preceding the initial candidate word, and the right entropy represents the amount of information in the text following it. The left and right entropies reflect how varied the context of the initial candidate word is. A high left entropy means the candidate is preceded by a rich variety of characters in the text, and a high right entropy means it is followed by a rich variety of characters. Rich collocation means the initial candidate word has a high degree of freedom, so it is likely to stand alone as a word. Low entropy means the candidate has few collocations and must be used together with fixed characters, so it is unlikely to stand alone as a word. The left entropy and the right entropy can be calculated by formulas (2) and (3), where E_L(W) is the left entropy of the initial candidate word W and E_R(W) is its right entropy. A is the set of characters that appear to the left of the initial candidate word in the target text set, a is a character in A, and P(aW|a) is the probability that W occurs given that a occurs in the target text set; P(Wb|W) is the probability that b occurs given that W occurs in the target text set.
[Formulas (2) and (3) appear as images in the original publication: Figures BDA0001821849050000141 and BDA0001821849050000142]
In one embodiment, the word independence degree may be determined from the sum of the left entropy and the right entropy of the initial candidate word, E = E_L(W) + E_R(W), and E may be taken as the word independence degree of the initial candidate word.
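The left and right entropies can be sketched as follows. Since formulas (2) and (3) are not reproduced in this text, the sketch assumes the common branching-entropy form: the Shannon entropy (base 2) of the characters seen immediately to the left and to the right of W, both conditioned on W occurring. The function names are hypothetical.

```python
import math
from collections import Counter


def _entropy(neighbours):
    """Shannon entropy (base 2) of a Counter of neighbouring characters."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


def left_right_entropy(word, texts):
    """Return (left entropy, right entropy) of word over a list of texts."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[text[start - 1]] += 1   # character just before the word
            if end < len(text):
                right[text[end]] += 1        # character just after the word
            start = text.find(word, start + 1)
    return _entropy(left), _entropy(right)
```

The word independence degree E is then the sum of the two returned values. A candidate that is always preceded by the same character has a left entropy of 0, which signals that it may be a fragment of a longer word.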
Step S406, calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree.
Specifically, the word generation degree is used to measure the probability of taking the initial candidate word as a newly generated word. The word association degree and the word generation degree are in positive correlation, and the word independence degree and the word generation degree are in positive correlation. In one embodiment, the corresponding word generation degree mapping value may be obtained according to the word independence degree, and the word generation degree may be obtained according to the word association degree and the word generation degree mapping value.
In one embodiment, calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree includes: determining corresponding association confidence according to the occurrence times of the initial candidate words in the target text set; determining the initial word association degree of the initial candidate words according to the occurrence probability of the initial candidate words in the target text set; and calculating the word target association degree according to the association confidence degree corresponding to the initial candidate word and the word initial association degree.
Specifically, the association confidence reflects how trustworthy the calculated word association degree is. The initial word association degree can be obtained by the PMI calculation described above. When the total number of words in the text is small, the probability of an initial candidate word can come out high, so the calculated word association degree is high as well; the association confidence can therefore be determined from the number of occurrences of the word. After the initial association degree is calculated, it is adjusted by the association confidence to obtain the target word association degree. For example, the target word association degree may be the product of the initial word association degree and the association confidence.
In one embodiment, the word generation degree may be calculated by formula (4), where U(W) is the word generation degree of the initial candidate word W, N_PMI(W) is its word association degree, and C(W) is the number of occurrences of W in the target text set. h(W) is a penalty value corresponding to the word independence degree, determined by the range in which the word independence degree falls: a range of larger values corresponds to a lower penalty value than a range of smaller values. For example, the penalty value may be set to 3 when the word independence degree is less than 1, and to 0 when the word independence degree is greater than or equal to 1.
U(W)=N_PMI(W)*log(C(W))-h(W) (4)
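Formula (4) can be sketched as follows, using the example penalty values of 3 and 0. The base of the logarithm is not stated; base 2 is an assumption here, and the function names are hypothetical.

```python
import math


def penalty(independence):
    """h(W): example penalty of 3 below an independence degree of 1, else 0."""
    return 3.0 if independence < 1 else 0.0


def word_generation_degree(n_pmi, count, independence):
    """U(W) = N_PMI(W) * log(C(W)) - h(W), with a base-2 log (assumption).

    count is C(W) and must be at least 1.
    """
    return n_pmi * math.log2(count) - penalty(independence)
```

With N_PMI(W) = 2 and C(W) = 8, the sketch gives U(W) = 6 when the word independence degree is 1.5 and U(W) = 3 when it is 0.5.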
Step S408, screening the initial candidate word set according to the word generation degree of each initial candidate word to obtain the target candidate words.
In one embodiment, initial candidate words whose word generation degree is greater than a fifth threshold may be taken as target candidate words; or the initial candidate words may be sorted in descending order of word generation degree and the first d taken as target candidate words; or, among the first e initial candidate words, those whose word independence degree is greater than a sixth threshold may be taken as target candidate words. The specific values of d, e, the fifth threshold and the sixth threshold can be set as required.
In one embodiment, a text to be recognized is obtained from a target text set, a target candidate word is obtained through screening of word association degree and word independence degree, a new word newly generated in the target text set can be obtained, and whether the new word is a field word is judged, so that the new field word can be obtained according to the target text set.
In one embodiment, the text recognition method further comprises the steps of: when the word independence degree corresponding to the initial candidate word is smaller than a first threshold value, forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized; and adding the new initial candidate word into the initial candidate word set.
Specifically, a small word independence degree means that the initial candidate word is unlikely to stand alone as a word, and more likely to form an independent word in combination with other characters. A first threshold may therefore be set and compared with the word independence degree of the initial candidate word. If the word independence degree is smaller than the first threshold, a character adjacent to the initial candidate word in the text to be recognized is obtained, and the adjacent character and the initial candidate word form a new initial candidate word. The new initial candidate word is added to the initial candidate word set, its word association degree and word independence degree are calculated, and its word generation degree is calculated from them, so that target candidate words can be screened from the initial candidate word set. In the embodiment of the invention, when the word independence degree is smaller than the first threshold, characters adjacent to the initial candidate word continue to be acquired to form new initial candidate words, so that more domain words can be obtained, and more accurately.
In one embodiment, when calculating the word independence degree and/or the word association degree of the new initial candidate word, the initial candidate word before adding the adjacent character may be calculated as a whole. For example, assuming that the initial candidate word before adding the adjacent character is "ab", the degree of independence of the word corresponding to "ab" is 1, and the first threshold is 2, since the degree of independence of the word corresponding to "ab" is smaller than the first threshold, the character "c" adjacent to "ab" can be added to "ab" to form a new initial candidate word "abc", and when calculating the degree of association of the word of "abc", the character "ab" is taken as a whole, that is, one character. Therefore, when the inter-point mutual information PMI corresponding to "abc" is calculated, "ab" may be regarded as "x" of the formula (1) and "c" may be regarded as "y" of the formula (1).
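Extending low-independence candidates with adjacent characters can be sketched as follows. The function name is hypothetical, `independence` is any callable returning the word independence degree of a candidate, and extending on both the left and the right side is an assumption (the "abc" example only shows a character added on the right).

```python
def extend_candidates(candidates, text, independence, first_threshold):
    """Add new candidates formed from low-independence candidates and
    the characters adjacent to them in the text."""
    extended = set(candidates)
    for word in candidates:
        if independence(word) >= first_threshold:
            continue  # independent enough, no extension needed
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                extended.add(text[start - 1] + word)   # left neighbour
            if end < len(text):
                extended.add(word + text[end])         # right neighbour
            start = text.find(word, start + 1)
    return extended
```

In a full pipeline, the association degree of each new candidate would then be computed with the original candidate treated as a single unit, as in the "ab" plus "c" example.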
As shown in fig. 5, in an embodiment, a text processing method is proposed, and this embodiment is mainly illustrated by applying the method to the terminal 110 or the server 120 in fig. 1. The method specifically comprises the following steps:
Step S502, obtaining an initial input text.
Specifically, the initial input text is a text that needs text processing so that the words in it are corrected and a correct target input text is obtained. The initial input text may be text already published in the application, or text still being entered, such as text typed into an input box of the application. It is understood that when a particular web page is opened in a browser, the web page may be regarded as a page of the application, that is, the web version of the application. For example, if a user posts a comment in a forum corresponding to the application, the comment may be taken as the initial input text. If the query sentence "I want to ignore in the same channel" is entered in the customer service session interface of the "financing expert" web page, that sentence is taken as the initial input text. Here, "financing expert" is the name of a financing application.
Step S504, acquiring an association relation corresponding to a target field corresponding to the initial input text, wherein the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized corresponding to the target field, a general field text set and a target text set corresponding to the target field.
Specifically, the target domain corresponding to the initial input text may be determined according to the source of the initial input text, and may be, for example, a domain corresponding to an application to which the initial input text belongs. For example, if the initial input text is obtained in a medical APP (Application), the target domain is the medical domain. The target domain corresponding to the initial input text can also be obtained according to the input domain information. When the text in a certain field needs to be processed, the text to be processed input by the user and the field corresponding to the text to be processed can be received.
The association relationship corresponding to the target field, that is, the association between the domain words of the target field and their mapping characters, is preset. When a text needs to be adjusted, this association relationship can be acquired, the domain words corresponding to the initial input text can be obtained from it, and the initial input text can be corrected. The association relationship between the domain words and the mapping characters may be at least one of a shape-proximity association and a sound-proximity association. The domain words are recognized according to the text to be recognized, the general field text set and the target text set corresponding to the target field: target candidate words can be obtained from the adjacency relation of characters in the text to be recognized of the application, and a target candidate word is determined to be a domain word of the target field according to its importance in the general field text set and in the target text set. The domain words may be recognized by the text recognition method of the foregoing embodiments, and details are not repeated here.
Step S506, determining a target domain word corresponding to the initial input text according to the initial input text and the association relationship.
Specifically, the characters in the initial input text may be obtained and matched against the mapping characters, and the domain word corresponding to a matched mapping character is taken as a target domain word. In one embodiment, when the initial input text is obtained, it may first be segmented into a word sequence, each word in the word sequence is matched against the mapping characters, and the domain word corresponding to each matched mapping character is taken as a target domain word.
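Matching a segmented input against the mapping characters can be sketched as follows. The function name and the shape of the `association` dictionary (mapping character string to a list of domain words) are hypothetical.

```python
def match_domain_words(words, association):
    """Return {position in the word sequence: candidate target domain words}
    for every word that matches a mapping character."""
    matches = {}
    for position, word in enumerate(words):
        if word in association:
            matches[position] = association[word]
    return matches
```

Keeping the position of each match makes it straightforward to replace the corresponding word of the initial input text afterwards.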
Step S508, adjusting the initial input text according to the target domain word to obtain a target input text.
Specifically, after the target domain word is obtained, it may replace the corresponding characters in the initial input text to obtain the target input text. When there are multiple target domain words and/or other non-domain candidate words, they may be screened to obtain the domain word used to adjust the initial input text. The screening may use, for example, an n-gram model such as a 2-gram or 3-gram model.
According to the text processing method, the domain words corresponding to a text input in an application can be determined from the association between the domain words of the application's target field and the mapping characters, and the initial input text is adjusted according to those domain words to obtain the target input text. Because the domain words are words related to the target field, recognized from the text to be recognized, the general field text set and the target text set, the target input text obtained by adjustment is highly accurate for the specific field and is well targeted and adapted to the field of the application. Furthermore, because the domain words of the target field are distinguished, the number of words in the error correction association word list can be reduced, which improves the efficiency of word processing.
In an embodiment, after the target input text is obtained, the target text set may be updated by using the target input text, and the target input text is used as a text in the target text set.
In one embodiment, as shown in fig. 6, the step S508 of adjusting the initial input text according to the target domain word to obtain the target input text includes:
Step S602, obtaining each candidate input word corresponding to the initial input text.
Specifically, the initial input text may include one or more words. The initial input text may be segmented into a word sequence, and the candidate input words corresponding to each word are obtained as the candidate input words corresponding to the initial input text. A word association relationship may be preset, and the corresponding candidate input words are obtained according to it. For example, "what" may be set as the associated word of "how to examine". Then, if the initial input text includes "how to examine", the corresponding candidate input word "what" can be obtained.
For example, if the initial input text needs to be corrected, a spelling correction association word bank may be set up that stores the association between input words and domain words, as shown in Table 1 and Table 2. When the initial input text is a phonetic notation such as pinyin, the corresponding domain word can be obtained from the pinyin. In one embodiment, when the initial input text includes a word, the word may be converted into its phonetic notation, and the domain word corresponding to that phonetic notation is then obtained. For example, if the initial input text includes "talent ton", the pinyin corresponding to "talent ton" may be obtained as "licaiton", and the target domain word corresponding to "licaiton" may be obtained as "financing expert" based on Table 2.
In one embodiment, the initial input text may be segmented to obtain corresponding word sequences, and then, the shape-similar words corresponding to each word in the word sequences are obtained as candidate words corresponding to the initial input text. Or obtaining a pinyin sequence corresponding to each word sequence, and then obtaining a candidate word corresponding to each pinyin in the pinyin sequence as a candidate word corresponding to the initial input text. It can be understood that both the shape word and the candidate word corresponding to the pinyin can be used as candidate words corresponding to the initial input text.
In one embodiment, if the initial input text is to be corrected, error position detection may be performed on it, and candidate input words are obtained for the words at the error positions. Error position detection may use an artificial intelligence machine learning model. In one embodiment, error position detection includes calculating the transition probability between each pair of adjacent words in the initial input text and taking positions where the transition probability is lower than a preset value as error positions. The transition probability between adjacent words can be expressed as P(G|F) = C(FG)/C(F), where "F" and "G" are adjacent words in the initial input text with "F" preceding "G", C(FG) is the number of occurrences of "FG" in a preset text set, and C(F) is the number of occurrences of "F" in the preset text set. The preset text set is a text set corresponding to the target field, for example the target text set.
In one embodiment, in the application scenario of a target domain, error position detection is less effective than in the general domain, because some text that is unproblematic in the general domain still needs correction in the target domain. For example, "my handcraft" is unproblematic text in the general domain, but the correct text in the financial domain should be "my income"; "skeleton" is an unproblematic word in the general domain, but the correct word in the financial domain should be "stock price". Therefore, when error detection is performed on the initial input text of an application of the target domain, every position of the initial input text can be regarded as an error position.
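Error position detection based on the transition probability P(G|F) = C(FG)/C(F) can be sketched as follows. The function name is hypothetical, the corpus is assumed to be a list of already segmented sentences, and reporting the index of the second word "G" as the error position is an assumption.

```python
from collections import Counter


def error_positions(words, corpus, min_prob):
    """Flag positions whose transition probability from the previous word
    is lower than min_prob."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    positions = []
    for i in range(len(words) - 1):
        f, g = words[i], words[i + 1]
        # P(G|F) = C(FG) / C(F); unseen F is treated as probability 0.
        prob = bigrams[(f, g)] / unigrams[f] if unigrams[f] else 0.0
        if prob < min_prob:
            positions.append(i + 1)  # index of G (assumption)
    return positions
```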
Step S604, a word relation chain set is constructed according to the composition relation of the words of the initial input text, the candidate input words and the target field words.
Specifically, the word relation chain set includes one or more word relation chains. A word relation chain is a chain formed by connecting words in sequence. The composition relationship of the words refers to the order and connection relationship between the words in the text. For a given segmentation, the composition relationship of the words of the initial input text is fixed. For example, if the initial input text is "today is friday", the segmented words are "today", "is" and "friday", and they are connected in the order "today", "is", "friday". After the candidate input words and the target field words are obtained, they need to be connected in sequence according to the composition relationship of the words of the initial input text to obtain the corresponding word relation chains. Because the initial input text may be segmented in one or more ways, and each segmented word may correspond to one or more candidate input words, there may be one or more word relation chains.
As shown in fig. 7, a method of obtaining word relation chains is described below, taking homophone correction of an initial input text whose pinyin sequence is "li, cai, tong, shi, shen, me, chan, pin" as an example. First, the pinyin corresponding to each character of the initial input text is obtained, giving the pinyin sequence above. The pinyin sequence is then split by a pinyin splitting algorithm, giving one split consisting of "li cai tong", "shi", "shen me" and "chan pin", and another consisting of "li cai", "tong shi", "shen me" and "chan pin". Candidate input words are then obtained according to a comparison table from pinyin to candidate words, which may include both the pinyin corresponding to field words and the pinyin corresponding to non-field words. For example, among the candidate words in fig. 7, "financing" (li cai) and "Licaitong" (li cai tong) are field words, and the other candidate words are non-field words. After the candidate input words are obtained, the word relation chains are constructed according to the composition relationship of the words of the initial input text. In fig. 7, there are four word relation chains: "financing → simultaneously → what → product", "financing → colleague → what → product", "Licaitong → is → what → product" and "Licaitong → time → what → product".
It will be appreciated that the above homophone correction is only an example. In practical applications, shape-similar correction may instead be performed on the initial input text, or both homophone correction and shape-similar correction may be performed on the initial input text.
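The construction of the word relation chain set for the fig. 7 example can be sketched as below. The pinyin-to-candidate table, the English glosses standing in for the Chinese candidate words, and the function name are all assumptions made for illustration; the sketch simply expands every pinyin split into all combinations of its candidate words.

```python
from itertools import product

# Hypothetical pinyin -> candidate word table (English glosses replace the Chinese words).
CANDIDATES = {
    "li cai": ["financing"],
    "li cai tong": ["Licaitong"],
    "tong shi": ["simultaneously", "colleague"],
    "shi": ["is", "time"],
    "shen me": ["what"],
    "chan pin": ["product"],
}


def build_chains(segmentations, table):
    """Expand every pinyin split into all word relation chains it allows."""
    chains = []
    for segmentation in segmentations:
        # One list of candidate words per pinyin unit; an unknown unit yields no chain.
        options = [table.get(unit, []) for unit in segmentation]
        chains.extend(list(chain) for chain in product(*options))
    return chains


segmentations = [
    ["li cai", "tong shi", "shen me", "chan pin"],
    ["li cai tong", "shi", "shen me", "chan pin"],
]
chains = build_chains(segmentations, CANDIDATES)
for chain in chains:
    print(" → ".join(chain))
```

Enumerating every chain is only practical for short inputs, since the number of chains grows multiplicatively with the number of candidates per position.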
Step S606, calculating the transition probability of the forward word to the current word in the word relation chain.
Specifically, a forward word is a word that precedes the current word in the word relation chain. The forward words may be all preceding words or a preset number of preceding words, for example 1 or 2, as determined by the language model used. The transition probability represents the probability that the current word occurs given specific forward words. When the word relation chain is regarded as a hidden Markov state chain, the transition probability represents the probability of transition from the forward state to the current state. The transition probability may be written as P(J|I), the probability that the current word J occurs given the forward word I. A word relation chain comprises a plurality of words; each word of the chain is taken in turn as the current word, and the corresponding transition probability is calculated. For example, if a 2-gram model is used, the transition probability from the preceding 1 word to the current word is calculated; if a 3-gram model is used, the transition probability from the preceding 2 words to the current word is calculated.
When the transition probability is calculated, the combination of the forward word and the current word may be taken as a whole. A first number, the number of occurrences of the combination in the text set corresponding to the target field, and a second number, the number of occurrences of the forward word in that text set, are obtained, and the transition probability is obtained from the first number and the second number, for example P(J|I) = Count(IJ)/Count(I), where Count(IJ) is the number of occurrences of IJ in the text set corresponding to the target field and Count(I) is the number of occurrences of I in that text set. The text set corresponding to the target field may be the same as or different from the target text set; that is, there may be a plurality of text sets corresponding to the target field.
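A minimal sketch of P(J|I) = Count(IJ)/Count(I) for both the 2-gram and the 3-gram case follows. The toy corpus and the helper names are assumptions; returning 0 for an unseen forward word is also a choice of this sketch (a practical system would more likely apply smoothing).

```python
from collections import Counter


def count_ngrams(corpus, n):
    """Count every run of n consecutive words in a tokenized corpus."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts


def transition_probability(forward, current, context_counts, joint_counts):
    """P(current | forward) = Count(forward + current) / Count(forward)."""
    context = tuple(forward)
    denominator = context_counts[context]
    if denominator == 0:
        return 0.0  # unseen forward words: probability 0 in this sketch
    return joint_counts[context + (current,)] / denominator


# Hypothetical text set corresponding to the target field.
corpus = [
    ["Licaitong", "is", "what", "product"],
    ["Licaitong", "is", "safe"],
    ["what", "product", "is", "safe"],
]
uni, bi, tri = (count_ngrams(corpus, n) for n in (1, 2, 3))

p_bigram = transition_probability(["Licaitong"], "is", uni, bi)           # 2-gram model
p_trigram = transition_probability(["Licaitong", "is"], "what", bi, tri)  # 3-gram model
```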
Step S608, obtaining the connection strength of the term relation chain according to each transition probability corresponding to the term relation chain.
Specifically, the connection strength of a word relation chain indicates how likely the words in the chain are to combine into a sentence: the higher the connection strength, the higher the likelihood. The connection strength can be obtained by combining the transition probabilities. For example, for a word relation chain "A → B → C → D", the strength of the chain is P(ABCD), which can be calculated as shown in formula (5), where P(A) represents the probability that A is the first word of a sentence, P(B|A) is the transition probability from the forward word A to the current word B, P(C|B) is the transition probability from the forward word B to the current word C, P(D|C) is the transition probability from the forward word C to the current word D, and P(D) represents the probability that D is the last word of a sentence.
P(ABCD) = P(A) * P(B|A) * P(C|B) * P(D|C) * P(D)    (5)
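Formula (5) can be evaluated directly once the individual probabilities are available. The sketch below assumes they are supplied as plain dictionaries with made-up values; the function and parameter names are illustrative only.

```python
def connection_strength(chain, p_first, p_transition, p_last):
    """Formula (5): P(first word) * product of transition probabilities * P(last word)."""
    strength = p_first.get(chain[0], 0.0)
    for forward, current in zip(chain, chain[1:]):
        strength *= p_transition.get((forward, current), 0.0)
    return strength * p_last.get(chain[-1], 0.0)


# Made-up probabilities for the chain "A → B → C → D".
p_first = {"A": 0.5}
p_transition = {("A", "B"): 0.5, ("B", "C"): 0.25, ("C", "D"): 0.5}
p_last = {"D": 0.5}

strength = connection_strength(["A", "B", "C", "D"], p_first, p_transition, p_last)
print(strength)  # 0.5 * 0.5 * 0.25 * 0.5 * 0.5
```

For long chains the product becomes very small, so an implementation may prefer to sum logarithms of the probabilities instead; that is a design option, not something stated in this description.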
In one embodiment, the probabilities are calculated from word occurrence counts. For example, P(A) may be equal to C(A)/C(ALL), or may be equal to C(A')/C(ALL), where C(A) is the number of occurrences of A in the text set corresponding to the target field, C(A') is the number of times A occurs as the first word of a sentence in that text set, and C(ALL) is the number of words in that text set. The occurrence counts can be stored in advance and updated dynamically online as the text set corresponding to the target field changes, so that the connection strength of a relation chain tracks those changes. For example, new texts added in the application corresponding to the target field may be obtained at preset time intervals, and the number of occurrences of each word and the total number of words in the new texts may be calculated, so as to update the word transition probabilities in the n-gram model. Thus, when a new application is put into use, the initial input text in the application can be adjusted more and more accurately as text accumulates in the application.
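The online updating of occurrence counts can be sketched as a small counter-based 2-gram model whose probabilities change whenever new texts are merged in. The class name, method names and sample sentences are assumptions made for illustration; scheduling the update at preset intervals is left out.

```python
from collections import Counter


class OnlineBigramModel:
    """Keeps raw occurrence counts so probabilities follow the growing text set."""

    def __init__(self):
        self.unigrams = Counter()     # C(A)
        self.bigrams = Counter()      # C(AB)
        self.first_words = Counter()  # C(A'): A as the first word of a sentence
        self.total_words = 0          # C(ALL)

    def update(self, sentences):
        """Merge newly collected, already segmented sentences into the counts."""
        for words in sentences:
            if not words:
                continue
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            self.first_words[words[0]] += 1
            self.total_words += len(words)

    def p_first(self, word):
        """P(A) computed as C(A') / C(ALL)."""
        return self.first_words[word] / self.total_words if self.total_words else 0.0

    def p_transition(self, forward, current):
        """P(current | forward) = C(forward current) / C(forward)."""
        count = self.unigrams[forward]
        return self.bigrams[(forward, current)] / count if count else 0.0


model = OnlineBigramModel()
model.update([["stock", "price", "rises"]])
before = model.p_transition("stock", "price")
model.update([["stock", "market", "falls"]])  # texts collected in a later interval
after = model.p_transition("stock", "price")
```

Storing counts rather than probabilities keeps each update a cheap addition, and the probabilities are derived on demand.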
Step S610, screening the word relation chain set according to the connection strength of the word relation chains to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as the target input text.
Specifically, after the connection strength of each word relation chain is obtained, the word relation chain with the maximum connection strength may be selected as the target word relation chain, and the text corresponding to the target word relation chain is used as the target input text. Of course, a plurality of word relation chains may also be selected as target word relation chains, for example the top z word relation chains when the chains are ranked by connection strength from largest to smallest, where z is an integer greater than 1 whose value can be set as required, for example 3.
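Selecting the strongest chain, or the top z chains, can be sketched with the standard library as follows. The scored chains and their strengths are made-up values, and the function name is an assumption.

```python
import heapq


def select_target_chains(scored_chains, z=1):
    """Return the z chains with the largest connection strength, strongest first."""
    return heapq.nlargest(z, scored_chains, key=lambda item: item[1])


# Hypothetical (chain, connection strength) pairs.
scored = [
    (("financing", "colleague", "what", "product"), 0.003),
    (("Licaitong", "is", "what", "product"), 0.08),
    (("Licaitong", "time", "what", "product"), 0.001),
]
best = select_target_chains(scored, z=1)
top_two = select_target_chains(scored, z=2)
```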
In one embodiment, when the connection strength is calculated, the connection strength of every word relation chain in the word relation chain set may be calculated, or the connection strength of only some of the word relation chains may be calculated, for example by using the Viterbi algorithm.
In one embodiment, if a 2-gram model is used, the Viterbi algorithm may be used to obtain the target word relation chain. The Viterbi algorithm relies on the following property: when moving from state i to state i+1, if the shortest path from the starting point S to each node of state i has already been found, then to calculate the shortest path from S to a node Xi+1 of state i+1, only the shortest paths from S to the k nodes of state i and the distances from those k nodes to Xi+1 need to be considered. In the embodiment of the invention, if the Viterbi algorithm is adopted, the words of the word relation chains serve as the states, the transition probabilities serve as the weights of the paths, and the objective is to obtain the maximum connection strength, which is calculated from the transition probabilities corresponding to the word relation chains. Accordingly, when the connection strength is calculated, the maximum connection strength from the starting point of the relation chains to each previous node is calculated first, and then the transition probability from each previous node to the current node is calculated. The maximum connection strength of each previous node is multiplied by the corresponding transition probability to obtain candidate connection strengths from the starting point to the current node, and the largest of these is kept as the maximum connection strength of the current node. If a next node exists after the current node, the next node is taken as the current node and the calculation is repeated, until the last node of the word relation chains is reached.
Taking the word relation chains in fig. 7 as an example, the target relation chain is obtained with the Viterbi algorithm as follows. In fig. 8, S0 and S1 denote the probabilities corresponding to "financing" (li cai) and "Licaitong" (li cai tong), respectively. The letter above each horizontal line "-" in a word relation chain denotes the transition probability from the word before the line to the word after it; for example, W1 denotes the transition probability from "financing" to "colleague". The maximum connection strengths from the starting point of the relation chains to the second nodes "simultaneously", "colleague", "is" and "time" are then S0*W0, S0*W1, S1*W4 and S1*W5, respectively. Therefore, when the maximum connection strength from the starting point to the third node "what" is calculated, S0*W0 is multiplied by W2, S0*W1 by W3, S1*W4 by W6, and S1*W5 by W7. Assuming that the largest of these is S1*W4*W6, the optimal path from the starting point to the "what" node is "Licaitong → is → what". Because the last node of every relation chain is "product", the maximum connection strength is S1*W4*W6*W8, the target word relation chain is "Licaitong → is → what → product", and the target input text is "what product is Licaitong".
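The fig. 8 computation can be sketched as a Viterbi pass over the word lattice. The numeric values assigned to S0, S1 and W0 to W8 below are invented so that the outcome matches the assumption made in the example; the English node labels, function names and the dictionary-based graph representation are likewise assumptions of this sketch.

```python
def viterbi(start_probs, edges, order):
    """Return the maximum connection strength from the chain start to every node,
    together with the predecessor that achieves it. `order` must be topological."""
    predecessors = {}
    for (prev, node), weight in edges.items():
        predecessors.setdefault(node, []).append((prev, weight))
    best = dict(start_probs)
    back = {node: None for node in start_probs}
    for node in order:
        for prev, weight in predecessors.get(node, []):
            if prev not in best:
                continue
            strength = best[prev] * weight
            if strength > best.get(node, -1.0):
                best[node], back[node] = strength, prev
    return best, back


def backtrack(back, end):
    """Follow predecessor links from the last node back to the start."""
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return path[::-1]


start_probs = {"financing": 0.6, "Licaitong": 0.4}  # S0, S1 (made-up values)
edges = {
    ("financing", "simultaneously"): 0.1,   # W0
    ("financing", "colleague"): 0.1,        # W1
    ("simultaneously", "what"): 0.05,       # W2
    ("colleague", "what"): 0.1,             # W3
    ("Licaitong", "is"): 0.8,               # W4
    ("Licaitong", "time"): 0.05,            # W5
    ("is", "what"): 0.5,                    # W6
    ("time", "what"): 0.1,                  # W7
    ("what", "product"): 0.5,               # W8
}
order = ["financing", "Licaitong", "simultaneously", "colleague",
         "is", "time", "what", "product"]

best, back = viterbi(start_probs, edges, order)
target_chain = backtrack(back, "product")
```

Only one strength per node is kept, so the four chains are never enumerated separately; with these values the recovered chain is the one through "Licaitong" and "is".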
In one embodiment, the target word relation chain can also be calculated using a 3-gram or higher-order n-gram model. In that case, in order to reduce the number of connection strength calculations, when the connection strength from the starting point S to a node Xi+1 of the (i+1)-th state is calculated, the top g connection strengths of the previous state, i.e., the i-th state, can be obtained, and the connection strength corresponding to the (i+1)-th state is then calculated from those top g connection strengths and the transition probability from the i-th state to the (i+1)-th state. The value of g can be set as required.
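One possible reading of keeping the top g connection strengths is a beam search, sketched below for a 3-gram model. This is an interpretation rather than the procedure stated above; the layered candidate lists, the toy trigram table and all names are assumptions.

```python
def beam_search(layers, score, g=2):
    """Extend partial chains layer by layer, keeping only the g strongest each time."""
    beam = [((), 1.0)]
    for candidates in layers:
        expanded = [
            # A 3-gram model conditions on at most the 2 preceding words.
            (history + (word,), strength * score(history[-2:], word))
            for history, strength in beam
            for word in candidates
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:g]
    return beam


# Hypothetical trigram table: (forward words, current word) -> probability.
TRIGRAM = {
    ((), "financing"): 0.6,
    ((), "Licaitong"): 0.4,
    (("financing",), "is"): 0.1,
    (("financing",), "time"): 0.1,
    (("Licaitong",), "is"): 0.8,
    (("Licaitong",), "time"): 0.05,
    (("Licaitong", "is"), "what"): 0.5,
}


def score(forward, word):
    return TRIGRAM.get((forward, word), 0.0)


layers = [["financing", "Licaitong"], ["is", "time"], ["what"]]
beam = beam_search(layers, score, g=2)
best_chain, best_strength = beam[0]
```

A smaller g means fewer strength calculations but a higher chance of discarding the chain that would have been strongest overall.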
In one embodiment, the step of obtaining the initial input text may comprise: acquiring a query sentence input in an application, and taking the query sentence as an initial input text; the text processing method may further include the steps of: acquiring a query request, wherein the query request comprises a target input text corresponding to a query statement; and acquiring query response data obtained according to the target input text.
Specifically, the query statement may be input on a query interface corresponding to the application, for example in an input box of a session interface corresponding to a customer service consultation in the application. The input mode can be voice, text, or the like. If the query is input by voice, the voice can be recognized to obtain the query statement. After the initial input text is obtained, the text processing method provided by the embodiment of the invention is executed to obtain the target input text. The query response data is an answer sentence corresponding to the target input text, and may be preset. For example, a product introduction text that introduces a financing product may be provided, and when the target input text is obtained, the corresponding product introduction text is acquired as the query response data. The query request can be triggered by a user operation received after the target input text is obtained, or can be triggered automatically by the server.
In one embodiment, the query request may be triggered by a user operation received after the target input text is obtained. For example, as shown in fig. 9, when the user inputs "money and money is a trial product" in the input box, the terminal or the server may execute the text processing method provided in the embodiment of the present invention to obtain the target input text "what product the money and money is". After obtaining the target input text, the terminal displays it above the input box. If a selection operation of the user on "what product the money and money is" is received, the terminal sends a query request to the server; the server receives the query request, obtains the corresponding query response data, and returns the query response data to the terminal, and the terminal displays the query response data.
In one embodiment, the query request may be automatically triggered by the server. For example, after receiving the initial input text, the terminal sends the initial input text to the server, and the server executes the text processing method provided by the embodiment of the invention to obtain the target input text, then triggers the query request, and obtains the corresponding query response data according to the target input text.
In one embodiment, as shown in fig. 10, the text processing method may further include:
Step S1002, a target type corresponding to the target input text is detected.
Specifically, the type corresponding to the target input text is obtained from candidate types, which may be set as needed. For example, the candidate types may include a normal type and an abnormal type, or an advertisement type, a non-advertisement type, and the like. After the target input text is obtained, whether the words in the target input text include preset words can be detected, and if so, the type corresponding to the preset words is taken as the target type. Alternatively, the target input text may be input into a pre-trained type discrimination artificial intelligence machine learning model to obtain the corresponding target type. For example, assume that the initial input text is "bone price prediction is accurate and benefit is high, please add wesson 123456789", and the corresponding target input text is "share price prediction is accurate and benefit is high, please add wesson 123456789". If the type were detected from the initial input text, the text might be wrongly judged as the non-advertisement type; if detection is performed on the target input text, the advertisement type is detected correctly.
Step S1004, when the type corresponding to the target input text is a preset type, filtering the initial input text.
Specifically, the filtering may be hiding the initial input text on the display interface corresponding to the initial input text, or deleting the initial input text in the application, and may be set as needed.
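Steps S1002 and S1004 can be sketched with the preset-word variant of type detection. The preset word list, the type names, and the `toy_correct` stand-in for the text processing method are all assumptions of this sketch; the point it illustrates is that the type is detected on the corrected text while the filtering is applied to the initial text.

```python
# Hypothetical preset words per type.
PRESET_WORDS = {"advertisement": ["share price prediction"]}


def detect_type(text, preset_words, default="normal"):
    """Return the first type whose preset words occur in the text, else the default."""
    for type_name, words in preset_words.items():
        if any(word in text for word in words):
            return type_name
    return default


def filter_initial_texts(initial_texts, correct, preset_words,
                         blocked_types=("advertisement",)):
    """Keep only the initial texts whose corrected form is not of a blocked type."""
    return [
        text for text in initial_texts
        if detect_type(correct(text), preset_words) not in blocked_types
    ]


def toy_correct(text):
    """Stand-in for the text processing method: fixes one known homophone error."""
    return text.replace("bone price", "share price")


initial = "bone price prediction is accurate and benefit is high, please add wesson 123456789"
target = toy_correct(initial)

type_from_initial = detect_type(initial, PRESET_WORDS)
type_from_target = detect_type(target, PRESET_WORDS)
kept = filter_initial_texts([initial, "what product is this"], toy_correct, PRESET_WORDS)
```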
As shown in fig. 11, the following describes a text processing method according to an embodiment of the present invention by taking error correction of an initial input text as an example.
1. The terminal receives an initial input text input by a user through a customer service session interface in the application and sends the initial input text to the server.
2. The server performs error position detection on the initial input text to obtain an error position set. Because a sentence that contains no error in the general field may still be wrong in the target field corresponding to the application, all positions may be taken as error positions.
3. The server acquires, according to the word association word bank, the candidate input words corresponding to the input word at each error position, to obtain a candidate input word set. The field words of the word association word bank can be obtained through a field word recognition module. The field word recognition module can perform text recognition at preset time intervals to obtain new field words, obtain the mapping characters corresponding to the new field words, and store the new field words and the corresponding mapping characters in the word bank in association with each other. The field word recognition module thus supports online word updating, so that the field words in the word association word bank are updated as the content of the application grows.
4. After obtaining the candidate input word set, the server constructs word relation chains according to the word composition relationship of the initial input text.
5. The server uses an n-gram model to screen the best result from the word relation chains, taking the target word relation chain with the maximum calculated connection strength as the best result. The words of the target word relation chain form the target input text, which is the final error correction result. The occurrence count of each word in the n-gram model can be updated as the text in the application changes, so that the n-gram model is updated online.
6. The server queries the corresponding answer sentence according to the target input text and returns the answer sentence to the terminal.
7. The terminal displays the answer sentence on the customer service session interface.
As shown in fig. 12, in an embodiment, a text recognition apparatus is provided, which may be integrated in the server 120 and the terminal 110, and specifically may include a target candidate word obtaining module 1202, a set obtaining module 1204, an importance calculating module 1206, a relevance obtaining module 1208, and a domain word obtaining module 1210.
A target candidate word obtaining module 1202, configured to obtain a text to be recognized, and obtain a target candidate word according to characters in the text to be recognized;
a set obtaining module 1204, configured to obtain a general field text set and a target text set of a target field corresponding to a text to be recognized;
the importance calculating module 1206 is used for calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set;
a relevancy obtaining module 1208, configured to calculate a target relevancy between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance;
and a domain word obtaining module 1210, configured to use the target candidate word as a domain word of the target domain according to the target relevance.
In one embodiment, as shown in fig. 13, the text recognition apparatus further includes:
the mapping character determining module 1302 is configured to determine a mapping character corresponding to the domain word according to a mapping relationship, where the mapping relationship includes at least one of a shape-near mapping and a sound-near mapping;
and an association relationship establishing module 1304, configured to establish an association relationship between the field word and the mapping character.
In one embodiment, the target candidate word obtaining module 1202 is configured to: generate an initial candidate word set according to the adjacency relationship of characters in the text to be recognized; calculate the word association degree and the word independence degree, in the target text set, of each initial candidate word in the initial candidate word set; calculate the word generation degree of each initial candidate word according to the word association degree and the word independence degree; and screen the initial candidate word set according to the word generation degree of each initial candidate word to obtain the target candidate words.
In one embodiment, the text recognition apparatus further comprises: the word forming module is used for forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized when the word independence degree corresponding to the initial candidate word is smaller than a first threshold; and the adding module is used for adding the new initial candidate words into the initial candidate word set.
In one embodiment, the relevancy obtaining module 1208 is configured to: calculate an initial relevance between the target candidate word and the target field according to the target importance and the reference importance corresponding to the target candidate word; determine a corresponding relevance confidence according to the number of occurrences of the target candidate word in the target text set; and obtain the target relevance according to the initial relevance and the relevance confidence.
As shown in fig. 14, in an embodiment, a text processing apparatus is provided, which may be integrated in the server 120 and the terminal 110, and specifically may include an initial input text obtaining module 1402, an association obtaining module 1404, a target domain word obtaining module 1406, and a target input text obtaining module 1408.
An initial input text acquisition module 1402, configured to acquire an initial input text input by an application;
an association relation obtaining module 1404, configured to obtain an association relation corresponding to a target field of the application, where the association relation is an association relation between a field word and a mapping character, and the field word is obtained by identifying according to a to-be-identified text of the application, a general field text set, and a target text set corresponding to the target field;
a target domain word obtaining module 1406, configured to determine a target domain word corresponding to the initial input text according to the initial input text and the association relationship;
and a target input text obtaining module 1408, configured to adjust the initial input text according to the target domain word to obtain a target input text.
In one embodiment, the target input text obtaining module 1408 is configured to: acquire each candidate input word corresponding to the initial input text; construct a word relation chain set according to the composition relationship of the words of the initial input text, the candidate input words and the target field words; calculate the transition probability from the forward word to the current word in each word relation chain; obtain the connection strength of each word relation chain according to the transition probabilities corresponding to that chain; and screen the word relation chain set according to the connection strength of the word relation chains to obtain a target word relation chain, and take the text corresponding to the target word relation chain as the target input text.
In one embodiment, the initial input text acquisition module 1402 is configured to: acquire a query sentence input in the application, and take the query sentence as the initial input text;
the text processing apparatus further includes: the query request module is used for acquiring a query request, and the query request comprises a target input text corresponding to a query statement; and the query response data acquisition module is used for acquiring query response data obtained according to the target input text.
In one embodiment, the text processing apparatus further includes: the target type acquisition module is used for detecting a target type corresponding to the target input text; and the filtering module is used for filtering the initial input text when the target type corresponding to the target input text is a preset type.
FIG. 15 is a diagram that illustrates an internal structure of the computer device in one embodiment. The computer device may specifically be the terminal 110 in fig. 1. As shown in fig. 15, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may further store a computer program that, when executed by the processor, causes the processor to implement at least one of a text recognition method and a text processing method. The internal memory may also have a computer program stored thereon, which, when executed by the processor, causes the processor to perform at least one of a text recognition method and a text processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
FIG. 16 is a diagram that illustrates an internal structure of the computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 16, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may further store a computer program that, when executed by the processor, causes the processor to implement at least one of a text recognition method and a text processing method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform at least one of a text recognition method and a text processing method.
It will be appreciated by those skilled in the art that the configurations shown in fig. 15 and 16 are block diagrams of only some of the configurations relevant to the present disclosure, and do not constitute a limitation on the computer devices to which the present disclosure may be applied; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, the text recognition apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device such as the computer devices shown in fig. 15 and 16. The memory of the computer device may store various program modules constituting the text recognition apparatus, such as a target candidate word obtaining module 1202, a set obtaining module 1204, an importance calculating module 1206, a relevance obtaining module 1208, and a domain word obtaining module 1210 shown in fig. 12. The program modules constitute computer programs that cause a processor to execute the steps in the text recognition methods of the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 16 may, through the target candidate word obtaining module 1202 in the text recognition apparatus shown in fig. 12, obtain a text to be recognized and obtain target candidate words according to characters in the text to be recognized; through the set obtaining module 1204, acquire a general field text set and a target text set of the target field corresponding to the text to be recognized; through the importance calculating module 1206, calculate the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; through the relevancy obtaining module 1208, calculate the target relevance between the target candidate word and the target field according to the target importance and the reference importance corresponding to the target candidate word; and through the domain word obtaining module 1210, take the target candidate word as a domain word of the target field according to the target relevance.
In one embodiment, the text processing apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device such as the computer devices shown in fig. 15 and 16. The memory of the computer device may store the program modules constituting the text processing apparatus, such as the initial input text acquisition module 1402, the association relation obtaining module 1404, the target domain word obtaining module 1406, and the target input text obtaining module 1408 shown in fig. 14. The computer program constituted by these program modules causes the processor to execute the steps in the text processing method of each embodiment of the present application described in the present specification.
For example, the computer device shown in fig. 16 may, through the initial input text acquisition module 1402 in the text processing apparatus shown in fig. 14, acquire the initial input text input in the application; through the association relation obtaining module 1404, acquire the association relation corresponding to the target field of the application, where the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized of the application, a general field text set, and a target text set corresponding to the target field; through the target domain word obtaining module 1406, determine the target domain word corresponding to the initial input text according to the initial input text and the association relation; and through the target input text obtaining module 1408, adjust the initial input text according to the target domain word to obtain the target input text.
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a text to be recognized, and obtaining target candidate words according to characters in the text to be recognized; acquiring a general field text set and a target text set of a target field corresponding to a text to be recognized; calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; calculating a target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance; and taking the target candidate words as the field words of the target field according to the target relevance.
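The importance and relevance steps above can be sketched in Python as follows. This is a minimal illustration, not the formula of the embodiments: it assumes that importance is the fraction of texts containing the word (one measure that grows with frequency of occurrence), that relevance is the ratio of target importance to reference importance, and that a fixed number of top-ranked candidates become domain words. The function names and the smoothing constant `eps` are hypothetical.

```python
def importance(word, texts):
    """Fraction of texts that contain `word` (assumed importance measure)."""
    if not texts:
        return 0.0
    return sum(1 for text in texts if word in text) / len(texts)


def target_relevance(word, target_texts, general_texts, eps=1e-6):
    """Ratio of target importance to reference importance (assumed formula).

    `eps` is a hypothetical smoothing constant that avoids division by zero
    when the word never occurs in the general field text set.
    """
    target_imp = importance(word, target_texts)
    reference_imp = importance(word, general_texts)
    return target_imp / (reference_imp + eps)


def select_domain_words(candidates, target_texts, general_texts, top_n):
    """Rank candidates by relevance, largest first, and keep the first top_n."""
    ranked = sorted(
        candidates,
        key=lambda w: target_relevance(w, target_texts, general_texts),
        reverse=True,
    )
    return ranked[:top_n]
```

A word that is common in the target text set but rare in the general field text set gets a large ratio under this assumption, while a word that is equally common everywhere scores close to 1.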
In one embodiment, after taking the target candidate word as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the steps of: determining mapping characters corresponding to the domain words according to a mapping relation, wherein the mapping relation comprises at least one of shape-proximity mapping and sound-proximity mapping; and establishing an association relation between the domain words and the mapping characters.
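As an illustration of the mapping step, the sketch below derives mapping characters for a domain word by substituting one character at a time from lookup tables, then records which domain word each variant maps back to. The two tables are tiny hypothetical examples (a real system would need full shape-proximity and sound-proximity dictionaries), and the one-substitution strategy is an assumption rather than the method of the embodiments.

```python
# Hypothetical lookup tables, for illustration only.
SHAPE_NEAR = {"打": ["扛"]}          # characters assumed to look similar
SOUND_NEAR = {"野": ["也", "冶"]}    # characters with the same pinyin (ye3)


def mapping_characters(word, near_tables):
    """Return variants of `word` with exactly one character substituted."""
    variants = set()
    for i, ch in enumerate(word):
        for table in near_tables:
            for alt in table.get(ch, ()):
                variants.add(word[:i] + alt + word[i + 1:])
    return variants


def build_association(domain_words, near_tables):
    """Map each variant string to the set of domain words it may stand for."""
    association = {}
    for word in domain_words:
        for variant in mapping_characters(word, near_tables):
            association.setdefault(variant, set()).add(word)
    return association
```

With these tables, `build_association(["打野"], [SHAPE_NEAR, SOUND_NEAR])` associates each misspelled variant with the domain word 打野, which is the lookup direction needed later when correcting input text.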
In one embodiment, the obtaining the target candidate word according to the characters in the text to be recognized by the processor comprises: generating an initial candidate word set according to the adjacent relation of characters in the text to be recognized; calculating word association degree and word independence degree of each initial candidate word in the initial candidate word set in the target text set; calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the target candidate words from the initial candidate word set according to the word generation degree of each initial candidate word.
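The candidate-generation steps above can be sketched as follows. The specific measures are assumptions chosen for illustration: word association degree is taken as the minimum pointwise mutual information over all two-part splits of the candidate, word independence degree as the smaller of the left-neighbor and right-neighbor entropies, and word generation degree as their sum. The target text set is represented as one concatenated string, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_candidates(text, max_len=3):
    """All runs of 2..max_len adjacent characters in `text`."""
    return {
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def association_degree(word, corpus):
    """Minimum PMI over all two-part splits of `word` (assumed measure)."""
    total = len(corpus)
    p_word = corpus.count(word) / total
    if p_word == 0:
        return -math.inf
    return min(
        (
            math.log(p_word / ((corpus.count(word[:k]) / total)
                               * (corpus.count(word[k:]) / total)))
            for k in range(1, len(word))
        ),
        default=0.0,
    )


def _entropy(symbols):
    """Shannon entropy (natural log) of a list of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def independence_degree(word, corpus):
    """Smaller of left and right neighbor entropy (assumed measure)."""
    left, right = [], []
    start = corpus.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:
            left.append(corpus[start - 1])
        if end < len(corpus):
            right.append(corpus[end])
        start = corpus.find(word, start + 1)
    return min(_entropy(left), _entropy(right))


def word_generation_degree(word, corpus):
    """Sum of association and independence degree (assumed combination)."""
    return association_degree(word, corpus) + independence_degree(word, corpus)
```

Target candidate words would then be screened by keeping the initial candidates whose generation degree exceeds a chosen threshold; the threshold itself is not specified in this passage.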
In one embodiment, the processor performs the step of calculating the word association degree of each initial candidate word in the initial candidate word set in the target text set, including: determining corresponding association confidence according to the occurrence times of the initial candidate words in the target text set; determining the word initial association degree of the initial candidate words according to the occurrence probability of the initial candidate words in the target text set; and calculating to obtain a word target association degree according to the association confidence degree corresponding to the initial candidate word and the word initial association degree.
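One way to read the confidence step is as a discount applied to scores estimated from few observations. The sketch below assumes a saturating confidence `count / (count + k)` and a multiplicative combination; both the form and the constant `k` are hypothetical choices, not values given in the embodiments.

```python
def association_confidence(count, k=5.0):
    """Confidence in [0, 1) that grows with the occurrence count (assumed form)."""
    return count / (count + k)


def target_association(count, initial_association, k=5.0):
    """Discount the initial association degree by the confidence (assumed product)."""
    return association_confidence(count, k) * initial_association
```

Under this assumption, a candidate seen only once keeps about a sixth of its initial association degree when `k` is 5, while a candidate seen hundreds of times keeps nearly all of it.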
In one embodiment, the computer program further causes the processor to perform the steps of: when the word independence degree corresponding to the initial candidate word is smaller than a first threshold value, forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized; and adding the new initial candidate word into the initial candidate word set.
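The extension step can be sketched as below: a candidate whose independence degree falls below the first threshold is treated as a fragment of a longer word, so it is extended by one adjacent character on each side wherever it occurs in the text to be recognized. The `independence` argument is a caller-supplied scoring function, and extending by exactly one character per side is an assumption made for illustration.

```python
def extend_candidates(candidates, text, independence, threshold):
    """Add one-character extensions of candidates with low independence.

    `independence` is a callable mapping a word to its independence degree;
    how that degree is computed is left to the caller.
    """
    extended = set(candidates)
    for word in candidates:
        if independence(word) >= threshold:
            continue  # the word stands on its own, no extension needed
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                extended.add(text[start - 1:end])   # extend to the left
            if end < len(text):
                extended.add(text[start:end + 1])   # extend to the right
            start = text.find(word, start + 1)
    return extended
```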
In one embodiment, the step of calculating, by the processor, the target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance includes: calculating according to the target importance corresponding to the target candidate word and the reference importance to obtain the initial correlation degree of the target candidate word and the target field; determining a corresponding correlation degree confidence coefficient according to the occurrence times of the target candidate words in the target text set; and obtaining the target correlation degree according to the initial correlation degree and the correlation degree confidence degree.
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring an initial input text entered in an application; acquiring an association relation corresponding to a target field of the application, wherein the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized, a general field text set and a target text set corresponding to the target field; determining a target field word corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target field word to obtain a target input text.
In one embodiment, the processor performs the step of adjusting the initial input text according to the target domain word to obtain the target input text, including: acquiring each candidate input word corresponding to the initial input text; constructing a word relation chain set according to the composition relation of the words of the initial input text, the candidate input words and the target field words; calculating the transition probability of the forward word to the current word in the word relation chain; obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain; and screening the word relation chain set according to the connection strength of the word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as a target input text.
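The word relation chain steps can be illustrated with a small Python sketch. It assumes that each position in the segmented input has a list of candidate words (including any target field word), that a chain picks one candidate per position, that the transition probability is an add-one smoothed bigram estimate, and that connection strength is the sum of log transition probabilities. The smoothing and the log-sum are assumptions; the embodiments only state that strength is obtained from the transition probabilities.

```python
import itertools
import math
from collections import Counter


def best_chain(candidates_per_word, bigram_counts, unigram_counts, vocab_size):
    """Return the chain with the greatest connection strength and that strength.

    candidates_per_word: list of candidate lists, one list per segmented word.
    bigram_counts / unigram_counts: Counter objects built from a corpus.
    """
    def transition(prev, cur):
        # Add-one smoothed estimate of P(cur | prev); smoothing is an assumption.
        return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + vocab_size)

    best, best_strength = None, -math.inf
    # Enumerate every chain formed by taking one candidate per position.
    for chain in itertools.product(*candidates_per_word):
        strength = sum(
            math.log(transition(prev, cur))
            for prev, cur in zip(chain, chain[1:])
        )
        if strength > best_strength:
            best, best_strength = chain, strength
    return best, best_strength
```

Exhaustive enumeration keeps the sketch short but grows exponentially with sentence length; a dynamic-programming search such as Viterbi decoding would be the usual substitute for longer inputs.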
In one embodiment, the obtaining of initial input text entered by the application performed by the processor comprises: acquiring a query sentence input by an application, and taking the query sentence as an initial input text; the computer program further causes the processor to perform the steps of: acquiring a query request, wherein the query request comprises a target input text corresponding to a query statement; and acquiring query response data obtained according to the target input text.
In one embodiment, the computer program further causes the processor to perform the steps of: detecting a target type corresponding to a target input text; and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
In one embodiment, the computer program further causes the processor to perform the steps of: acquiring a text to be recognized of an application, and obtaining a target candidate word according to characters in the text to be recognized; acquiring a general field text set and a target text set; calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; calculating according to the target importance corresponding to the target candidate word and the reference importance to obtain a target correlation degree between the target candidate word and the target field; and taking the target candidate words as the field words of the target field according to the target relevance.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the steps of: determining mapping characters corresponding to the domain words according to a mapping relation, wherein the mapping relation comprises at least one of shape-proximity mapping and sound-proximity mapping; and establishing an association relation between the domain words and the corresponding mapping characters.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of: acquiring a text to be recognized, and obtaining a target candidate word according to characters in the text to be recognized; acquiring a general field text set and a target text set of a target field corresponding to a text to be recognized; calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; calculating a target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance; and taking the target candidate words as the field words of the target field according to the target relevance.
In one embodiment, after taking the target candidate word as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the steps of: determining mapping characters corresponding to the domain words according to a mapping relation, wherein the mapping relation comprises at least one of shape-proximity mapping and sound-proximity mapping; and establishing an association relation between the domain words and the mapping characters.
In one embodiment, the obtaining the target candidate word according to the characters in the text to be recognized, executed by the processor, includes: generating an initial candidate word set according to the proximity relation of characters in the text to be recognized; calculating word association degree and word independence degree of each initial candidate word in the initial candidate word set in the target text set; calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree; and screening the initial candidate word set according to the word generation degree of each initial candidate word to obtain a target candidate word.
In one embodiment, the calculating, by the processor, a word association degree of each initial candidate word in the initial candidate word set in the target text set includes: determining corresponding association confidence according to the occurrence times of the initial candidate words in the target text set; determining the word initial association degree of the initial candidate words according to the occurrence probability of the initial candidate words in the target text set; and calculating the word target association degree according to the association confidence degree corresponding to the initial candidate word and the word initial association degree.
In one embodiment, the computer program further causes the processor to perform the steps of: when the word independence degree corresponding to the initial candidate word is smaller than a first threshold value, forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized; and adding the new initial candidate word into the initial candidate word set.
In one embodiment, the calculating, by the processor, the target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance includes: calculating according to the target importance corresponding to the target candidate word and the reference importance to obtain the initial correlation degree of the target candidate word and the target field; determining a corresponding correlation degree confidence coefficient according to the occurrence times of the target candidate words in the target text set; and obtaining the target correlation degree according to the initial correlation degree and the correlation degree confidence degree.
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of: acquiring an initial input text input by an application; acquiring an association relation corresponding to a target field of the application, wherein the association relation is an association relation between a field word and a mapping character, and the field word is obtained by recognition according to a text to be recognized, a general field text set and a target text set corresponding to the target field; determining a target domain word corresponding to the initial input text according to the initial input text and the association relation; and adjusting the initial input text according to the target field words to obtain a target input text.
In one embodiment, the processor performs the step of adjusting the initial input text according to the target domain word to obtain the target input text, including: acquiring each candidate input word corresponding to the initial input text; constructing a word relation chain set according to the composition relation of words of the initial input text, the candidate input words and the target field words; calculating the transition probability of the forward word to the current word in the word relation chain; obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain; and screening the word relation chain set according to the connection strength of the word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as a target input text.
In one embodiment, the obtaining of initial input text entered by the application performed by the processor comprises: acquiring a query sentence input by an application, and taking the query sentence as an initial input text; the computer program further causes the processor to perform the steps of: acquiring a query request, wherein the query request comprises a target input text corresponding to a query statement; and acquiring query response data obtained according to the target input text.
In one embodiment, the computer program further causes the processor to perform the steps of: detecting a target type corresponding to a target input text; and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
In one embodiment, the computer program further causes the processor to perform the steps of: acquiring a text to be recognized of an application, and obtaining a target candidate word according to characters in the text to be recognized; acquiring a general field text set and a target text set; calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; calculating according to the target importance corresponding to the target candidate word and the reference importance to obtain a target correlation degree between the target candidate word and the target field; and taking the target candidate words as the field words of the target field according to the target relevance.
In one embodiment, after the target candidate word is taken as a domain word of the target domain according to the target relevance, the computer program further causes the processor to perform the steps of: determining mapping characters corresponding to the domain words according to a mapping relation, wherein the mapping relation comprises at least one of shape-proximity mapping and sound-proximity mapping; and establishing an association relation between the domain words and the corresponding mapping characters.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (22)

1. A method of text recognition, the method comprising:
acquiring a text to be recognized, and obtaining a target candidate word according to characters in the text to be recognized;
acquiring a general field text set and a target text set of a target field corresponding to the text to be recognized;
calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; the target importance and the frequency of occurrence of the target candidate word in the target text set form a positive correlation, and the reference importance and the frequency of occurrence of the target candidate word in the general field text set form a positive correlation;
calculating a target relevance between the target candidate word and the target field according to a target importance corresponding to the target candidate word and a reference importance;
selecting a preset number of target candidate words from each target candidate word according to the sequence of the target relevance from large to small as field words of the target field;
segmenting the initial input text to obtain each segmented word, and determining candidate input words corresponding to each segmented word; candidate input words corresponding to the segmented words are similar to the segmented words in shape or have the same pinyin, and a word relation chain set is constructed according to the composition relation of the words of the initial input text and each candidate input word; the initial input text belongs to the target field, each candidate input word comprises a target field word, and the target field word belongs to a field word of the target field; the word relation chain set comprises one or more word relation chains, and each word relation chain is formed by sequentially connecting candidate input words corresponding to each segmented word;
calculating the transition probability of transferring from at least one forward word to the current word in each word relation chain; the forward word is a word before the current word in the word relationship chain, and the transition probability of the at least one forward word to the current word represents the probability of the current word occurring if the at least one forward word occurs;
obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain;
and screening the word relation chain set according to the connection strength of the word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as a target input text.
2. The method of claim 1, wherein after selecting a preset number of target candidate words from the target candidate words in an order from a large target relevance to a small target relevance as the domain words of the target domain, the method further comprises:
determining mapping characters corresponding to the field words according to a mapping relation, wherein the mapping relation comprises at least one of shape-near mapping and sound-near mapping;
and establishing an association relation between the domain words and the mapping characters.
3. The method of claim 1, wherein obtaining target candidate words according to the characters in the text to be recognized comprises:
generating an initial candidate word set according to the adjacent relation of the characters in the text to be recognized;
calculating word association degree and word independence degree of each initial candidate word in the initial candidate word set in the target text set;
calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree;
and screening the target candidate words from the initial candidate word set according to the word generation degree of each initial candidate word.
4. The method of claim 3, wherein the calculating the word association degree of each initial candidate word in the initial candidate word set in the target text set comprises:
determining corresponding associated confidence degrees according to the occurrence times of the initial candidate words in the target text set;
determining the initial word association degree of the initial candidate words according to the occurrence probability of the initial candidate words in the target text set;
and calculating to obtain word target association degree according to the association confidence degree corresponding to the initial candidate word and the word initial association degree.
5. The method of claim 3, further comprising:
when the word independence degree corresponding to the initial candidate word is smaller than a first threshold, forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized;
and adding the new initial candidate word into the initial candidate word set.
6. The method of claim 1, wherein the calculating the target relevance between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance comprises:
calculating to obtain an initial correlation degree of the target candidate word and the target field according to a target importance degree corresponding to the target candidate word and a reference importance degree;
determining a corresponding relevancy confidence degree according to the occurrence times of the target candidate words in the target text set;
and obtaining the target relevance according to the initial relevance and the relevance confidence.
7. The method of claim 1, further comprising:
detecting a target type corresponding to the target input text;
and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
8. A method of text processing, the method comprising:
acquiring a query statement input in an application, and taking the query statement as an initial input text;
acquiring an association relation corresponding to a target field corresponding to the initial input text, wherein the association relation is an association relation between a field word and a mapping character, the field word is determined according to a target importance of a target candidate word in a target text set corresponding to the target field and a reference importance of the target candidate word in a general field text set, and the target candidate word is obtained according to characters in a text to be recognized corresponding to the target field; the target importance and the frequency of the target candidate words appearing in the target text set form a positive correlation, and the reference importance and the frequency of the target candidate words appearing in the general field text set form a positive correlation;
determining a target field word corresponding to the initial input text according to the initial input text and the association relation;
segmenting the initial input text to obtain each segmented word, and determining candidate input words corresponding to each segmented word; candidate input words corresponding to the segmented words are similar to the segmented words in shape or have the same pinyin, and a word relation chain set is constructed according to the composition relation of the words of the initial input text and each candidate input word; each candidate input word comprises a target field word, the word relation chain set comprises one or more word relation chains, and each word relation chain is a relation chain formed by sequentially connecting candidate input words corresponding to each segmented word;
calculating the transition probability of transferring from at least one forward word to the current word in each word relation chain; the forward word is a word before the current word in the word relationship chain, and the transition probability of the at least one forward word to the current word represents the probability of the current word appearing when the at least one forward word appears;
obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain;
screening a target word relation chain from the word relation chain set according to the connection strength of the word relation chain, and taking a text corresponding to the target word relation chain as a target input text;
and acquiring a query request, wherein the query request comprises the target input text.
9. The method of claim 8, further comprising:
and acquiring query response data obtained according to the target input text.
10. The method of claim 8, further comprising:
acquiring the text to be recognized, and acquiring a target candidate word according to characters in the text to be recognized;
acquiring the general field text set and the target text set;
calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set;
calculating to obtain a target correlation degree between the target candidate word and the target field according to the target importance degree corresponding to the target candidate word and the reference importance degree;
and taking the target candidate word as a field word of the target field according to the target relevance.
11. A text recognition apparatus, the apparatus comprising:
the target candidate word obtaining module is used for obtaining a text to be recognized and obtaining a target candidate word according to characters in the text to be recognized;
the set acquisition module is used for acquiring a general field text set and a target text set of a target field corresponding to the text to be recognized;
the importance calculation module is used for calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set; the target importance and the frequency of the target candidate words appearing in the target text set form a positive correlation, and the reference importance and the frequency of the target candidate words appearing in the general field text set form a positive correlation;
the relevancy obtaining module is used for calculating and obtaining the target relevancy between the target candidate word and the target field according to the target importance corresponding to the target candidate word and the reference importance;
a domain word obtaining module, configured to select a preset number of target candidate words from each target candidate word according to a sequence of target relevance from large to small, where the target candidate words serve as domain words of the target domain;
the target input text obtaining module is used for segmenting the initial input text to obtain each segmented word and determining candidate input words corresponding to each segmented word; candidate input words corresponding to the segmented words are similar to the segmented words in shape or have the same pinyin, and a word relation chain set is constructed according to the composition relation of the words of the initial input text and each candidate input word; the initial input text belongs to the target field, each candidate input word comprises a target field word, and the target field word belongs to a field word of the target field; the word relation chain set comprises one or more word relation chains, and each word relation chain is formed by sequentially connecting candidate input words corresponding to each segmented word; calculating the transition probability of transferring from at least one forward word to the current word in each word relation chain; the forward word is a word before the current word in the word relationship chain, and the transition probability of the at least one forward word to the current word represents the probability of the current word appearing when the at least one forward word appears; obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain; and screening the word relation chain set according to the connection strength of the word relation chain to obtain a target word relation chain, and taking the text corresponding to the target word relation chain as a target input text.
12. The apparatus of claim 11, further comprising:
the mapping character determining module is used for determining the mapping characters corresponding to the field words according to a mapping relation, wherein the mapping relation comprises at least one of shape-proximity mapping and sound-proximity mapping;
and the association relation establishing module is used for establishing the association relation between the field words and the mapping characters.
13. The apparatus of claim 11, wherein the target candidate word derivation module is configured to:
generating an initial candidate word set according to the adjacent relation of the characters in the text to be recognized;
calculating word association degree and word independence degree of each initial candidate word in the initial candidate word set in the target text set;
calculating the word generation degree of each initial candidate word according to the word association degree and the word independence degree;
and screening the target candidate words from the initial candidate word set according to the word generation degree of each initial candidate word.
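The screening in claim 13 can be sketched as follows. The claim does not define the two measures, so this example assumes a common choice: pointwise mutual information for the word association degree, the smaller of the left and right neighbour entropies for the word independence degree, and their product for the word generation degree. All function names, the toy texts, and the threshold are hypothetical.

```python
import math
from collections import Counter


def substring_counts(texts, max_len):
    """Count every substring of up to max_len characters."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def association_degree(word, counts, total_chars):
    """Minimum pointwise mutual information over all two-part splits."""
    p_word = counts[word] / total_chars
    return min(
        math.log(p_word / ((counts[word[:i]] / total_chars) * (counts[word[i:]] / total_chars)))
        for i in range(1, len(word))
    )


def independence_degree(word, texts):
    """Smaller of the left and right neighbour entropies."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(word)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(word, start + 1)

    def entropy(neighbours):
        n = sum(neighbours.values())
        return -sum(c / n * math.log(c / n) for c in neighbours.values()) if n else 0.0

    return min(entropy(left), entropy(right))


def generation_degree(word, texts, counts, total_chars):
    """Assumed combination: product of association and independence."""
    return association_degree(word, counts, total_chars) * independence_degree(word, texts)


texts = ["abxy", "cbxd", "ebxf"]
counts = substring_counts(texts, max_len=2)
total_chars = sum(len(t) for t in texts)
initial_candidates = ["bx", "ab"]
targets = [
    w for w in initial_candidates
    if generation_degree(w, texts, counts, total_chars) > 1.0
]
```

In the toy texts, "bx" appears with varied neighbours on both sides and is kept, while "ab" always sits at a text boundary followed by the same character and is screened out.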
14. The apparatus of claim 13, wherein the target candidate word derivation module is further configured to:
determining a corresponding association confidence according to the number of occurrences of the initial candidate word in the target text set;
determining the initial word association degree of the initial candidate word according to the occurrence probability of the initial candidate word in the target text set;
and calculating the target word association degree according to the association confidence corresponding to the initial candidate word and the initial word association degree.
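One way to read claim 14 is that the confidence discounts association scores estimated from very few occurrences. The sketch below assumes a saturating form `count / (count + k)` and a multiplicative combination; neither the formula nor the constant `k` comes from the claim.

```python
def association_confidence(count, k=5.0):
    """Grows from 0 toward 1 as the occurrence count increases (assumed form)."""
    return count / (count + k)


def target_association(initial_association, count, k=5.0):
    """Discount the initial association degree by its confidence."""
    return association_confidence(count, k) * initial_association


rare = target_association(2.0, count=1)
frequent = target_association(2.0, count=100)
```

Two candidates with the same initial association degree then rank differently: the one seen 100 times keeps most of its score, while the one seen once is heavily discounted.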
15. The apparatus of claim 13, wherein the target candidate word derivation module is further configured to:
when the word independence degree corresponding to the initial candidate word is smaller than a first threshold, forming a new initial candidate word according to the initial candidate word and adjacent characters of the initial candidate word in the text to be recognized;
and adding the new initial candidate word into the initial candidate word set.
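The extension step of claim 15 can be sketched as below, assuming the independence degrees have already been computed and are passed in as a dictionary. The function name, the dictionary, and the sample values are hypothetical.

```python
def extend_candidates(candidates, text, independence, first_threshold):
    """Add longer candidates built from low-independence words and their neighbours."""
    extended = set(candidates)
    for word in candidates:
        if independence[word] >= first_threshold:
            continue  # independent enough to stand as a word on its own
        start = text.find(word)
        while start != -1:
            if start > 0:
                extended.add(text[start - 1] + word)  # absorb the left neighbour
            end = start + len(word)
            if end < len(text):
                extended.add(word + text[end])  # absorb the right neighbour
            start = text.find(word, start + 1)
    return extended


result = extend_candidates(
    candidates={"bc", "ab"},
    text="abcd",
    independence={"bc": 0.1, "ab": 2.0},
    first_threshold=0.5,
)
```

Here "bc" falls below the threshold, so "abc" and "bcd" join the set, while "ab" is left as it is.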
16. The apparatus of claim 11, wherein the correlation obtaining module is further configured to:
calculating to obtain an initial correlation degree of the target candidate word and the target field according to the target importance degree corresponding to the target candidate word and the reference importance degree;
determining a corresponding relevancy confidence degree according to the occurrence times of the target candidate words in the target text set;
and obtaining the target correlation degree according to the initial correlation degree and the correlation degree confidence degree.
17. The apparatus of claim 11, wherein the apparatus is further configured to:
detecting a target type corresponding to the target input text;
and when the target type corresponding to the target input text is a preset type, filtering the initial input text.
18. A text processing apparatus, the apparatus comprising:
an initial input text acquisition module, configured to acquire a query statement input in an application, and use the query statement as an initial input text;
an association relationship obtaining module, configured to obtain the association relationship corresponding to a target field corresponding to the initial input text, wherein the association relationship is an association relationship between a field word and a mapping character, the field word is determined according to the target importance of a target candidate word in a target text set corresponding to the target field and the reference importance of the target candidate word in a general field text set, and the target candidate word is obtained according to characters in a text to be recognized corresponding to the target field; the target importance is positively correlated with the frequency with which the target candidate word appears in the target text set, and the reference importance is positively correlated with the frequency with which the target candidate word appears in the general field text set;
a target domain word obtaining module, configured to determine a target domain word corresponding to the initial input text according to the initial input text and the association relationship;
the target input text obtaining module is used for segmenting the initial input text to obtain each segmented word and determining candidate input words corresponding to each segmented word; candidate input words corresponding to the segmented words are similar to the segmented words in shape or have the same pinyin, and a word relation chain set is constructed according to the composition relation of the words of the initial input text and each candidate input word; each candidate input word comprises a target domain word, the word relation chain set comprises one or more word relation chains, and each word relation chain is a relation chain formed by sequentially connecting candidate input words corresponding to each segmented word; calculating the transition probability of transferring from at least one forward word to the current word in each word relation chain; the forward word is a word before the current word in the word relationship chain, and the transition probability of the at least one forward word to the current word represents the probability of the current word occurring if the at least one forward word occurs; obtaining the connection strength of the word relation chain according to each transition probability corresponding to the word relation chain; screening a target word relation chain from the word relation chain set according to the connection strength of the word relation chain, and taking a text corresponding to the target word relation chain as a target input text;
the apparatus is further configured to acquire a query request, wherein the query request comprises the target input text corresponding to the query statement.
19. The apparatus of claim 18, wherein the apparatus is further configured to acquire query response data obtained according to the target input text.
20. The apparatus of claim 18, wherein the apparatus is further configured to:
acquiring the text to be recognized, and obtaining target candidate words according to characters in the text to be recognized;
acquiring the general field text set and the target text set;
calculating the target importance of the target candidate words in the target text set and the reference importance of the target candidate words in the general field text set;
calculating a target relevance between the target candidate word and the target field according to a target importance corresponding to the target candidate word and a reference importance;
and taking the target candidate word as a field word of the target field according to the target relevance.
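The steps of claim 20 can be illustrated with a short sketch. It assumes that importance is the relative frequency of the word in a text set (one simple measure that is positively correlated with frequency) and that the target relevance is the log ratio of the two importances; the function names, the smoothing constant `eps`, the threshold, and the sample texts are hypothetical.

```python
import math


def importance(word, texts):
    """Relative frequency of word in a text set (assumed importance measure)."""
    total_chars = sum(len(t) for t in texts)
    occurrences = sum(t.count(word) for t in texts)
    return occurrences / total_chars if total_chars else 0.0


def target_relevance(word, target_texts, general_texts, eps=1e-9):
    """Log ratio of target importance to reference importance, smoothed by eps."""
    target = importance(word, target_texts) + eps
    reference = importance(word, general_texts) + eps
    return math.log(target / reference)


def field_words(candidates, target_texts, general_texts, threshold=1.0):
    """Keep candidates that are markedly more important in the target field."""
    return [
        w for w in candidates
        if target_relevance(w, target_texts, general_texts) > threshold
    ]


target_texts = ["gank the jungle", "jungle gank now"]
general_texts = ["the weather is fine now", "the jungle is far"]
selected = field_words(["gank", "the"], target_texts, general_texts)
```

A word such as "the" is frequent in both sets and is rejected, whereas "gank" is frequent only in the target set and is taken as a field word.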
21. A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of at least one of the text recognition method of any one of claims 1 to 7 and the text processing method of any one of claims 8 to 10.
22. A computer-readable storage medium, having stored thereon a computer program, which, when executed by a processor, causes the processor to carry out the steps of at least one of the method of text recognition according to any one of claims 1 to 7 and the method of text processing according to any one of claims 8 to 10.
CN201811168737.9A 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium Active CN110162681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811168737.9A CN110162681B (en) 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110162681A (en) 2019-08-23
CN110162681B (en) 2023-04-18

Family

ID=67645117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811168737.9A Active CN110162681B (en) 2018-10-08 2018-10-08 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110162681B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765996B (en) * 2019-10-21 2022-07-29 北京百度网讯科技有限公司 Text information processing method and device
CN111552806B (en) * 2020-04-16 2021-11-02 重庆大学 Method for unsupervised construction of entity set in building field
CN111710328B (en) * 2020-06-16 2024-01-12 北京爱医声科技有限公司 Training sample selection method, device and medium for speech recognition model
CN112101020B (en) * 2020-08-27 2023-08-04 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for training key phrase identification model
CN113743409A (en) * 2020-08-28 2021-12-03 北京沃东天骏信息技术有限公司 Text recognition method and device
CN112016305B (en) * 2020-09-09 2023-03-28 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN113744736B (en) * 2021-09-08 2023-12-08 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11238051A (en) * 1998-02-23 1999-08-31 Toshiba Corp Chinese input conversion processor, chinese input conversion processing method and recording medium stored with chinese input conversion processing program
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6904402B1 (en) * 1999-11-05 2005-06-07 Microsoft Corporation System and iterative method for lexicon, segmentation and language model joint optimization
US7092567B2 (en) * 2002-11-04 2006-08-15 Matsushita Electric Industrial Co., Ltd. Post-processing system and method for correcting machine recognized text
CN101572083B (en) * 2008-04-30 2011-09-07 富士通株式会社 Method and device for making up words by using prosodic words
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN106681981B (en) * 2015-11-09 2019-10-25 北京国双科技有限公司 The mask method and device of Chinese part of speech
CN106708893B (en) * 2015-11-17 2018-09-28 华为技术有限公司 Search query word error correction method and device
CN107102746B (en) * 2016-02-19 2023-03-24 北京搜狗科技发展有限公司 Candidate word generation method and device and candidate word generation device
JP6703709B2 (en) * 2016-02-25 2020-06-03 国立研究開発法人情報通信研究機構 Automatic translation feature weight optimization apparatus and computer program therefor

Also Published As

Publication number Publication date
CN110162681A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110162681B (en) Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN110765763B (en) Error correction method and device for voice recognition text, computer equipment and storage medium
CN110457431B (en) Knowledge graph-based question and answer method and device, computer equipment and storage medium
CN110334179B (en) Question-answer processing method, device, computer equipment and storage medium
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN108491511B (en) Data mining method and device based on graph data and model training method and device
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN111461301B (en) Serialized data processing method and device, and text processing method and device
CN110175273B (en) Text processing method and device, computer readable storage medium and computer equipment
US11275888B2 (en) Hyperlink processing method and apparatus
CN109033427B (en) Stock screening method and device, computer equipment and readable storage medium
CN112650858B (en) Emergency assistance information acquisition method and device, computer equipment and medium
CN112632139A (en) Information pushing method and device based on PMIS system, computer equipment and medium
CN113204618A (en) Information identification method, device and equipment based on semantic enhancement and storage medium
CN114399396A (en) Insurance product recommendation method and device, computer equipment and storage medium
CN115481229A (en) Method and device for pushing answer call, electronic equipment and storage medium
CN115495553A (en) Query text ordering method and device, computer equipment and storage medium
CN111651574A (en) Event type identification method and device, computer equipment and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
CN116049370A (en) Information query method and training method and device of information generation model
CN113204613B (en) Address generation method, device, equipment and storage medium
CN114242047A (en) Voice processing method and device, electronic equipment and storage medium
CN112328781B (en) Message recommendation method and system and electronic equipment
CN114218431A (en) Video searching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant