CN111368506B

CN111368506B - Text processing method and device

Info

Publication number: CN111368506B
Application number: CN201811585329.3A
Authority: CN
Inventors: 刘恒友; 李辰; 包祖贻; 徐光伟; 李林琳; 司罗
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-12-24
Filing date: 2018-12-24
Publication date: 2023-04-28
Anticipated expiration: 2038-12-24
Also published as: CN111368506A

Abstract

The embodiments of the application provide a text processing method and device. Because the preset special vocabulary recognition model is trained based on a conditional random field model, it can use the semantic context of the target text to identify, as special vocabulary, words that do not belong to the preset special vocabulary library but do not need to be corrected. Words that do not need to be corrected are correct words, so the special vocabulary identified in the target text can be determined to be correct vocabulary. The prior art can only determine whether a word that is in the special vocabulary library is correct; the application can also determine whether a word that is not in the special vocabulary library is correct.

Description

Text processing method and device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a text processing method and apparatus.

Background

In dialogue and search scenarios, a user usually needs to input sentences on a terminal. The sentences input by the user sometimes contain incorrect vocabulary, so the terminal needs to determine whether the vocabulary in a sentence is correct and, if so, perform further operations.

In the prior art, the various types of special vocabulary in common use can be collected in advance. Special vocabulary includes named entities, that is, entities with a specific meaning in a language, such as person names, place names, and organization names. The collected special vocabulary is then assembled into a special vocabulary library. When a user inputs a dialogue sentence on the terminal, the terminal segments the sentence to obtain the words it contains. For each word obtained, the terminal searches the special vocabulary library, and if the word exists in the library, the word is determined to be correct.

However, the inventors have found that, in the prior art, if a word does not exist in the special vocabulary library, it cannot be determined whether the word is correct. Moreover, if a word includes a traditional Chinese character, it also cannot be determined whether the word is correct.

Disclosure of Invention

In order to solve the above technical problems, the embodiments of the present application show a text processing method and apparatus.

In a first aspect, embodiments of the present application show a text processing method, the method including:

acquiring a target text;

acquiring a preset special vocabulary recognition model which is trained based on a conditional random field model and a preset special vocabulary library;

identifying the special vocabulary in the target text by using the preset special vocabulary identification model;

and determining the determined special vocabulary in the target text as the correct vocabulary.

In an optional implementation manner, the identifying the special vocabulary in the target text by using the preset special vocabulary recognition model includes:

detecting whether a traditional Chinese character exists in the target text;

if a traditional Chinese character exists in the target text, converting the traditional Chinese character in the target text into the corresponding simplified Chinese character;

and determining the special vocabulary in the converted target text by using the preset special vocabulary recognition model.

In an alternative implementation, the method further includes:

determining the vocabulary other than the special vocabulary in the target text as non-proprietary vocabulary, wherein the non-proprietary vocabulary includes wrong vocabulary;

searching, in a first correspondence between wrong vocabulary and correct vocabulary, whether a wrong vocabulary identical to the non-proprietary vocabulary exists;

if a wrong vocabulary identical to the non-proprietary vocabulary exists in the first correspondence, searching the first correspondence for the correct vocabulary corresponding to the wrong vocabulary;

replacing the non-proprietary vocabulary with the correct vocabulary in the target text.

In an alternative implementation, the method further includes:

if no wrong vocabulary identical to the non-proprietary vocabulary exists in the first correspondence, determining the user who input the target text;

acquiring a user-defined vocabulary set of the user, wherein the user-defined vocabulary set stores correct vocabularies set by the user;

searching whether the non-proprietary vocabulary exists in the custom vocabulary set;

and if the non-proprietary vocabulary exists in the custom vocabulary set, determining the non-proprietary vocabulary in the target text as a correct vocabulary.

In an alternative implementation, the method further includes:

if the non-proprietary vocabulary does not exist in the custom vocabulary set, acquiring the pinyin of the non-proprietary vocabulary;

determining candidate vocabulary for the non-proprietary vocabulary according to the pinyin;

determining the other words in the target text except the non-proprietary vocabulary, and determining the semantic smoothness of a reference text composed of the other words and the candidate vocabulary;

if the semantic smoothness is greater than or equal to a preset smoothness, replacing the non-proprietary vocabulary with the candidate vocabulary in the target text;

and if the semantic smoothness is smaller than the preset smoothness, determining the non-proprietary vocabulary in the target text as correct vocabulary.

In an alternative implementation, the determining candidate vocabulary for the non-proprietary vocabulary according to the pinyin includes:

acquiring the pinyin of the vocabulary adjacent to the non-proprietary vocabulary in the target text;

combining the pinyin of the non-proprietary vocabulary and the pinyin of the adjacent vocabulary into a pinyin string according to the order of the vocabulary in the target text;

searching, in a second correspondence between pinyin strings and vocabulary strings, for the vocabulary string corresponding to the pinyin string;

and determining the candidate vocabulary in the vocabulary string.

In an alternative implementation, the method further includes:

if the non-proprietary vocabulary does not exist in the custom vocabulary set, acquiring the Chinese character code of the non-proprietary vocabulary;

and determining candidate vocabulary for the non-proprietary vocabulary according to the Chinese character code.

In an optional implementation manner, the determining candidate vocabulary for the non-proprietary vocabulary according to the Chinese character code includes:

searching, in a third correspondence between Chinese character codes and vocabulary, for Chinese character codes whose similarity to the Chinese character code of the non-proprietary vocabulary is greater than a preset similarity;

and searching the third correspondence for the vocabulary corresponding to the found Chinese character codes, and taking that vocabulary as the candidate vocabulary.

In an alternative implementation, the method further includes:

if the non-proprietary vocabulary does not exist in the custom vocabulary set, acquiring visually similar vocabulary (vocabulary that is close in shape) of the non-proprietary vocabulary and using it as the candidate vocabulary.

In an optional implementation manner, the acquiring the visually similar vocabulary of the non-proprietary vocabulary includes:

taking the non-proprietary vocabulary as a basic vocabulary, and searching, in a fourth correspondence between basic vocabulary and its visually similar vocabulary, for the visually similar vocabulary corresponding to the non-proprietary vocabulary.

In an alternative implementation, the method further includes:

if the semantic smoothness of the reference text is greater than or equal to the preset smoothness, acquiring the semantic smoothness of the target text;

and if the difference between the semantic smoothness of the reference text and the semantic smoothness of the target text is greater than a preset difference, executing the step of replacing the non-proprietary vocabulary with the candidate vocabulary in the target text.

In a second aspect, embodiments of the present application show a search method, the method including:

acquiring a search keyword input in a search box;

identifying the special vocabulary in the search keyword by using a preset special vocabulary recognition model;

correcting errors in the vocabulary other than the special vocabulary in the search keyword;

and searching by using the search keywords after error correction.

In a third aspect, embodiments of the present application show a text processing apparatus, the apparatus comprising:

the first acquisition module is used for acquiring target texts;

the second acquisition module is used for acquiring a preset special vocabulary recognition model which is trained based on the conditional random field model and a preset special vocabulary library;

the first recognition module is used for recognizing the special vocabulary in the target text by using the preset special vocabulary recognition model;

and the first determining module is used for determining the determined special vocabulary in the target text as the correct vocabulary.

In an alternative implementation, the first identification module includes:

the detection unit is used for detecting whether a traditional Chinese character exists in the target text;

a modifying unit, configured to convert a traditional Chinese character in the target text into the corresponding simplified Chinese character if the traditional Chinese character exists in the target text;

and the first determining unit is used for determining the special vocabulary in the converted target text by using the preset special vocabulary recognition model.

In an alternative implementation, the apparatus further includes:

the second determining module is used for determining the vocabularies except the special vocabularies in the target text as non-special vocabularies, wherein the non-special vocabularies comprise wrong vocabularies;

the first searching module is used for searching whether the wrong vocabulary which is the same as the non-proprietary vocabulary exists in a first corresponding relation between the wrong vocabulary and the correct vocabulary;

the second searching module is used for searching correct vocabulary corresponding to the wrong vocabulary in the first corresponding relation if the wrong vocabulary which is the same as the non-proprietary vocabulary exists in the first corresponding relation;

and the first replacing module is used for replacing the non-proprietary vocabulary with the correct vocabulary in the target text.

In an alternative implementation, the apparatus further includes:

a third determining module, configured to determine a user who inputs the target text if there is no wrong vocabulary identical to the non-proprietary vocabulary in the first correspondence;

the first acquisition module is used for acquiring a custom vocabulary set of the user, wherein the custom vocabulary set stores correct vocabularies set by the user;

the third searching module is used for searching whether the non-proprietary vocabulary exists in the custom vocabulary set;

and a fourth determining module, configured to determine the non-private vocabulary in the target text as a correct vocabulary if the non-private vocabulary exists in the custom vocabulary set.

In an alternative implementation, the apparatus further includes:

the second acquisition module is used for acquiring the pinyin of the non-proprietary vocabulary if the non-proprietary vocabulary does not exist in the custom vocabulary set;

a fifth determining module, configured to determine candidate vocabularies of the non-private vocabulary according to the pinyin;

a sixth determining module, configured to determine the other words in the target text except the non-proprietary vocabulary, and determine the semantic smoothness of a reference text composed of the other words and the candidate vocabulary;

the second replacing module is used for replacing the non-proprietary vocabulary with the candidate vocabulary in the target text if the semantic smoothness is greater than or equal to a preset smoothness;

and a seventh determining module, configured to determine the non-proprietary vocabulary in the target text as correct vocabulary if the semantic smoothness is less than the preset smoothness.

In an alternative implementation, the fifth determining module includes:

the acquisition unit is used for acquiring pinyin of words adjacent to the non-private word in the target text;

the combination unit is used for combining the pinyin of the non-private vocabulary and the pinyin of the adjacent vocabulary into a pinyin string according to the sequence of the vocabulary in the target text;

the first searching unit is used for searching the vocabulary string corresponding to the pinyin string in the second corresponding relation between the pinyin string and the vocabulary string;

and the second determining unit is used for determining candidate vocabularies in the vocabulary string.

In an alternative implementation, the apparatus further includes:

the third acquisition module is used for acquiring Chinese character codes of the non-proprietary vocabulary if the non-proprietary vocabulary does not exist in the custom vocabulary set;

an eighth determining module, configured to determine candidate vocabularies of the non-private vocabularies according to the Chinese character codes;

the sixth determining module is further configured to determine the other words in the target text except the non-proprietary vocabulary, and determine the semantic smoothness of a reference text composed of the other words and the candidate vocabulary;

the second replacing module is further configured to replace the non-proprietary vocabulary with the candidate vocabulary in the target text if the semantic smoothness is greater than or equal to a preset smoothness;

and the seventh determining module is further configured to determine the non-private vocabulary in the target text as a correct vocabulary if the semantic smoothness is less than a preset smoothness.

In an alternative implementation, the eighth determining module includes:

the second searching unit is used for searching Chinese character codes with similarity larger than preset similarity in a third corresponding relation between the Chinese character codes and the vocabulary;

and the third searching unit is used for searching the vocabulary corresponding to the determined Chinese character codes in the third corresponding relation and taking the vocabulary as a candidate vocabulary.

In an alternative implementation, the apparatus further includes:

a fourth obtaining module, configured to obtain visually similar vocabulary (vocabulary that is close in shape) of the non-proprietary vocabulary and use it as the candidate vocabulary if the non-proprietary vocabulary does not exist in the custom vocabulary set.

In an optional implementation manner, the fourth obtaining module is specifically configured to: take the non-proprietary vocabulary as a basic vocabulary, and search, in a fourth correspondence between basic vocabulary and its visually similar vocabulary, for the visually similar vocabulary corresponding to the non-proprietary vocabulary.

In an alternative implementation, the apparatus further includes:

a fifth obtaining module, configured to obtain the semantic smoothness of the target text if the semantic smoothness is greater than or equal to a preset smoothness;

the second replacing module is further configured to replace the non-proprietary vocabulary with the candidate vocabulary in the target text if the difference between the semantic smoothness of the reference text and the semantic smoothness of the target text is greater than a preset difference.

In a fourth aspect, embodiments of the present application show a search apparatus, the apparatus comprising:

a sixth acquisition module for acquiring a search keyword input in a search box;

a seventh acquisition module, configured to acquire a preset private vocabulary recognition model that is trained based on the conditional random field model and a preset private vocabulary library;

the second recognition module is used for recognizing the special vocabulary in the search keywords by using the preset special vocabulary recognition model;

the error correction module is used for correcting the vocabulary except the special vocabulary in the search keywords;

and the search module is used for searching by using the search keywords after error correction.

In a fifth aspect, embodiments of the present application show an electronic device, including:

a processor; and

a memory having executable code stored thereon that, when executed, causes the processor to perform the text processing method as described in the first aspect.

In a sixth aspect, embodiments of the present application show one or more machine-readable media having executable code stored thereon that, when executed, cause a processor to perform the text processing method of the first aspect.

In a seventh aspect, embodiments of the present application show an electronic device, including:

a processor; and

a memory having executable code stored thereon that, when executed, causes the processor to perform the search method as described in the second aspect.

In an eighth aspect, embodiments of the present application show one or more machine-readable media having stored thereon executable code that, when executed, causes a processor to perform a search method as described in the second aspect.

Compared with the prior art, the embodiment of the application has the following advantages:

In the application, a target text is acquired; a preset special vocabulary recognition model, trained based on a conditional random field model and a preset special vocabulary library, is acquired; the special vocabulary in the target text is recognized by using the preset special vocabulary recognition model; and the special vocabulary identified in the target text is determined to be correct vocabulary.

Because the preset special vocabulary recognition model is trained based on a conditional random field model, it can use the semantic context of the target text to identify, as special vocabulary, words that do not belong to the preset special vocabulary library but do not need to be corrected. Words that do not need to be corrected are correct words, so the special vocabulary identified in the target text can be determined to be correct vocabulary. The prior art can only determine whether a word that is in the special vocabulary library is correct; the application can also determine whether a word that is not in the special vocabulary library is correct.

Drawings

Fig. 1 is a flow chart illustrating a text processing method according to an exemplary embodiment.

Fig. 2 is a flow chart illustrating a text processing method according to an exemplary embodiment.

Fig. 3 is a flow chart illustrating a text processing method according to an exemplary embodiment.

Fig. 4 is a flowchart illustrating a text processing method according to an exemplary embodiment.

Fig. 5 is a flowchart illustrating a text processing method according to an exemplary embodiment.

Fig. 6 is a flowchart illustrating a text processing method according to an exemplary embodiment.

Fig. 7 is a flow chart illustrating a search method according to an exemplary embodiment.

Fig. 8 is a block diagram illustrating a text processing device according to an exemplary embodiment.

Fig. 9 is a block diagram of a search apparatus according to an exemplary embodiment.

Fig. 10 is a block diagram of a text processing device according to an exemplary embodiment.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings.

Fig. 1 is a flowchart illustrating a text processing method according to an exemplary embodiment. The method is used in an electronic device, such as a terminal or a server, and, as shown in Fig. 1, includes the following steps.

In step S101, a target text is acquired.

In the present application, the target text includes text input by the user on the electronic device, for example a text message that the user inputs during a conversation with a friend and that is to be sent to the friend, or text downloaded by the electronic device from a network. The application does not limit this.

In step S102, a preset private vocabulary recognition model trained based on the conditional random field model and the preset private vocabulary library is acquired.

In the application, a sample text set can be obtained in advance. The sample text set comprises a plurality of sample texts in which the special vocabulary is annotated, and the annotated special vocabulary may be located in the preset special vocabulary library. The sample texts in the sample text set are then used to train a conditional random field model. In each training round, the model can take the semantic context of the sample text into account; the special vocabulary that the conditional random field model predicts in the sample text is checked manually to determine whether it is vocabulary that needs no correction, and the check result is applied to the next training round so as to keep adjusting the parameters of the conditional random field model until they converge. The resulting model is the preset special vocabulary recognition model, which is then stored.
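As a rough illustration of how annotated sample texts for a conditional random field might be prepared, the sketch below labels each character of a sample text with B/I/O tags using entries from a special vocabulary library. The B/I/O scheme, the function name `bio_tags`, and the sample data are assumptions made for illustration; the application does not specify the annotation format.

```python
from typing import List, Tuple


def bio_tags(text: str, special_words: List[str]) -> List[Tuple[str, str]]:
    """Label each character of `text` as B (begin), I (inside) or O (outside).

    Hypothetical helper: it marks every occurrence of a library word in the
    text, which is only one possible way to produce annotated samples.
    """
    tags = ["O"] * len(text)
    # Longer entries are handled first so they win over their own substrings.
    for word in sorted(special_words, key=len, reverse=True):
        if not word:
            continue  # ignore empty library entries
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            # Only tag spans that have not been claimed by a longer word.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(word, start + 1)
    return list(zip(text, tags))


if __name__ == "__main__":
    for char, tag in bio_tags("我在杭州工作", ["杭州"]):
        print(char, tag)
```

Sequences of (character, tag) pairs of this kind are a common input format for sequence-labelling models; the training loop itself is outside the scope of this sketch.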

Thus, in this step, the stored preset private vocabulary recognition model may be directly acquired, and then step S103 is performed.

In step S103, the private vocabulary in the target text is recognized using the preset private vocabulary recognition model.

In the application, the target text can be input into a preset special vocabulary recognition model to obtain the special vocabulary output by the preset special vocabulary recognition model.

In step S104, the determined private vocabulary in the target text is determined as the correct vocabulary.
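Steps S103 and S104 can be sketched as follows, under the assumption that the recognition model emits one B/I/O tag per character of the target text. The tag scheme and the function name `extract_special_words` are hypothetical; the application only states that the model outputs the special vocabulary.

```python
from typing import List, Sequence


def extract_special_words(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect the words spanned by B/I tags; these are treated as correct."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "B":
            # A new special word starts; close any word still open.
            if current:
                words.append(current)
            current = ch
        elif tag == "I" and current:
            current += ch
        else:
            # An O tag (or a stray I tag) ends the current word.
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    # Tags as they might be produced by a trained model (assumed output).
    print(extract_special_words("我在杭州工作", ["O", "O", "B", "I", "O", "O"]))
```

Every word returned here would be marked as correct vocabulary in step S104; the rest of the text is left for the later checks.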

In the present application, a target text is acquired. And acquiring a preset special vocabulary recognition model which is trained based on the conditional random field model and a preset special vocabulary library. And identifying the special vocabulary in the target text by using a preset special vocabulary identification model. And determining the determined special vocabulary in the target text as the correct vocabulary.

In practice, Chinese characters include simplified characters and traditional characters. In most cases people use simplified characters and rarely use traditional characters, so when the conditional random field model is trained, the characters in all sample texts in the sample text set are often simplified characters. As a result, the preset special vocabulary recognition model can only recognize simplified characters.

However, the target text may sometimes include traditional Chinese characters. Because the preset special vocabulary recognition model cannot recognize traditional Chinese characters, it may determine a word that contains a traditional Chinese character to be non-proprietary vocabulary. That is, a word that needs no correction, and is therefore correct, may fail to be determined as correct vocabulary.

Wherein the non-proprietary vocabulary includes a vocabulary in the target text other than the determined proprietary vocabulary.

To avoid this, in another embodiment of the present application, in step S103, it may be detected whether a traditional Chinese character exists in the target text, if the traditional Chinese character exists in the target text, the traditional Chinese character in the target text is converted into a corresponding simplified Chinese character, and then the private vocabulary in the converted target text is determined using a preset private vocabulary recognition model.

For example, a correspondence between traditional Chinese characters and simplified Chinese characters may be set in advance, in which a first column stores all traditional Chinese characters and a second column stores the simplified Chinese character corresponding to each traditional Chinese character. For any Chinese character in the target text, the first column of the correspondence is searched for that character. If the character is found, it is a traditional Chinese character; the corresponding simplified Chinese character is then looked up in the second column and used to replace the character in the target text. The same operation is performed for every other Chinese character in the target text.
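The table lookup described above can be sketched with a dictionary whose keys play the role of the first column and whose values play the role of the second column. The table below holds only a handful of entries for illustration, and the names `TRAD_TO_SIMP`, `has_traditional`, and `to_simplified` are assumptions, not part of the application.

```python
from typing import Dict

# Tiny illustrative excerpt of a traditional-to-simplified correspondence.
TRAD_TO_SIMP: Dict[str, str] = {
    "愛": "爱",
    "國": "国",
    "學": "学",
    "習": "习",
    "書": "书",
}


def has_traditional(text: str, table: Dict[str, str] = TRAD_TO_SIMP) -> bool:
    """Return True if any character of `text` appears in the first column."""
    return any(ch in table for ch in text)


def to_simplified(text: str, table: Dict[str, str] = TRAD_TO_SIMP) -> str:
    """Replace each traditional character with its simplified counterpart."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    target = "我愛學習"
    if has_traditional(target):
        target = to_simplified(target)
    print(target)
```

Characters that are not in the table, including characters that are already simplified, pass through unchanged.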

Further, the target text often includes a plurality of words. Some of them are determined to be correct vocabulary in the embodiment shown in Fig. 1; the others are not, and these are the non-proprietary vocabulary.

The embodiment shown in Fig. 1 does not determine whether the non-proprietary vocabulary is correct, so this needs to be determined by other means. Specifically, referring to Fig. 2, the method further includes:

In step S201, the vocabulary other than the special vocabulary in the target text is determined as the non-proprietary vocabulary, and the non-proprietary vocabulary includes wrong vocabulary.

In step S202, the first correspondence between wrong vocabulary and correct vocabulary is searched for a wrong vocabulary identical to the non-proprietary vocabulary.

A technician can collect in advance the wrong vocabulary that users at large have been prone to input historically, determine the correct vocabulary corresponding to each such wrong vocabulary, form each wrong vocabulary and its correct vocabulary into a table entry, and store the entry in the preset first correspondence between wrong vocabulary and correct vocabulary.

Thus, in this step, in order to determine whether the non-proprietary vocabulary is correct, it can first be determined whether it is wrong vocabulary; if it is wrong vocabulary, it is not correct vocabulary. To this end, the preset first correspondence between wrong vocabulary and correct vocabulary is searched for a wrong vocabulary identical to the non-proprietary vocabulary.

If a wrong vocabulary identical to the non-proprietary vocabulary exists in the first correspondence, in step S203, the correct vocabulary corresponding to the wrong vocabulary is looked up in the first correspondence.

In this application, if a wrong vocabulary identical to the non-proprietary vocabulary exists in the first correspondence, this indicates that the non-proprietary vocabulary is a wrong vocabulary that many users have been prone to input historically, so the non-proprietary vocabulary in the target text is likely also a wrong vocabulary input by the user. It can therefore be determined that the non-proprietary vocabulary is wrong, that is, that it is not correct vocabulary. The non-proprietary vocabulary in the target text can then be corrected: the correct vocabulary corresponding to the wrong vocabulary is looked up in the first correspondence, and step S204 is performed.

In step S204, the non-proprietary vocabulary is replaced with the correct vocabulary in the target text.
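Steps S201 to S204 can be sketched as a dictionary lookup over the segmented target text. The entries in the error table, the segmentation, and the function name `correct_known_errors` are illustrative assumptions.

```python
from typing import Dict, List, Set


def correct_known_errors(
    words: List[str],
    special_words: Set[str],
    error_to_correct: Dict[str, str],
) -> List[str]:
    """Replace non-proprietary words that appear in the first correspondence."""
    result: List[str] = []
    for word in words:
        is_non_proprietary = word not in special_words        # step S201
        if is_non_proprietary and word in error_to_correct:   # step S202
            result.append(error_to_correct[word])             # steps S203, S204
        else:
            result.append(word)
    return result


if __name__ == "__main__":
    # Hypothetical first correspondence: wrong vocabulary -> correct vocabulary.
    first_correspondence = {"因该": "应该"}
    segmented = ["我", "因该", "去", "杭州"]
    print(correct_known_errors(segmented, {"杭州"}, first_correspondence))
```

Words recognized as special vocabulary are never replaced, even if they happen to appear as keys in the error table.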

In the embodiment shown in fig. 2, if there is no wrong vocabulary identical to the non-proprietary vocabulary in the first correspondence relationship, it cannot be determined whether the non-proprietary vocabulary is the correct vocabulary. It is therefore also necessary to continue to determine by other means whether the non-proprietary vocabulary is the correct vocabulary, in particular, see fig. 3, the method further comprising:

If there is no wrong vocabulary identical to the non-proprietary vocabulary in the first correspondence, in step S301, the user who input the target text is determined.

In the application, before the user inputs the target text on the electronic device, the user account of the user needs to be input on the electronic device first, and the user logs in the background server through the user account of the user, so that the electronic device can determine the user who inputs the target text according to the user account logged in the background server.

In step S302, a user-defined vocabulary set of the user is obtained, and correct vocabularies set by the user are stored in the user-defined vocabulary set.

In this application, some vocabulary is not widely used by users at large but may nevertheless be used frequently by a small group of users, for example "盒马生鲜" (Hema Fresh). Any user can therefore set in advance the vocabulary that he or she commonly uses and considers correct and in no need of correction, and the vocabulary so set forms that user's custom vocabulary set.

Therefore, whether the non-proprietary vocabulary exists can be searched in the user-defined vocabulary set. If the non-private vocabulary exists in the custom vocabulary set, the non-private vocabulary in the target text can be determined to be the correct vocabulary.

In step S303, a search is made in the custom vocabulary set for the presence of non-proprietary vocabulary.

If the non-private vocabulary exists in the custom vocabulary set, in step S304, the non-private vocabulary in the target text is determined to be the correct vocabulary.
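A minimal sketch of steps S301 to S304, assuming the custom vocabulary sets are stored per user account in a dictionary. The account identifiers, the stored entries, and the function name `is_user_approved` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical store: user account -> custom vocabulary set of that user.
CUSTOM_VOCABULARY: Dict[str, Set[str]] = {
    "user_a": {"盒马生鲜"},
}


def is_user_approved(
    user_id: str,
    word: str,
    store: Dict[str, Set[str]] = CUSTOM_VOCABULARY,
) -> bool:
    """Return True if `word` is in the custom vocabulary set of `user_id`."""
    # A user without a custom set is treated as having an empty set.
    return word in store.get(user_id, set())


if __name__ == "__main__":
    print(is_user_approved("user_a", "盒马生鲜"))
    print(is_user_approved("user_b", "盒马生鲜"))
```

A word approved this way is kept as is; a word that is not approved moves on to the candidate-based checks.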

In the embodiment shown in fig. 3, if the non-private vocabulary does not exist in the custom vocabulary set, it cannot be determined whether the non-private vocabulary is the correct vocabulary. It is therefore also necessary to continue to determine by other means whether the non-proprietary vocabulary is the correct vocabulary, in particular, see fig. 4, the method further comprising:

If the non-proprietary vocabulary does not exist in the custom vocabulary set, then in step S401 the pinyin of the non-proprietary vocabulary is obtained.

In this step, each Chinese character in the non-proprietary vocabulary is determined, the pinyin of each character is looked up in a correspondence between Chinese characters and their pinyin, and the pinyin found is concatenated in the order in which the characters appear in the non-proprietary vocabulary, giving the pinyin of the non-proprietary vocabulary.
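The per-character lookup of step S401 might look like the sketch below. The table `CHAR_TO_PINYIN` is a tiny hypothetical sample of toneless pinyin, and `word_to_pinyin` is an illustrative name; a real correspondence would cover the full character set and would need a policy for characters with several readings.

```python
# Hypothetical sample of the correspondence between Chinese characters and pinyin.
CHAR_TO_PINYIN = {
    "今": "jin", "天": "tian",
    "太": "tai", "痒": "yang", "阳": "yang",
    "升": "sheng", "起": "qi",
}


def word_to_pinyin(word: str) -> str:
    """Concatenate the pinyin of each character in the order the characters appear."""
    # A character missing from the table raises KeyError in this sketch.
    return "".join(CHAR_TO_PINYIN[ch] for ch in word)
```

With this table, a wrongly typed word and the intended word can share the same pinyin, which is what the candidate search in step S402 relies on.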

In step S402, candidate words for the non-private word are determined based on the pinyin.

This step may be implemented through the following sub-steps:

4021. Acquire the pinyin of the vocabulary adjacent to the non-proprietary vocabulary in the target text.

The manner of obtaining the pinyin of the vocabulary adjacent to the non-private vocabulary in the target text may refer to step S401, which is not described in detail herein.

4022. Combine the pinyin of the non-proprietary vocabulary and the pinyin of the adjacent vocabulary into a pinyin string, in the order in which the vocabularies appear in the target text.

The target text comprises a plurality of words, including the non-proprietary word. If the non-proprietary word is the first word in the target text, the words adjacent to it include the second word in the target text; if it is the last word in the target text, the words adjacent to it include the penultimate word in the target text.

If the non-proprietary word is neither the first nor the last word in the target text, the words adjacent to it include the two words immediately to its left and right in the target text.

For example, assume that the target text is "today too itchy rises very rapidly", which includes the words "today", "too itchy", "rises", "very" and "rapidly". Assuming that the non-proprietary vocabulary is "too itchy", the words adjacent to it are "today" and "rises".

The pinyin of the non-proprietary vocabulary "too itchy" (太痒) is "taiyang", and the pinyin of the adjacent words "today" and "rises" is "jintian" and "shengqi", respectively.

The pinyin "jintian" of "today", the pinyin "taiyang" of the non-proprietary vocabulary and the pinyin "shengqi" of "rises" are combined into the pinyin string "jintiantaiyangshengqi".

4023. Search for the vocabulary string corresponding to the pinyin string in a second correspondence between pinyin strings and vocabulary strings.

Technicians may collect in advance the pinyin strings that users at large have frequently input in the past and determine the vocabulary strings corresponding to each frequently input pinyin string; one pinyin string may correspond to one or more vocabulary strings. Each frequently input pinyin string and each of its vocabulary strings then form an entry, and the entries are stored in the second correspondence between pinyin strings and vocabulary strings.

Thus, in this step, the vocabulary string corresponding to the pinyin string may be found in the second correspondence between the pinyin string and the vocabulary string.

4024. Determine the candidate vocabulary in the vocabulary string.

In the present application, the position of the pinyin of the non-private vocabulary in the pinyin string may be determined, and then the vocabulary at the position in the vocabulary string may be obtained and used as a candidate vocabulary.
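Sub-steps 4021 to 4024 can be sketched as follows. The sketch assumes, purely for illustration, that each vocabulary string in the second correspondence is stored as a tuple of words, so that "the vocabulary at the position" is a simple index; the names `adjacent_span`, `candidates_from_pinyin`, `word_pinyin` and `second_correspondence` are hypothetical.

```python
def adjacent_span(words, i):
    """Return (start, end) indices covering words[i] and its immediate neighbours."""
    return max(i - 1, 0), min(i + 1, len(words) - 1)


def candidates_from_pinyin(words, i, word_pinyin, second_correspondence):
    """Return candidate words for words[i] found through the pinyin string."""
    start, end = adjacent_span(words, i)
    window = words[start:end + 1]
    # 4021 and 4022: join the pinyin of the neighbours and the target in text order.
    pinyin_string = "".join(word_pinyin[w] for w in window)
    # Position of the non-proprietary word inside the window.
    position = i - start
    candidates = []
    # 4023: look up every vocabulary string stored for this pinyin string.
    for vocabulary_string in second_correspondence.get(pinyin_string, []):
        # 4024: take the word at the same position as the candidate.
        candidate = vocabulary_string[position]
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


# Illustrative data only.
words = ["今天", "太痒", "升起", "很", "快"]
word_pinyin = {"今天": "jintian", "太痒": "taiyang", "升起": "shengqi"}
second_correspondence = {"jintiantaiyangshengqi": [("今天", "太阳", "升起")]}
```

Storing vocabulary strings pre-segmented avoids having to map a pinyin offset back to a character offset, at the cost of a slightly larger table.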

In step S403, the words other than the non-proprietary vocabulary in the target text are determined, and the semantic smoothness of a reference text composed of those other words and the candidate vocabulary is determined.

In this application, the candidate vocabulary may be used to replace the non-proprietary vocabulary in the target text to obtain the reference text, and the semantic smoothness of the reference text may then be calculated with a language-model toolkit such as KenLM or SRILM.
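To make the notion of semantic smoothness concrete, the sketch below scores a word sequence with a toy bigram language model using add-one smoothing. It is a stand-in written for illustration and is not the toolkit named above; the class `BigramModel`, its method `smoothness` and the sample corpus are assumptions.

```python
import math
from collections import Counter


class BigramModel:
    """Toy bigram language model with add-one smoothing (illustrative stand-in)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Guard against an empty corpus so the denominator is never zero.
        self.vocab_size = len(self.unigrams) or 1

    def smoothness(self, words):
        """Average log-probability per transition; higher means smoother."""
        tokens = ["<s>"] + list(words) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(tokens) - 1)


# Illustrative corpus of segmented, correct texts.
model = BigramModel([["今天", "太阳", "升起"], ["太阳", "升起", "很", "快"]])
```

Dividing by the number of transitions keeps scores comparable across texts of different lengths, which matters when a single threshold is applied to all of them.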

In another embodiment, if a plurality of candidate vocabularies are determined in step S402, the semantic smoothness of the reference text composed of the other words and each candidate vocabulary may be determined separately, and the maximum semantic smoothness is then selected.

If the semantic smoothness is greater than or equal to the preset smoothness, the candidate vocabulary is used to replace the non-proprietary vocabulary in the target text in step S404.

In this application, a plurality of texts containing only correct vocabulary may be collected in advance and the semantic smoothness of each text obtained. The preset smoothness is then determined from these values, for example by taking the lowest semantic smoothness, or the average of at least two of the semantic smoothness values, as the preset smoothness.
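The two example ways of deriving the preset threshold from scores of known-correct texts can be sketched as below. The function name `preset_threshold` and the `strategy` parameter are hypothetical, and averaging over all scores is just one instance of averaging at least two of them.

```python
from statistics import mean


def preset_threshold(correct_scores, strategy="min"):
    """Derive the preset threshold from scores of texts containing only correct words."""
    if not correct_scores:
        raise ValueError("at least one score is required")
    if strategy == "min":
        # Use the lowest score observed on correct texts.
        return min(correct_scores)
    if strategy == "mean":
        # Use the average of the observed scores.
        return mean(correct_scores)
    raise ValueError(f"unknown strategy: {strategy}")
```

The minimum is the more permissive choice because every collected correct text passes it, whereas the mean rejects roughly the lower half of them.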

A plurality of texts containing wrong vocabulary may also be collected in advance and the semantic smoothness of each obtained. It is generally found that texts containing wrong vocabulary have a lower semantic smoothness, which tends to be below the preset smoothness, whereas texts containing no wrong vocabulary have a higher semantic smoothness, which tends to be above the preset smoothness.

Therefore, if the semantic smoothness is greater than or equal to the preset smoothness, the reference text does not contain wrong vocabulary. Although this alone does not establish that the non-proprietary vocabulary is wrong, the candidate vocabulary may be used to replace the non-proprietary vocabulary in the target text to ensure that the target text contains no wrong vocabulary.

If the semantic smoothness is less than the preset smoothness, in step S405, the non-proprietary vocabulary in the target text is determined to be the correct vocabulary.
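Steps S403 to S405, including the variant with several candidates, can be sketched as one function. The scoring function is passed in as a parameter so that any scorer can be used; `apply_correction`, `score` and `preset` are illustrative names, not terms from this application.

```python
def apply_correction(words, i, candidates, score, preset):
    """Return the word list after trying to correct words[i]."""
    best_reference, best_score = None, float("-inf")
    for candidate in candidates:
        # S403: build the reference text from the other words plus the candidate.
        reference = words[:i] + [candidate] + words[i + 1:]
        current = score(reference)
        # With several candidates, keep the reference text with the highest score.
        if current > best_score:
            best_reference, best_score = reference, current
    if best_reference is not None and best_score >= preset:
        # S404: the candidate replaces the non-proprietary word.
        return best_reference
    # S405: the original word is kept and treated as correct.
    return list(words)
```

Returning a new list in both branches means the caller can always compare the result with the input to see whether a replacement took place.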

In the embodiment shown in fig. 3, if the non-proprietary vocabulary does not exist in the custom vocabulary set, it cannot be determined whether the non-proprietary vocabulary is the correct vocabulary. It is therefore necessary to determine by other means whether the non-proprietary vocabulary is the correct vocabulary. Specifically, referring to fig. 5, the method further comprises:

If the non-private vocabulary does not exist in the custom vocabulary set, in step S501, a Chinese character code of the non-private vocabulary is obtained.

In the step, each Chinese character in the non-proprietary vocabulary can be determined, the Chinese character codes corresponding to each Chinese character respectively are searched in the corresponding relation between the Chinese characters and the Chinese character codes of the Chinese characters, and then the searched Chinese character codes are combined according to the position sequence of the corresponding Chinese characters in the non-proprietary vocabulary to obtain the Chinese character codes of the non-proprietary vocabulary.

The Chinese character code may be, for example, a Wubi (five-stroke) code.

In step S502, candidate vocabulary for the non-proprietary vocabulary is determined according to the Chinese character code.

In a third correspondence between Chinese character codes and vocabulary, Chinese character codes whose similarity to the Chinese character code of the non-proprietary vocabulary is greater than a preset similarity are searched for. The vocabulary corresponding to the Chinese character codes so determined is then looked up in the third correspondence and used as the candidate vocabulary.

The similarity between Chinese character codes may be determined according to the edit distance between them.
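One possible reading of step S502 is sketched below, using the Levenshtein edit distance and converting it to a similarity in the range 0 to 1. The normalisation by the longer code length is an assumption, as are the names `edit_distance`, `code_similarity` and `candidates_from_codes`; the codes in the usage note are placeholders and not real Wubi codes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def code_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def candidates_from_codes(code, third_correspondence, preset_similarity):
    """Return words whose stored code is more similar to `code` than the preset."""
    return [
        word
        for other_code, word in third_correspondence.items()
        if code_similarity(code, other_code) > preset_similarity
    ]
```

For instance, with the placeholder table `{"abce": "W1", "wxyz": "W2"}` and a preset similarity of 0.5, the code `"abcd"` selects only `"W1"`, since it differs from `"abce"` by a single substitution.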

In step S503, the words other than the non-proprietary vocabulary in the target text are determined, and the semantic smoothness of a reference text composed of those other words and the candidate vocabulary is determined.

If the semantic smoothness is greater than or equal to the preset smoothness, the candidate vocabulary is used to replace the non-proprietary vocabulary in the target text in step S504.

If the semantic smoothness is less than the preset smoothness, in step S505, the non-proprietary vocabulary in the target text is determined to be the correct vocabulary.

For the specific implementation of steps S503 to S505, reference may be made to steps S403 to S405, which is not described in detail herein.

In the embodiment shown in fig. 3, if the non-proprietary vocabulary does not exist in the custom vocabulary set, it cannot be determined whether the non-proprietary vocabulary is the correct vocabulary. It is therefore necessary to determine by other means whether the non-proprietary vocabulary is the correct vocabulary. Specifically, referring to fig. 6, the method further comprises:

If the non-proprietary vocabulary does not exist in the custom vocabulary set, then in step S601 the shape-near vocabulary of the non-proprietary vocabulary, that is, vocabulary that looks similar to it, is obtained and used as the candidate vocabulary.

In this step, the non-proprietary vocabulary may be used as a basic vocabulary, and the shape-near vocabulary corresponding to it may be looked up in a fourth correspondence between basic vocabularies and their shape-near vocabularies.

In step S602, the words other than the non-proprietary vocabulary in the target text are determined.

In step S603, the semantic smoothness of a reference text composed of the other words and the candidate vocabulary is determined.

If the semantic smoothness is greater than or equal to the preset smoothness, the candidate vocabulary is used to replace the non-proprietary vocabulary in the target text in step S604.

If the semantic smoothness is less than the preset smoothness, in step S605, the non-proprietary vocabulary in the target text is determined to be the correct vocabulary.

For the specific implementation of steps S602 to S605, reference may be made to steps S403 to S405, which is not described in detail herein.

In this application, texts containing wrong vocabulary that were frequently input in the past may be collected.

First, for any text containing wrong vocabulary, the semantic smoothness of the text is obtained; the text is then manually corrected so that the corrected text contains no wrong vocabulary, the semantic smoothness of the corrected text is obtained, and the difference between the semantic smoothness of the corrected text and that of the text before correction is calculated. The same operation is performed on every other text containing wrong vocabulary.

In this way, a plurality of differences between the semantic smoothness of the corrected text and that of the text before correction are obtained, and these differences are found to be often large.

Second, for any text containing wrong vocabulary, the semantic smoothness of the text is obtained; the text is then manually modified so that the modified text still contains wrong vocabulary, for example by replacing the wrong vocabulary in the text with another wrong vocabulary. The semantic smoothness of the modified text is obtained, and the difference between the semantic smoothness of the modified text and that of the text before modification is calculated. The same operation is performed on every other text containing wrong vocabulary.

In this way, a plurality of differences between the semantic smoothness of the modified text and that of the text before modification are obtained, and these differences are found to be small.

In this application, texts containing no wrong vocabulary that were frequently input in the past may also be collected.

Then, for any text containing no wrong vocabulary, the semantic smoothness of the text is obtained; the text is then manually modified so that the modified text still contains no wrong vocabulary, for example by replacing a correct vocabulary in the text with another correct vocabulary. The semantic smoothness of the modified text is obtained, and the difference between the semantic smoothness of the modified text and that of the text before modification is calculated. The same operation is performed on every other text containing no wrong vocabulary.

From the above three cases, it can be concluded that if a text containing wrong vocabulary is corrected into a text containing no wrong vocabulary, the difference between the semantic smoothness of the corrected text and that of the text before correction is often greater than a preset difference.

If a text containing wrong vocabulary is modified into a text that still contains wrong vocabulary, the difference between the semantic smoothness of the modified text and that of the text before modification is often smaller than the preset difference.

If a text containing no wrong vocabulary is modified into a text that still contains no wrong vocabulary, the difference between the semantic smoothness of the modified text and that of the text before modification is often smaller than the preset difference.

That is, if the difference between the semantic smoothness of the reference text and that of the target text is greater than the preset difference, the target text contains wrong vocabulary and the vocabulary in the reference text is correct, so the candidate vocabulary may be used to replace the non-proprietary vocabulary in the target text.

If the difference between the semantic smoothness of the reference text and that of the target text is less than or equal to the preset difference, the replacement would merely turn a text containing wrong vocabulary into a text containing other wrong vocabulary, or turn a text containing only correct vocabulary into another text containing only correct vocabulary and thereby possibly alter the meaning of the target text. In either case the replacement is unnecessary.

Therefore, in another embodiment of this application, if the semantic smoothness of the reference text is greater than or equal to the preset smoothness, the semantic smoothness of the target text may also be obtained. If the difference between the semantic smoothness of the reference text and that of the target text is greater than the preset difference, the candidate vocabulary is used to replace the non-proprietary vocabulary in the target text.
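The two-condition rule of this embodiment can be written as a small predicate. The function name `should_replace` and its parameter names are hypothetical, and the numeric values in the usage note are arbitrary illustrative scores.

```python
def should_replace(reference_score, target_score, preset_threshold, preset_difference):
    """Decide whether the candidate should replace the non-proprietary word."""
    # First condition: the reference text must reach the preset threshold.
    if reference_score < preset_threshold:
        return False
    # Second condition: the gain over the target text must exceed the preset difference.
    return (reference_score - target_score) > preset_difference
```

For example, `should_replace(-1.0, -4.0, -2.0, 1.5)` is true because the reference text passes the threshold and improves on the target text by 3.0, whereas `should_replace(-1.0, -1.2, -2.0, 1.5)` is false because the improvement of 0.2 is too small to indicate a real error in the target text.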

Fig. 7 is a flowchart of a search method according to an exemplary embodiment. As shown in fig. 7, the method is used in an electronic device such as a terminal or a server and includes the following steps.

In step S701, a search keyword input in a search box is acquired.

In this application, when a user needs to search, the user may input a search keyword, which includes at least one vocabulary, into a search box displayed on the screen of the electronic device, and the electronic device acquires the search keyword input by the user in the search box.

In step S702, a preset proprietary vocabulary recognition model, trained based on a conditional random field model and a preset proprietary vocabulary library, is obtained.

In this application, a sample search keyword set may be obtained in advance. The set includes a plurality of sample search keywords in which proprietary vocabulary has been annotated, and the annotated proprietary vocabulary may be located in the preset proprietary vocabulary library. The conditional random field model is then trained with the sample search keywords in the set. In each training round, the conditional random field model predicts the proprietary vocabulary in the sample search keywords in combination with the semantic environment of their context, whether the predicted proprietary vocabulary is vocabulary that does not need correction is checked manually, and the check result is applied to the next training round so that the parameters of the conditional random field model are continuously adjusted. Training continues until the parameters converge, which yields the preset proprietary vocabulary recognition model; the model is then stored.
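Before a conditional random field can be trained, the annotated samples have to be turned into label sequences and per-token features. The sketch below shows one common way of doing this, using B/I/O tags and a one-token context window. Both the tagging scheme and the feature set are assumptions made for illustration, since this application does not specify them; the names `bio_tags` and `token_features` are hypothetical.

```python
def bio_tags(tokens, proprietary_spans):
    """Label tokens B (begin), I (inside) or O (outside) from annotated spans.

    Each span is a (start, end) pair of token indices with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in proprietary_spans:
        tags[start] = "B"
        for k in range(start + 1, end):
            tags[k] = "I"
    return tags


def token_features(tokens, i):
    """Context features for token i: the token itself and its two neighbours."""
    return {
        "token": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
    }


# Illustrative sample: the annotated proprietary word covers tokens 1 to 4.
sample_tokens = list("去盒马生鲜买菜")
sample_tags = bio_tags(sample_tokens, [(1, 5)])
```

The neighbouring-token features are what let a sequence model use the context of a word, so that a word absent from the vocabulary library can still be labelled as proprietary when it appears in a familiar context.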

In step S703, the proprietary vocabulary in the search keyword is recognized using the preset proprietary vocabulary recognition model.

In this application, the search keyword may be input into the preset proprietary vocabulary recognition model to obtain the proprietary vocabulary output by the model.

In step S704, error correction is performed on the vocabulary other than the proprietary vocabulary in the search keyword.

For correcting errors in the vocabulary other than the proprietary vocabulary in the search keyword, reference may be made to the methods of the embodiments shown in figs. 2 to 6.

In step S705, a search is performed using the search keyword after error correction.

In this application, because the preset proprietary vocabulary recognition model is trained based on the conditional random field model, it can, according to the semantic environment of the context of the search keyword, identify as proprietary vocabulary those words in the search keyword that do not belong to the preset proprietary vocabulary library but do not need to be corrected. Vocabulary that does not need to be corrected is correct vocabulary, so the proprietary vocabulary identified in the search keyword can be determined to be correct vocabulary. The prior art can only determine whether vocabulary of the text that is in the proprietary vocabulary library is correct, whereas this application can also determine whether vocabulary of the text that is not in the proprietary vocabulary library is correct, so more correct vocabulary can be identified in the same text. Error correction is then performed only on the vocabulary other than the proprietary vocabulary in the search keyword, which avoids miscorrecting correct vocabulary and improves correction accuracy.
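The overall flow of steps S701 to S705 can be sketched as a function that takes the recognizer, the corrector and the search backend as parameters. All three callables and the name `search_with_correction` are hypothetical placeholders, and correction is simplified here to a per-word function even though the embodiments of figs. 2 to 6 also use the surrounding words.

```python
def search_with_correction(keyword_words, recognize, correct, search):
    """Run recognition, selective correction and search on a segmented keyword."""
    # S703: words the recognition model labels as proprietary are left untouched.
    proprietary = set(recognize(keyword_words))
    # S704: only the remaining words go through error correction.
    corrected = [
        word if word in proprietary else correct(word)
        for word in keyword_words
    ]
    # S705: search with the corrected keyword.
    return search(corrected)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or search engine, and makes the "skip proprietary words" rule easy to test in isolation.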

Fig. 8 is a block diagram of a text processing device according to an exemplary embodiment, as shown in fig. 8, the device including:

a first obtaining module 11, configured to obtain a target text;

a second obtaining module 12, configured to obtain a preset private vocabulary recognition model that is trained based on the conditional random field model and a preset private vocabulary library;

a first recognition module 13, configured to recognize a private vocabulary in the target text using the preset private vocabulary recognition model;

a first determining module 14, configured to determine the determined private vocabulary in the target text as a correct vocabulary.

In an alternative implementation, the first identification module 13 includes:

In an alternative implementation, the apparatus further includes:

In an alternative implementation, the fifth determining module includes:

In an alternative implementation, the apparatus further includes:

In an alternative implementation, the eighth determining module includes:

In an alternative implementation, the apparatus further includes:

Fig. 9 is a block diagram of a search apparatus according to an exemplary embodiment, as shown in fig. 9, the apparatus including:

a sixth acquisition module 21 for acquiring a search keyword input in a search box;

a seventh obtaining module 22, configured to obtain a preset private vocabulary recognition model that is trained based on the conditional random field model and a preset private vocabulary library;

a second recognition module 23, configured to recognize a proprietary vocabulary in the search keyword using the preset proprietary vocabulary recognition model;

an error correction module 24, configured to correct errors of words in the search keyword except the private word;

the searching module 25 is used for searching by using the search keywords after error correction.

The embodiment of the application also provides a non-volatile readable storage medium storing one or more modules (programs). When the one or more modules are applied to a device, the device can be caused to execute the instructions of each method step in the embodiments of the application.

Embodiments of the present application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform a text processing method as described in one or more of the above embodiments. In this embodiment of the present application, the electronic device includes a server, a gateway, a sub-device, and the sub-device is a device such as an internet of things device.

Embodiments of the present disclosure may be implemented as an apparatus for performing a desired configuration using any suitable hardware, firmware, software, or any combination thereof, which may include a server (cluster), a terminal device, such as an IoT device, or the like.

Fig. 10 schematically illustrates an example apparatus 1300 that may be used to implement various embodiments described herein.

For one embodiment, fig. 10 illustrates an example apparatus 1300 having one or more processors 1302, a control module (chipset) 1304 coupled to at least one of the processor(s) 1302, a memory 1306 coupled to the control module 1304, a non-volatile memory (NVM)/storage 1308 coupled to the control module 1304, one or more input/output devices 1310 coupled to the control module 1304, and a network interface 1312 coupled to the control module 1304.

The processor 1302 may include one or more single-core or multi-core processors, and the processor 1302 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1300 can be implemented as a server device such as a gateway or a controller as described in embodiments of the present application.

In some embodiments, the apparatus 1300 may include one or more computer-readable media (e.g., memory 1306 or NVM/storage 1308) having instructions 1314 and one or more processors 1302 combined with the one or more computer-readable media configured to execute the instructions 1314 to implement the modules to perform actions described in this disclosure.

For one embodiment, the control module 1304 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 1302 and/or any suitable device or component in communication with the control module 1304.

The control module 1304 may include a memory controller module to provide an interface to the memory 1306. The memory controller modules may be hardware modules, software modules, and/or firmware modules.

Memory 1306 may be used, for example, to load and store data and/or instructions 1314 for apparatus 1300. For one embodiment, memory 1306 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, memory 1306 may include double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).

For one embodiment, the control module 1304 may include one or more input/output controllers to provide interfaces to the NVM/storage 1308 and the input/output device(s) 1310.

For example, NVM/storage 1308 may be used to store data and/or instructions 1314. NVM/storage 1308 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., hard disk drive(s) (HDD), compact disk (CD) drive(s), and/or digital versatile disk (DVD) drive(s)).

NVM/storage 1308 may include storage resources that are physically part of the device on which apparatus 1300 is installed, or may be accessible by the device without necessarily being part of the device. For example, NVM/storage 1308 may be accessed over a network via input/output device(s) 1310.

Input/output device(s) 1310 may provide an interface for apparatus 1300 to communicate with any other suitable device, input/output device 1310 may include communication components, audio components, sensor components, and the like. The network interface 1312 may provide an interface for the device 1300 to communicate over one or more networks, and the device 1300 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, such as accessing a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.

For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic of one or more controllers (e.g., memory controller modules) of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic of one or more controllers of the control module 1304 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1302 may be integrated on the same die as logic of one or more controllers of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be integrated on the same die with logic of one or more controllers of the control module 1304 to form a system on chip (SoC).

In various embodiments, apparatus 1300 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1300 may have more or fewer components and/or different architectures. For example, in some embodiments, apparatus 1300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and a speaker.

The embodiment of the application provides electronic equipment, which comprises: one or more processors; and one or more machine readable media having instructions stored thereon, which when executed by the one or more processors, cause the processors to perform the text processing method as described in one or more of the embodiments of the present application.

For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.

In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the present application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.

The text processing method and device provided in the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific implementations and the scope of application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims

1. A method of text processing, the method comprising:

acquiring a target text;

determining the determined special vocabulary in the target text as a correct vocabulary;

the identifying the private vocabulary in the target text by using the preset private vocabulary identifying model comprises the following steps:

detecting whether a traditional Chinese character exists in the target text;

determining the special vocabulary in the converted target text by using the preset special vocabulary recognition model;

the method further comprises the steps of:

2. The method according to claim 1, wherein the method further comprises:

3. The method according to claim 2, wherein the method further comprises:

4. The method of claim 3, wherein determining candidate words for the non-special vocabulary from the pinyin comprises:

determining the candidate words in the word string.

5. The method according to claim 2, wherein the method further comprises:

6. The method of claim 5, wherein determining candidate words for the non-special vocabulary from the Chinese character encoding comprises:

7. The method according to claim 2, wherein the method further comprises:

if the non-special vocabulary does not exist in the custom vocabulary set, obtaining shape-near vocabulary of the non-special vocabulary and using it as candidate vocabulary;

8. The method of claim 7, wherein obtaining the shape-near vocabulary of the non-special vocabulary comprises:

9. The method according to any one of claims 3-8, further comprising:

10. A method of searching, the method comprising:

acquiring a search keyword input in a search box;

searching by using the error-corrected search keyword;

wherein identifying the special vocabulary in the search keyword by using the preset special vocabulary recognition model comprises:

detecting whether a traditional Chinese character exists in the search keyword;

if a traditional Chinese character exists in the search keyword, converting the traditional Chinese character in the search keyword into the corresponding simplified Chinese character;

determining the special vocabulary in the converted search keyword by using the preset special vocabulary recognition model;

the method further comprises:

determining vocabulary in the search keyword other than the special vocabulary to be non-special vocabulary, wherein the non-special vocabulary comprises erroneous vocabulary;

and replacing the non-special vocabulary in the search keyword with the correct vocabulary.

11. A text processing apparatus, the apparatus comprising:

a first acquisition module, configured to acquire a target text;

a first determining module, configured to determine the identified special vocabulary in the target text to be correct vocabulary;

the first identification module includes:

a first determining unit, configured to determine the special vocabulary in the converted target text by using the preset special vocabulary recognition model;

the apparatus further comprises:

12. The apparatus of claim 11, wherein the apparatus further comprises:

13. The apparatus of claim 12, wherein the apparatus further comprises:

14. The apparatus of claim 13, wherein the fifth determination module comprises:

15. The apparatus of claim 12, wherein the apparatus further comprises:

16. The apparatus of claim 15, wherein the eighth determination module comprises:

17. The apparatus of claim 12, wherein the apparatus further comprises:

a fourth acquisition module, configured to obtain shape-near vocabulary of the non-special vocabulary and use it as candidate vocabulary if the non-special vocabulary does not exist in the custom vocabulary set;

18. The apparatus of claim 17, wherein the fourth acquisition module is specifically configured to: take the non-special vocabulary as a basic vocabulary and search, in a fourth correspondence between basic vocabulary and the shape-near vocabulary of the basic vocabulary, for the shape-near vocabulary corresponding to the non-special vocabulary.

19. The apparatus according to any one of claims 13-18, wherein the apparatus further comprises:

20. A search apparatus, the apparatus comprising:

a search module, configured to search by using the error-corrected search keyword;

the second identification module includes:

a detection unit, configured to detect whether a traditional Chinese character exists in the search keyword;

a modification unit, configured to convert the traditional Chinese character in the search keyword into the corresponding simplified Chinese character if a traditional Chinese character exists in the search keyword;

a first determining unit, configured to determine the special vocabulary in the converted search keyword by using the preset special vocabulary recognition model;

the apparatus further comprises:

a second determining module, configured to determine vocabulary in the search keyword other than the special vocabulary to be non-special vocabulary, wherein the non-special vocabulary comprises erroneous vocabulary;

and a first replacing module, configured to replace the non-special vocabulary in the search keyword with the correct vocabulary.
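
For illustration only, the flow recited in the search method and search apparatus claims (detect traditional Chinese characters, convert them to simplified characters, identify the special (proprietary) vocabulary, treat the remaining vocabulary as non-special, and replace erroneous vocabulary) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the present application. The names `TRAD_TO_SIMP`, `has_traditional`, `to_simplified`, `correct_search_keyword`, `recognize_special` and `corrections` are hypothetical. The recognition model is represented by an arbitrary callable rather than a trained conditional random field model, the input is assumed to be already segmented into words, and the lookup table of corrections stands in for the candidate determination steps, which are not reproduced in this section.

```python
from typing import Callable, Dict, List, Set

# Hypothetical, deliberately tiny traditional-to-simplified table (assumption:
# a real system would use a complete character mapping).
TRAD_TO_SIMP: Dict[str, str] = {"圖": "图", "書": "书", "館": "馆"}


def has_traditional(text: str) -> bool:
    """Return True if any character of text is a known traditional character."""
    return any(ch in TRAD_TO_SIMP for ch in text)


def to_simplified(text: str) -> str:
    """Replace each known traditional character with its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def correct_search_keyword(
    words: List[str],
    recognize_special: Callable[[List[str]], Set[str]],
    corrections: Dict[str, str],
) -> List[str]:
    """Return the error-corrected word list for a segmented search keyword.

    recognize_special stands in for the preset recognition model (assumption:
    it returns the subset of words judged to be special vocabulary).
    corrections stands in for candidate selection (assumption: it maps an
    erroneous word directly to its correct form).
    """
    # Detect traditional characters and convert only when some are present.
    if any(has_traditional(w) for w in words):
        words = [to_simplified(w) for w in words]

    # Identify the special vocabulary in the converted text.
    special = recognize_special(words)

    corrected = []
    for w in words:
        if w in special:
            # Special vocabulary is treated as correct and left untouched.
            corrected.append(w)
        else:
            # Non-special vocabulary may be erroneous; replace it if a
            # correction is known, otherwise keep it unchanged.
            corrected.append(corrections.get(w, w))
    return corrected


if __name__ == "__main__":
    # Toy stand-in for the recognition model: membership in a fixed set.
    known_special = {"图书馆"}
    model = lambda ws: {w for w in ws if w in known_special}
    print(correct_search_keyword(["圖書館", "开们", "时间"], model, {"开们": "开门"}))
```

In this sketch a word that the model marks as special is never passed to the correction lookup, which mirrors the claimed ordering: special vocabulary is determined to be correct first, and only the remaining vocabulary is considered for replacement.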