CN111967248A

CN111967248A - Pinyin identification method and device, terminal equipment and computer readable storage medium

Info

Publication number: CN111967248A
Application number: CN202010656131.0A
Authority: CN
Inventors: 赵洋; 包荣鑫; 王宇; 魏世胜
Original assignee: Shenzhen Valueonline Technology Co ltd
Current assignee: Shenzhen Valueonline Technology Co ltd
Priority date: 2020-07-09
Filing date: 2020-07-09
Publication date: 2020-11-20

Abstract

The application is applicable to the technical field of data processing, and provides a pinyin identification method, a pinyin identification device, terminal equipment and a computer-readable storage medium, wherein the pinyin identification method comprises the following steps: dividing Chinese syllables of a pinyin sequence to be recognized to obtain a syllable sequence; acquiring a group of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and determining a target Chinese character corresponding to each Chinese syllable from a group of candidate characters corresponding to each Chinese syllable based on a preset statistical model, wherein the preset statistical model is used for representing the relevance between the candidate Chinese characters corresponding to each two Chinese syllables; and combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence, and recording the sentences as the recognition result of the pinyin sequence. By the method, the accuracy of pinyin identification can be effectively improved.

Description

Pinyin identification method and device, terminal equipment and computer readable storage medium

Technical Field

The present application belongs to the field of data processing technology, and in particular, relates to a pinyin identification method, a pinyin identification device, a terminal device, and a computer-readable storage medium.

Background

The pinyin identification technology is a method for identifying corresponding Chinese characters according to input pinyin. With the development of artificial intelligence, pinyin identification technology has been widely applied in the fields of input methods, speech recognition, text enhancement, machine translation, and the like.

Existing pinyin identification technology generally needs to construct a lexicon containing pinyin, words and weights, then combine the words, score each word combination according to the weights, and record the optimal word combination as the pinyin identification result. This approach requires the lexicon to be updated continuously; if the lexicon is not updated in time, the resulting word combination cannot express the true semantics of the original pinyin, which reduces the accuracy of the pinyin identification result.

Disclosure of Invention

The embodiment of the application provides a pinyin identification method, a pinyin identification device, terminal equipment and a computer readable storage medium, which can solve the problem of inaccurate pinyin identification result.

In a first aspect, an embodiment of the present application provides a pinyin identification method, including:

dividing Chinese syllables of a pinyin sequence to be recognized to obtain a syllable sequence;

acquiring a group of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and determining a target Chinese character corresponding to each Chinese syllable from a group of candidate characters corresponding to each Chinese syllable based on a preset statistical model, wherein the preset statistical model is used for representing the relevance between the candidate Chinese characters corresponding to each two Chinese syllables;

and combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence, and recording the sentences as the recognition result of the pinyin sequence.

In a possible implementation manner of the first aspect, dividing the pinyin sequence to be recognized into Chinese syllables to obtain a syllable sequence includes:

acquiring a preset dictionary tree, wherein each node in the preset dictionary tree corresponds to one pinyin character, and the child nodes of any one node correspond to different pinyin characters;

searching a first target node in the preset dictionary tree, wherein the pinyin character corresponding to the first target node is the same as the first pinyin character in the pinyin sequence;

and searching for second target nodes in the preset dictionary tree by taking the first target node as a parent node, wherein the r-th second target node corresponds to the (r+1)-th pinyin character in the pinyin sequence, 1 ≤ r ≤ R-1, and R is the number of the pinyin characters in the pinyin sequence.

In a possible implementation manner of the first aspect, the preset statistical model is a hidden markov model;

the hidden Markov model comprises an initial probability matrix, a state transition matrix and an observation matrix;

the initial probability matrix comprises a statistical probability value corresponding to each candidate Chinese character in a first Chinese character group, wherein the first Chinese character group is a group of candidate Chinese characters corresponding to a first Chinese syllable in the syllable sequence;

the state transition matrix comprises association probability values between candidate Chinese characters corresponding to every two adjacent Chinese syllables in the syllable sequence;

the observation matrix comprises the statistical probability value corresponding to the pronunciation of each candidate Chinese character.

In a possible implementation manner of the first aspect, the determining, based on a preset statistical model, a target chinese character corresponding to each chinese syllable from a set of candidate characters corresponding to each chinese syllable includes:

sequentially calculating the probability maximum value of each candidate Chinese character corresponding to each Chinese syllable according to the sequence of the Chinese syllables in the syllable sequence;

and determining the target Chinese character corresponding to each Chinese syllable according to the calculated probability maximum value.

In a possible implementation manner of the first aspect, the sequentially calculating a probability maximum value of each candidate chinese character corresponding to each chinese syllable according to an order of the chinese syllables in the syllable sequence includes:

for the first Chinese character group, calculating the probability maximum value of each candidate Chinese character in the first Chinese character group by the formula $P_1(i) = \pi_i B_i(O_1)$;

wherein $P_1(i)$ is the probability maximum value of the i-th candidate Chinese character in the first Chinese character group, $\pi_i$ is the statistical probability value of the i-th candidate Chinese character of the first Chinese character group in the initial probability matrix, $B_i(O_1)$ is the statistical probability value corresponding to the pronunciation $O_1$ of the i-th candidate Chinese character of the first Chinese character group in the observation matrix, $O_1$ corresponds to the first Chinese syllable in the syllable sequence, $i = 1, \dots, M_1$, and $M_1$ is the number of candidate Chinese characters in the first Chinese character group;

for the n-th Chinese character group, calculating the probability maximum value of each candidate Chinese character in the n-th Chinese character group by the formula

$$P_n(j) = \max_{1 \le h \le M_{n-1}} \left[ P_{n-1}(h)\, A_{hj} \right] B_j(O_n);$$

finding, in the (n-1)-th Chinese character group, the candidate Chinese character corresponding to $P_n(j)$, and associating that candidate Chinese character with the j-th candidate Chinese character in the n-th Chinese character group;

wherein $P_n(j)$ is the probability maximum value of the j-th candidate Chinese character in the n-th Chinese character group, $P_{n-1}(h)$ is the probability maximum value of the h-th candidate Chinese character in the (n-1)-th Chinese character group, $A_{hj}$ is the association probability value, in the state transition matrix, between the j-th candidate Chinese character in the n-th Chinese character group and the h-th candidate Chinese character in the (n-1)-th Chinese character group, $B_j(O_n)$ is the statistical probability value, in the observation matrix, corresponding to the pronunciation $O_n$ of the j-th candidate Chinese character of the n-th Chinese character group, $O_n$ corresponds to the n-th Chinese syllable in the syllable sequence, $j = 1, \dots, M_n$, $M_n$ is the number of candidate Chinese characters in the n-th Chinese character group, $h = 1, \dots, M_{n-1}$, $M_{n-1}$ is the number of candidate Chinese characters in the (n-1)-th Chinese character group, $1 < n \le N$, and $N$ is the number of Chinese syllables in the syllable sequence.

In a possible implementation manner of the first aspect, the determining, according to the calculated probability maxima, a target chinese character corresponding to each of the chinese syllables includes:

for the Nth Chinese syllable in the syllable sequence, recording the candidate Chinese character corresponding to the probability maximum value with the maximum numerical value in the Nth Chinese character group as the target Chinese character corresponding to the Nth Chinese syllable;

and for the kth Chinese syllable in the syllable sequence, marking the candidate Chinese character in the kth Chinese character group, which is associated with the target Chinese character corresponding to the (k + 1) th Chinese syllable in the syllable sequence, as the target Chinese character corresponding to the kth Chinese syllable, wherein k is more than or equal to 1 and is less than N.

In a second aspect, an embodiment of the present application provides a pinyin identification device, including:

the syllable dividing unit is used for dividing Chinese syllables of the pinyin sequence to be identified to obtain a syllable sequence;

a Chinese character determining unit, configured to obtain a set of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and determine a target Chinese character corresponding to each Chinese syllable from a set of candidate characters corresponding to each Chinese syllable based on a preset statistical model, where the preset statistical model is used to represent a correlation between the candidate Chinese characters corresponding to each two Chinese syllables;

and the recognition result unit is used for combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence and recording the sentences as the recognition result of the pinyin sequence.

In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the pinyin identification method according to any one of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the pinyin identification method according to any one of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the pinyin identification method according to any one of the first aspect.

It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.

Compared with the prior art, the embodiment of the application has the advantages that:

in the embodiment of the application, firstly, dividing treatment of Chinese syllables is carried out on a pinyin sequence to be identified to obtain a syllable sequence; then obtaining a group of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and determining a target Chinese character corresponding to each Chinese syllable from a group of candidate characters corresponding to each Chinese syllable based on a preset statistical model; because the preset statistical model is used for representing the relevance between the candidate Chinese characters corresponding to every two Chinese syllables, the target Chinese character determined based on the preset statistical model is the Chinese character which is most consistent with the pinyin sequence semantics; and finally combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence, and recording the sentences as the recognition result of the pinyin sequence. By the method, the accuracy of pinyin identification can be effectively improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without inventive effort.

FIG. 1 is a schematic flow chart of a pinyin identification method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating a method for determining a target Chinese character according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a pinyin identification device provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Furthermore, in the description of the present application and the appended claims, the terms "first," "second," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.

Referring to fig. 1, which is a schematic flow chart of a pinyin identification method provided in an embodiment of the present application, by way of example and not limitation, the method may include the following steps:

S101, dividing the pinyin sequence to be recognized into Chinese syllables to obtain a syllable sequence.

The dividing process here refers to dividing the pinyin sequence into a plurality of Chinese syllables, which form a syllable sequence in the order of the pinyin characters in the pinyin sequence. For example, if the pinyin sequence is "jiazhizaixian", the syllable sequence after the dividing process is "jia", "zhi", "zai", "xian".

Because one Chinese syllable can include 2 pinyin characters or more than 2 pinyin characters, the syllable sequence corresponding to the pinyin sequence can have various conditions. For example: the first three pinyin characters "jia" in the pinyin sequence shown above may be divided into the same chinese syllable "jia", or may be divided into two chinese syllables "ji" and "a". For another example, the last 4 pinyin characters "xian" in the pinyin sequence shown above may be divided into the same chinese syllable "xian", or may be divided into two chinese syllables "xi" and "an".
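The ambiguity described above can be made concrete with a minimal sketch that enumerates every way of splitting a pinyin string into syllables. The syllable inventory `VALID_SYLLABLES` and the function name `all_segmentations` are hypothetical and chosen only for illustration; a real system would use the full inventory of Chinese syllables.

```python
from typing import List, Set

# Hypothetical, deliberately tiny syllable inventory (an assumption for illustration).
VALID_SYLLABLES: Set[str] = {"ji", "a", "jia", "zhi", "zai", "xi", "an", "xian"}


def all_segmentations(pinyin: str, syllables: Set[str] = VALID_SYLLABLES) -> List[List[str]]:
    """Return every way to split `pinyin` into syllables drawn from `syllables`."""
    if not pinyin:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in syllables:
            # Try every split of the remainder and prepend the matched syllable.
            for tail in all_segmentations(pinyin[end:], syllables):
                results.append([head] + tail)
    return results


print(all_segmentations("jia"))   # [['ji', 'a'], ['jia']]
print(all_segmentations("xian"))  # [['xi', 'an'], ['xian']]
```
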

Optionally, one implementation manner of the dividing process in step S101 may include the following steps:

I. Obtaining a preset dictionary tree, wherein each node in the preset dictionary tree corresponds to one pinyin character, and the child nodes of any one node correspond to different pinyin characters.

A node of the preset dictionary tree does not store a key value itself; its key is determined by its position in the tree, and all descendants of the same node share a common prefix. Traversing from the root node down to a given node and concatenating the characters along the path yields the string key of that node. Because of the common prefixes, many meaningless string comparisons can be avoided.

II. Searching a first target node in the preset dictionary tree, wherein the pinyin character corresponding to the first target node is the same as the first pinyin character in the pinyin sequence.

III. Taking the first target node as a parent node, searching for second target nodes in the preset dictionary tree, wherein the r-th second target node corresponds to the (r+1)-th pinyin character in the pinyin sequence, 1 ≤ r ≤ R-1, and R is the number of the pinyin characters in the pinyin sequence.

For example, assuming that the pinyin sequence is "jia" and the first pinyin character is "j", the first target node that is the same as "j" is looked up in the preset dictionary tree. The second pinyin character is "i"; taking the first target node as the parent node, the child node of the first target node that is the same as "i" is found and marked as the 1st second target node. The third pinyin character is "a"; the child node of the 1st second target node (namely a grandchild node of the first target node) that is the same as "a" is found and marked as the 2nd second target node.

In the process of searching for the second target node in the preset dictionary tree, it needs to be ensured that all the child nodes are valid, and each second target node and its parent node can form a valid chinese syllable.
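Steps I to III can be sketched as a character-level trie walk. This is a minimal illustration rather than the implementation of the embodiment: the class `TrieNode`, the `is_syllable` flag, and the greedy longest-match policy in `segment_longest` are all assumptions introduced here, since the text does not state how one division is chosen when several are valid.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # pinyin character -> child TrieNode
        self.is_syllable = False  # True if the root-to-here path spells a valid syllable


def build_trie(syllables):
    """Insert every syllable character by character; siblings always differ."""
    root = TrieNode()
    for syllable in syllables:
        node = root
        for ch in syllable:
            node = node.children.setdefault(ch, TrieNode())
        node.is_syllable = True
    return root


def match_prefixes(root, pinyin, start=0):
    """Walk the trie from pinyin[start]; return every valid syllable found on the way."""
    found, node = [], root
    for pos in range(start, len(pinyin)):
        node = node.children.get(pinyin[pos])
        if node is None:          # no child for this character: stop the walk
            break
        if node.is_syllable:
            found.append(pinyin[start:pos + 1])
    return found


def segment_longest(root, pinyin):
    """Greedy longest-match division (one possible policy, assumed here)."""
    result, pos = [], 0
    while pos < len(pinyin):
        matches = match_prefixes(root, pinyin, pos)
        if not matches:
            raise ValueError(f"no valid syllable at position {pos}")
        result.append(matches[-1])  # matches are ordered shortest to longest
        pos += len(matches[-1])
    return result


trie = build_trie(["ji", "a", "jia", "zhi", "zai", "xi", "an", "xian"])
print(match_prefixes(trie, "jia"))             # ['ji', 'jia']
print(segment_longest(trie, "jiazhizaixian"))  # ['jia', 'zhi', 'zai', 'xian']
```

Greedy longest match can reach a dead end on some inputs where a shorter first syllable would have succeeded, so a production system would likely need backtracking or would keep all valid divisions.
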

S102, obtaining a group of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and determining a target Chinese character corresponding to each Chinese syllable from the group of candidate characters corresponding to that Chinese syllable based on a preset statistical model.

The preset statistical model is used for representing the relevance between the candidate Chinese characters corresponding to every two Chinese syllables. Optionally, the preset statistical model is a hidden Markov model.

The hidden markov model includes 2 state sets, a set of observable states and a set of unobservable states. The data in the observable state set is directly observable, and the data in the unobservable state set is not directly observable. In the embodiment of the application, the data in the observable state set is the Chinese syllables in the syllable sequence, and the data in the unobservable state set is the candidate Chinese characters corresponding to the Chinese syllables. In step S102, a process of determining a target chinese character corresponding to each of the chinese syllables from a set of candidate characters corresponding to each of the chinese syllables based on a preset statistical model is substantially a process of obtaining an optimal combination relationship between the candidate chinese characters according to a syllable sequence.

The hidden markov model also includes an initial probability matrix, a state transition matrix, and an observation matrix.

In an embodiment of the application, the initial probability matrix includes a statistical probability value corresponding to each candidate Chinese character in a first Chinese character group, where the first Chinese character group is the group of candidate Chinese characters corresponding to the first Chinese syllable in the syllable sequence. For example, if the first Chinese syllable in the syllable sequence is "jia" and the corresponding candidate Chinese characters are "plus", "home" and "price", the initial probability matrix includes the statistical probability value corresponding to the "plus" character, the statistical probability value corresponding to the "home" character and the statistical probability value corresponding to the "price" character. It should be noted that the above is only an example and does not limit the candidate Chinese characters corresponding to a Chinese syllable in practical applications.

Optionally, the initial probability matrix may be established as follows: obtain a pre-established corpus, find all Chinese characters that occur in the initial position in the corpus, count the number of occurrences of each such initial Chinese character, calculate the logarithm of that number, and record the logarithm as the statistical probability value of the initial Chinese character. If a Chinese character never appears in the initial position, the statistical probability value corresponding to that Chinese character is 0.
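The counting procedure for the initial probability matrix might look like the following sketch. The function names and the toy corpus are hypothetical, and the corpus is assumed to be a list of strings whose first character counts as the initial Chinese character.

```python
import math
from collections import Counter


def initial_scores(corpus):
    """Map each character that starts a corpus entry to log(number of occurrences)."""
    counts = Counter(entry[0] for entry in corpus if entry)
    return {ch: math.log(n) for ch, n in counts.items()}


def initial_score(scores, ch):
    """Characters never seen in the initial position score 0, as the text states."""
    return scores.get(ch, 0.0)


# Toy corpus (hypothetical): two entries start with 系, one with 西.
toy_corpus = ["系统", "系数", "西方"]
scores = initial_scores(toy_corpus)
print(scores["系"] == math.log(2))  # True
print(initial_score(scores, "化"))  # 0.0
```

Read literally, this scheme gives a character seen exactly once in the initial position the score log 1 = 0, the same value as a character never seen there. An implementation may want to smooth or normalize the counts to separate the two cases.
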

The corpora in the corpus can be captured from web pages or some databases. For example: an article is captured from a webpage, word segmentation processing is carried out on sentences in the article, and all obtained words can be added into a corpus. The richer the corpus is, the more accurate the resulting hidden markov model is.

In the embodiment of the application, the state transition matrix comprises association probability values between candidate Chinese characters corresponding to every two adjacent Chinese syllables in the syllable sequence. The associated probability values are used to indicate the transition probability of transitioning from one Chinese character to another.

Optionally, the establishment process of the state transition matrix may be: for each Chinese character in the initial probability matrix, marking the Chinese character as a front-position Chinese character, counting the rear-position Chinese characters (namely the Chinese characters positioned at the rear side of the Chinese character) of the front-position Chinese characters in the corpus, counting the logarithm of the times of each rear-position Chinese character appearing behind the front-position Chinese character, and taking the logarithm corresponding to each rear-position Chinese character as the association probability value between the front-position Chinese character and the rear-position Chinese character. For example: the front Chinese character is 'one', the rear Chinese characters of 'one' in the corpus are 'one' and 'a few', the logarithms of the times of occurrence of 'one' and 'a few' are counted respectively, the logarithm corresponding to 'one' is recorded as the association probability value between 'one' and 'one', and the logarithm corresponding to 'a few' is recorded as the association probability value between 'one' and 'a few'.
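The state transition counts can be sketched in the same style. Here every adjacent pair of characters in a corpus entry is counted, which is an assumption, since the text restricts the front-position characters to those in the initial probability matrix. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def transition_scores(corpus):
    """For each front-position character, map each rear-position character to log(count)."""
    counts = defaultdict(Counter)
    for entry in corpus:
        # zip pairs each character with the one immediately after it.
        for front, rear in zip(entry, entry[1:]):
            counts[front][rear] += 1
    return {
        front: {rear: math.log(n) for rear, n in followers.items()}
        for front, followers in counts.items()
    }


# Toy corpus (hypothetical): 系 is followed by 统 twice and by 数 once.
toy_corpus = ["系统", "系统化", "系数"]
A = transition_scores(toy_corpus)
print(A["系"]["统"] == math.log(2))  # True
print(A["系"]["数"])                 # 0.0
```
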

In the embodiment of the application, the observation matrix comprises the statistical probability value corresponding to the pronunciation of each candidate Chinese character. In practical applications, a Chinese character may be a polyphonic character, corresponding to a plurality of pronunciations. The observation matrix can be used to distinguish the case of polyphones.

Optionally, the observation matrix may be established as follows: count the pronunciations of each Chinese character in the corpus and the number of occurrences of each pronunciation, and record the logarithm of that number as the statistical probability value corresponding to the pronunciation.
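A sketch of the observation counts follows. It assumes the corpus is available as (character, syllable) pairs, an annotation format the text does not specify. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def observation_scores(annotated):
    """For each character, map each observed pronunciation to log(count)."""
    counts = defaultdict(Counter)
    for ch, syllable in annotated:
        counts[ch][syllable] += 1
    return {
        ch: {syllable: math.log(n) for syllable, n in readings.items()}
        for ch, readings in counts.items()
    }


# Toy annotated corpus (hypothetical): 系 is polyphonic, read "xi" twice and "ji" once.
annotated = [("系", "xi"), ("系", "xi"), ("系", "ji"), ("统", "tong")]
B = observation_scores(annotated)
print(sorted(B["系"]))              # ['ji', 'xi']
print(B["系"]["xi"] == math.log(2))  # True
```
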

S103, combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence, and recording the sentences as the recognition result of the pinyin sequence.

In the embodiment of the application, the preset statistical model is used for representing the relevance between the candidate Chinese characters corresponding to every two Chinese syllables, so that the target Chinese character determined based on the preset statistical model is the Chinese character which best accords with the pinyin sequence semantics; and finally combining the target Chinese characters into sentences according to the sequence of the Chinese syllables in the syllable sequence, and recording the sentences as the recognition result of the pinyin sequence. By the method, the accuracy of pinyin identification can be effectively improved.

Referring to fig. 2, which is a schematic flowchart of a method for determining a target chinese character according to an embodiment of the present application, by way of example and not limitation, in step S102, the method for determining a target chinese character corresponding to each chinese syllable from a set of candidate characters corresponding to each chinese syllable based on a preset statistical model may include the following steps:

S201, sequentially calculating the probability maximum value of each candidate Chinese character corresponding to each Chinese syllable according to the sequence of the Chinese syllables in the syllable sequence.

When calculating the probability maximum of each candidate Chinese character, the following two cases can be classified:

Case one: for the first Chinese character group.

The probability maximum value of each candidate Chinese character in the first Chinese character group is calculated by the formula $P_1(i) = \pi_i B_i(O_1)$.

Case two: for the n-th Chinese character group, where $1 < n \le N$ and $N$ is the number of Chinese syllables in the syllable sequence.

The probability maximum value of each candidate Chinese character in the n-th Chinese character group is calculated by the formula

$$P_n(j) = \max_{1 \le h \le M_{n-1}} \left[ P_{n-1}(h)\, A_{hj} \right] B_j(O_n).$$

The candidate Chinese character corresponding to $P_n(j)$ is then found in the (n-1)-th Chinese character group and associated with the j-th candidate Chinese character in the n-th Chinese character group.

Here, $P_n(j)$ is the probability maximum value of the j-th candidate Chinese character in the n-th Chinese character group, $P_{n-1}(h)$ is the probability maximum value of the h-th candidate Chinese character in the (n-1)-th Chinese character group, $A_{hj}$ is the association probability value, in the state transition matrix, between the j-th candidate Chinese character in the n-th Chinese character group and the h-th candidate Chinese character in the (n-1)-th Chinese character group, $B_j(O_n)$ is the statistical probability value, in the observation matrix, corresponding to the pronunciation $O_n$ of the j-th candidate Chinese character of the n-th Chinese character group, $O_n$ corresponds to the n-th Chinese syllable in the syllable sequence, $j = 1, \dots, M_n$, $M_n$ is the number of candidate Chinese characters in the n-th Chinese character group, $h = 1, \dots, M_{n-1}$, $M_{n-1}$ is the number of candidate Chinese characters in the (n-1)-th Chinese character group, $1 < n \le N$, and $N$ is the number of Chinese syllables in the syllable sequence.

For example, assume that the candidate Chinese characters of the first Chinese character group in the syllable sequence are "system", "west" and "xi", the candidate Chinese characters of the second Chinese character group are "system", "open" and "same", and the candidate Chinese characters of the third Chinese character group are "change", "talk" and "hua". The initial probability matrix is $\pi = [0.6, 0.2, 0.2]^T$, and the state transition matrix is

[state transition matrix $A$; the matrix is not reproduced in this text]

In the observation matrix, $B_1(O_1) = 0.5$ and $B_3(O_3) = 0.5$, because the first candidate of the first Chinese character group ("system") and "hua" are polyphonic characters; the statistical probability values of the pronunciations corresponding to the remaining candidate characters are all 1.

For the first Chinese character group, $P_1(1) = \pi_1 B_1(O_1) = 0.6 \times 0.5 = 0.3$, $P_1(2) = \pi_2 B_2(O_1) = 0.2 \times 1 = 0.2$, and $P_1(3) = \pi_3 B_3(O_1) = 0.2 \times 1 = 0.2$.

For the second Chinese character group, $P_2(1) = \max_{1 \le h \le 3}[P_1(h) A_{h1}]\, B_1(O_2) = \max[0.3 \times 0.8,\ 0.2 \times 0.1,\ 0.2 \times 0.1] \times 0.5 = 0.12$, where the maximum of $P_1(h) A_{h1}$ is $0.3 \times 0.8$ and 0.3 corresponds to "system" in the first Chinese character group; therefore $P_2(1)$ is associated with the candidate Chinese character "system" in the first Chinese character group. By analogy, $P_2(2) = 0.21$ and $P_2(2)$ is associated with the candidate Chinese character "system" in the first Chinese character group; $P_2(3) = 0.21$ and $P_2(3)$ is associated with the candidate Chinese character "system" in the first Chinese character group.

For the third Chinese character group, $P_3(1) = \max_{1 \le h \le 3}[P_2(h) A_{h1}]\, B_1(O_3) = \max[0.12 \times 0.8,\ 0.21 \times 0.1,\ 0.21 \times 0.1] \times 1 = 0.096$, where the maximum of $P_2(h) A_{h1}$ is $0.12 \times 0.8$ and 0.12 corresponds to "system" in the second Chinese character group; therefore $P_3(1)$ is associated with the candidate Chinese character "system" in the second Chinese character group. By analogy, $P_3(2) = 0.084$ and $P_3(2)$ is associated with the candidate Chinese character "system" in the second Chinese character group; $P_3(3) = 0.042$ and $P_3(3)$ is associated with the candidate Chinese character "system" in the second Chinese character group.

S202, determining the target Chinese character corresponding to each Chinese syllable according to the calculated probability maximum values.

Optionally, only the one sentence that best matches the pinyin sequence may be determined. Specifically, one implementation manner of step S202 is:

1) For the N-th Chinese syllable in the syllable sequence, recording the candidate Chinese character in the N-th Chinese character group whose probability maximum value is the largest as the target Chinese character corresponding to the N-th Chinese syllable.

2) For the k-th Chinese syllable in the syllable sequence, marking the candidate Chinese character in the k-th Chinese character group that is associated with the target Chinese character corresponding to the (k+1)-th Chinese syllable as the target Chinese character corresponding to the k-th Chinese syllable, wherein 1 ≤ k < N.

Continuing with the example in step S201, the largest probability maximum value in the third Chinese character group is $P_3(1) = 0.096$, and the corresponding candidate Chinese character is "change", i.e., "change" is the target Chinese character corresponding to the 3rd Chinese syllable. "Change" is associated with "system" in the second Chinese character group, and that "system" is associated with "system" in the first Chinese character group, so the three target Chinese characters finally obtained are the first candidates of the first, second and third Chinese character groups, which together form the word "systematization".
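The forward calculation of S201 and the backward selection of S202 together form the Viterbi algorithm, which can be sketched as follows. The function `viterbi` and the data layout are assumptions made for illustration. The state transition matrix of the example is not available in this text, so only its first column (0.8, 0.1, 0.1) comes from the worked arithmetic; the entries `A[0][1] = A[0][2] = 0.7` are inferred from the stated results 0.21, 0.084 and 0.042, and all remaining entries are placeholders set to 0.1. The emission value 0.5 for the first candidate of the second group is taken from the arithmetic as written.

```python
def viterbi(pi, A, B):
    """Return (best_path, P).

    pi[i]   : initial value of candidate i in the first group
    A[h][j] : transition value from candidate h (previous group) to candidate j
    B[n][j] : observation value of candidate j in group n
    """
    P = [[pi[i] * B[0][i] for i in range(len(pi))]]  # case one
    back = []                                        # back[n-1][j] = associated h
    for n in range(1, len(B)):                       # case two
        prev, row, ptr = P[-1], [], []
        for j in range(len(B[n])):
            h_best = max(range(len(prev)), key=lambda h: prev[h] * A[h][j])
            row.append(prev[h_best] * A[h_best][j] * B[n][j])
            ptr.append(h_best)
        P.append(row)
        back.append(ptr)
    # S202: start from the best candidate of the last group and follow the associations.
    j = max(range(len(P[-1])), key=lambda i: P[-1][i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path, P


pi = [0.6, 0.2, 0.2]
A = [[0.8, 0.7, 0.7],   # 0.7 entries inferred from the stated results
     [0.1, 0.1, 0.1],   # entries outside the first column are placeholders
     [0.1, 0.1, 0.1]]
B = [[0.5, 1.0, 1.0],   # group 1: first candidate is polyphonic
     [0.5, 1.0, 1.0],   # group 2: 0.5 as used in the worked arithmetic
     [1.0, 1.0, 0.5]]   # group 3: third candidate is polyphonic

path, P = viterbi(pi, A, B)
print(path)  # [0, 0, 0]: the first candidate of each group
```

With these assumed matrices the table `P` reproduces the values in the worked example up to floating-point rounding (0.3, 0.2, 0.2; 0.12, 0.21, 0.21; 0.096, 0.084, 0.042).
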

In this way, the sentence that best matches the syllable sequence can be selected dynamically. In addition, calculating the probability maximum value of each candidate Chinese character in this manner is equivalent to discarding the lower-probability sentences among all possible combined sentences, which saves computation time and improves the accuracy of the algorithm.
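Steps S201 and S202 together can be sketched as follows. This is a minimal illustration, not the implementation of the application: the dictionary-based layout of the three matrices, the function name `viterbi`, and all probability values are assumptions chosen for readability.

```python
def viterbi(syllables, candidates, pi, A, B):
    """Select one target character per syllable and return (sentence, probability).

    candidates[s] -- candidate characters for syllable s
    pi[c]         -- initial probability of character c
    A[(h, c)]     -- association probability of adjacent characters h -> c
    B[(c, s)]     -- probability that character c has pronunciation s
    """
    s0 = syllables[0]
    # First group: P_1(i) = pi_i * B_i(O_1)
    prob = {c: pi.get(c, 0.0) * B.get((c, s0), 0.0) for c in candidates[s0]}
    links = []  # one dict per later group: candidate -> associated previous candidate
    for syl in syllables[1:]:
        new_prob, link = {}, {}
        for c in candidates[syl]:
            # P_n(j) = max_h [P_{n-1}(h) * A_hj] * B_j(O_n)
            h = max(prob, key=lambda prev: prob[prev] * A.get((prev, c), 0.0))
            new_prob[c] = prob[h] * A.get((h, c), 0.0) * B.get((c, syl), 0.0)
            link[c] = h
        links.append(link)
        prob = new_prob
    # S202 1): best candidate of the last group; 2): follow the associations back.
    last = max(prob, key=prob.get)
    path = [last]
    for link in reversed(links):
        path.append(link[path[-1]])
    path.reverse()
    return "".join(path), prob[last]


# Hypothetical toy model (all values invented for illustration).
candidates = {"xi": ["系", "西"], "tong": ["统", "同"], "hua": ["化", "花"]}
pi = {"系": 0.6, "西": 0.4}
A = {("系", "统"): 0.8, ("系", "同"): 0.1, ("西", "统"): 0.1, ("西", "同"): 0.3,
     ("统", "化"): 0.7, ("统", "花"): 0.1, ("同", "化"): 0.2, ("同", "花"): 0.2}
B = {(c, s): 1.0 for s, chars in candidates.items() for c in chars}
sentence, p = viterbi(["xi", "tong", "hua"], candidates, pi, A, B)
```

Only one probability per candidate is kept for each group, so the work grows with the product of adjacent group sizes rather than with the number of possible sentences.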

The above manner selects only the single best-matching sentence. In practical applications, optionally, all sentences that match the pinyin sequence may be determined for the user to choose from.
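One way to produce every matching sentence is to score all combinations of candidates and sort them. The sketch below assumes the same dictionary layout for the matrices as a simple illustration; the function name `rank_sentences` and the probability values are hypothetical, and exhaustive enumeration is only practical for short sequences.

```python
from itertools import product


def rank_sentences(syllables, candidates, pi, A, B):
    """Score every combination of candidates; return (sentence, probability) pairs, best first."""
    ranked = []
    for combo in product(*(candidates[s] for s in syllables)):
        p = pi.get(combo[0], 0.0) * B.get((combo[0], syllables[0]), 0.0)
        for n in range(1, len(combo)):
            p *= A.get((combo[n - 1], combo[n]), 0.0)
            p *= B.get((combo[n], syllables[n]), 0.0)
        if p > 0.0:  # drop combinations the model considers impossible
            ranked.append(("".join(combo), p))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical toy model (all values invented for illustration).
candidates = {"xi": ["系", "西"], "tong": ["统", "同"]}
pi = {"系": 0.6, "西": 0.4}
A = {("系", "统"): 0.8, ("系", "同"): 0.1, ("西", "统"): 0.1, ("西", "同"): 0.3}
B = {(c, s): 1.0 for s, chars in candidates.items() for c in chars}
ranked = rank_sentences(["xi", "tong"], candidates, pi, A, B)
order = [text for text, _ in ranked]
```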

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Corresponding to the method described in the above embodiment, fig. 3 is a block diagram of a pinyin identification device provided in an embodiment of the present application, and for convenience of explanation, only the parts related to the embodiment of the present application are shown.

Referring to fig. 3, the apparatus includes:

a syllable dividing unit 31, configured to divide the pinyin sequence to be identified into Chinese syllables to obtain a syllable sequence;

a Chinese character determining unit 32, configured to obtain a group of candidate Chinese characters corresponding to each Chinese syllable in the syllable sequence, and to determine, based on a preset statistical model, a target Chinese character corresponding to each Chinese syllable from the group of candidate Chinese characters corresponding to that syllable, where the preset statistical model is used to represent the correlation between the candidate Chinese characters corresponding to every two Chinese syllables;

a recognition result unit 33, configured to combine the target Chinese characters into a sentence according to the order of the Chinese syllables in the syllable sequence, and to record the sentence as the recognition result of the pinyin sequence.
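How the three units hand data to one another can be sketched as a small Python class. The class name `PinyinRecognizer`, the injected callables, and the stand-in behaviours (splitting on an apostrophe, always taking the first candidate) are assumptions made only so the sketch runs; they do not reflect the division or selection logic of the application.

```python
class PinyinRecognizer:
    """Chains three callables in the order of units 31, 32 and 33."""

    def __init__(self, divide, candidates_for, choose_targets):
        self.divide = divide                  # role of syllable dividing unit 31
        self.candidates_for = candidates_for  # candidate lookup used by unit 32
        self.choose_targets = choose_targets  # statistical selection used by unit 32

    def recognize(self, pinyin):
        syllables = self.divide(pinyin)
        groups = [self.candidates_for(s) for s in syllables]
        targets = self.choose_targets(syllables, groups)
        return "".join(targets)               # role of recognition result unit 33


# Hypothetical stand-ins, only to make the sketch executable.
TABLE = {"xi": ["系", "西"], "tong": ["统", "同"]}
recognizer = PinyinRecognizer(
    divide=lambda text: text.split("'"),
    candidates_for=lambda syllable: TABLE[syllable],
    choose_targets=lambda syllables, groups: [g[0] for g in groups],
)
result = recognizer.recognize("xi'tong")
```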

Optionally, the syllable dividing unit 31 includes:

a dictionary tree acquisition module, configured to acquire a preset dictionary tree, where each node in the preset dictionary tree corresponds to one pinyin character, and the pinyin characters corresponding to the child nodes of any one node are all different;

a first node searching module, configured to search the preset dictionary tree for a first target node, where the pinyin character corresponding to the first target node is the same as the first pinyin character in the pinyin sequence;

a second node searching module, configured to search the preset dictionary tree for second target nodes, taking the first target node as the parent node, where the r-th second target node corresponds to the (r+1)-th pinyin character in the pinyin sequence, 1 ≤ r ≤ R−1, and R is the number of pinyin characters in the pinyin sequence.
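A dictionary tree of this kind can be sketched in Python as below. The `is_syllable` end marker and the greedy longest-match rule used to cut the sequence are assumptions added for illustration; the names `TrieNode`, `build_trie` and `divide` are hypothetical, and the text above does not state how ambiguous divisions are resolved.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # pinyin character -> child node (keys are distinct)
        self.is_syllable = False  # assumed marker: path from the root spells a syllable


def build_trie(syllables):
    root = TrieNode()
    for syllable in syllables:
        node = root
        for ch in syllable:
            node = node.children.setdefault(ch, TrieNode())
        node.is_syllable = True
    return root


def divide(pinyin, root):
    """Cut `pinyin` into syllables by repeatedly taking the longest match in the trie."""
    result, start = [], 0
    while start < len(pinyin):
        node, end = root, None
        for pos in range(start, len(pinyin)):
            node = node.children.get(pinyin[pos])  # descend from parent to child
            if node is None:
                break
            if node.is_syllable:
                end = pos + 1                      # remember the longest match so far
        if end is None:
            raise ValueError(f"no syllable matches at position {start}")
        result.append(pinyin[start:end])
        start = end
    return result


root = build_trie(["xi", "xian", "tong", "hua"])
parts = divide("xitonghua", root)
```

Greedy matching can pick the wrong cut when one syllable is a prefix of another, so a fuller version would need backtracking or scoring across alternative divisions.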

Optionally, the preset statistical model is a hidden Markov model. The hidden Markov model comprises an initial probability matrix, a state transition matrix and an observation matrix. The initial probability matrix comprises a statistical probability value corresponding to each candidate Chinese character in a first Chinese character group, where the first Chinese character group is the group of candidate Chinese characters corresponding to the first Chinese syllable in the syllable sequence. The state transition matrix comprises the association probability values between the candidate Chinese characters corresponding to every two adjacent Chinese syllables in the syllable sequence. The observation matrix comprises the statistical probability value corresponding to the pronunciation of each candidate Chinese character.
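The passage above does not say how the statistical probability values are obtained. Purely as an assumption, the sketch below estimates the three matrices as relative frequencies counted from a small corpus; the function name `estimate_hmm`, the input layout, and the sample data are all hypothetical.

```python
from collections import Counter


def estimate_hmm(corpus, pronunciations):
    """Estimate (pi, A, B) by counting.

    corpus         -- list of sentences, each a string of Chinese characters
    pronunciations -- list of observed (character, syllable) pairs
    """
    first = Counter(s[0] for s in corpus if s)
    pairs = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
    prev = Counter(a for s in corpus for a in s[:-1])
    pron = Counter(pronunciations)
    per_char = Counter(c for c, _ in pronunciations)

    total_first = sum(first.values())
    pi = {c: n / total_first for c, n in first.items()}     # initial probabilities
    A = {(a, b): n / prev[a] for (a, b), n in pairs.items()}  # transition probabilities
    B = {(c, o): n / per_char[c] for (c, o), n in pron.items()}  # observation probabilities
    return pi, A, B


# Hypothetical sample data, invented for illustration.
corpus = ["系统", "系统", "西同", "系同"]
pronunciations = [("系", "xi"), ("系", "xi"), ("系", "ji")]
pi, A, B = estimate_hmm(corpus, pronunciations)
```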

Optionally, the Chinese character determining unit 32 includes:

a calculation module, configured to sequentially calculate the probability maximum value of each candidate Chinese character corresponding to each Chinese syllable according to the order of the Chinese syllables in the syllable sequence;

a determining module, configured to determine the target Chinese character corresponding to each Chinese syllable according to the calculated probability maximum values.

Optionally, the calculation module is further configured to:

for the first Chinese character group, by the formula $P_1(i) = \pi_i B_i(O_1)$, calculate the probability maximum value of each candidate Chinese character in the first Chinese character group; wherein $P_1(i)$ is the probability maximum value of the ith candidate Chinese character in the first Chinese character group, $\pi_i$ is the statistical probability value of the ith candidate Chinese character of the first Chinese character group in the initial probability matrix, $B_i(O_1)$ is the statistical probability value in the observation matrix corresponding to the pronunciation $O_1$ of the ith candidate Chinese character of the first Chinese character group, $O_1$ corresponds to the first Chinese syllable in the syllable sequence, $i = 1, \dots, M_1$, and $M_1$ is the number of candidate Chinese characters in the first Chinese character group.

Optionally, the calculation module is further configured to:

for the nth Chinese character group, calculate the probability maximum value of each candidate Chinese character by the formula

$P_n(j) = \max_{1 \le h \le M_{n-1}} [P_{n-1}(h) A_{hj}] B_j(O_n)$;

look up the candidate Chinese character in the (n-1)th Chinese character group that corresponds to $P_n(j)$, and associate that candidate Chinese character with the jth candidate Chinese character in the nth Chinese character group; wherein $P_n(j)$ is the probability maximum value of the jth candidate Chinese character in the nth Chinese character group, $P_{n-1}(h)$ is the probability maximum value of the hth candidate Chinese character in the (n-1)th Chinese character group, $A_{hj}$ is the association probability value in the state transition matrix between the jth candidate Chinese character in the nth Chinese character group and the hth candidate Chinese character in the (n-1)th Chinese character group, $B_j(O_n)$ is the statistical probability value in the observation matrix corresponding to the pronunciation $O_n$ of the jth candidate Chinese character of the nth Chinese character group, $O_n$ is consistent with the nth Chinese syllable in the syllable sequence, $j = 1, \dots, M_n$, $M_n$ is the number of candidate Chinese characters in the nth Chinese character group, $h = 1, \dots, M_{n-1}$, $M_{n-1}$ is the number of candidate Chinese characters in the (n-1)th Chinese character group, $1 < n \le N$, and N is the number of Chinese syllables in the syllable sequence.

Optionally, the determining module is further configured to:

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

The apparatus shown in fig. 3 may be a software unit, a hardware unit, or a combined software and hardware unit built into an existing terminal device; it may be integrated into the terminal device as a separate add-on component, or it may exist as a separate terminal device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 4, the terminal device 4 of this embodiment includes: at least one processor 40 (only one shown in fig. 4), a memory 41, and a computer program 42 stored in the memory 41 and operable on the at least one processor 40, wherein the processor 40 executes the computer program 42 to implement the steps of any of the pinyin identification method embodiments described above.

The terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 4 is merely an example of the terminal device 4 and does not constitute a limitation on it; the terminal device 4 may include more or fewer components than those shown, may combine some components, or may use different components, and may, for example, further include an input-output device, a network access device, and the like.

The processor 40 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

In some embodiments, the memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. In other embodiments, the memory 41 may be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 4. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 41 may also be used to temporarily store data that has been output or is to be output.

The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.

The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above method embodiments.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. The above-described embodiments of the apparatus/terminal device are merely illustrative. For example, the division into modules or units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A pinyin identification method, comprising:

2. The pinyin identification method of claim 1, wherein said dividing the pinyin sequence to be identified into Chinese syllables to obtain a syllable sequence comprises:

3. The pinyin identification method of claim 1, wherein the preset statistical model is a hidden Markov model;

4. The pinyin identification method as claimed in claim 3, wherein said determining the target Chinese character corresponding to each said Chinese syllable from the group of candidate Chinese characters corresponding to each said Chinese syllable based on a preset statistical model comprises:

5. The pinyin identification method of claim 4, wherein said sequentially calculating the probability maximum value of each said candidate Chinese character corresponding to each said Chinese syllable according to the order of said Chinese syllables in said syllable sequence comprises:

6. The pinyin identification method of claim 4, wherein said sequentially calculating the probability maximum value of each said candidate Chinese character corresponding to each said Chinese syllable according to the order of said Chinese syllables in said syllable sequence comprises:

for the nth Chinese character group, by the formula

$P_n(j) = \max_{1 \le h \le M_{n-1}} [P_{n-1}(h) A_{hj}] B_j(O_n)$;

wherein $P_n(j)$ is the probability maximum value of the jth candidate Chinese character in the nth Chinese character group, $P_{n-1}(h)$ is the probability maximum value of the hth candidate Chinese character in the (n-1)th Chinese character group, $A_{hj}$ is the association probability value in the state transition matrix between the jth candidate Chinese character in the nth Chinese character group and the hth candidate Chinese character in the (n-1)th Chinese character group, $B_j(O_n)$ is the statistical probability value in the observation matrix corresponding to the pronunciation $O_n$ of the jth candidate Chinese character of the nth Chinese character group, $O_n$ is consistent with the nth Chinese syllable in the syllable sequence, $j = 1, \dots, M_n$, $M_n$ is the number of candidate Chinese characters in the nth Chinese character group, $h = 1, \dots, M_{n-1}$, $M_{n-1}$ is the number of candidate Chinese characters in the (n-1)th Chinese character group, $1 < n \le N$, and N is the number of Chinese syllables in the syllable sequence.

7. The pinyin identification method as claimed in claim 6, wherein said determining the target Chinese character corresponding to each said Chinese syllable according to said calculated probability maximum values comprises:

8. A pinyin identification device, comprising:

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.