CN112579734A

CN112579734A - Pronunciation prediction method and electronic equipment

Info

Publication number: CN112579734A
Application number: CN201910942223.2A
Authority: CN
Inventors: 陈天峰; 冯大航; 陈孝良; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2021-03-30

Abstract

The invention provides a pronunciation prediction method and electronic equipment, wherein the method comprises the following steps: acquiring a word to be detected; predicting the word to be detected through a pre-acquired first network to acquire the pronunciation of the word to be detected; wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words. According to the method provided by the invention, the words which can be predicted by the first network are not limited to the words which are labeled manually, the number of words which can predict pronunciation is increased, and the labor cost is reduced.

Description

Pronunciation prediction method and electronic equipment

Technical Field

The invention relates to the technical field of computers, in particular to a pronunciation prediction method and electronic equipment.

Background

In scenarios such as speech recognition and voice wake-up, mixed Chinese and foreign-language vocabulary, such as foreign-language trademarks, personal names, and proper nouns, needs to be supported. Because Chinese and foreign languages have different pronunciation units, they usually require different modeling units. By exploiting the similarity between Chinese and foreign-language pronunciations, a purely Chinese language model can support recognition of and wake-up by foreign-language words if the Chinese pronunciations of those words are added to it.

At present, a pronunciation dictionary is generated by adopting a manual marking mode to obtain foreign words and Chinese pronunciations of the foreign words. However, the pronunciation dictionary obtained by manual labeling is limited in capacity, and for foreign words that do not appear in the pronunciation dictionary, Chinese pronunciations of the foreign words cannot be obtained from the pronunciation dictionary.

Disclosure of Invention

The embodiments of the invention provide a pronunciation prediction method and electronic equipment, which aim to solve the problem that a manually labeled pronunciation dictionary has limited capacity, so that Chinese pronunciations can be obtained for only a small number of foreign words.

In order to solve the technical problem, the invention is realized as follows:

In a first aspect, an embodiment of the present invention provides a pronunciation prediction method applied to an electronic device, including:

acquiring a word to be detected;

predicting the word to be detected through a pre-acquired first network to acquire the pronunciation of the word to be detected;

wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words.

In a second aspect, an embodiment of the present invention further provides an electronic device, including:

the acquisition module is used for acquiring the word to be detected;

the prediction module is used for predicting the word to be detected through a pre-acquired first network to acquire the pronunciation of the word to be detected;

In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the pronunciation prediction method.

In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements the steps of the pronunciation prediction method.

In the embodiment of the invention, a word to be detected is acquired, and the word to be detected is predicted through a pre-acquired first network to acquire the pronunciation of the word to be detected, wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words. Because the characters determined from the pronunciation dictionary can be combined into more words than were labeled manually, the words the first network can predict are not limited to the manually labeled words. The number of words whose pronunciation can be predicted is therefore increased, the labor cost is reduced, and the prediction efficiency is improved.

Drawings

FIG. 1 is a flowchart of a pronunciation prediction method provided by an embodiment of the present invention;

FIG. 2 is a second flowchart of a pronunciation prediction method according to an embodiment of the present invention;

FIG. 3 is a block diagram of an electronic device according to an embodiment of the present invention;

FIG. 4 is a second block diagram of an electronic device according to an embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart of a pronunciation prediction method according to an embodiment of the present invention, and as shown in fig. 1, the embodiment provides a pronunciation prediction method applied to an electronic device, including the following steps:

step 101, obtaining a word to be detected.

The word to be detected may be a word composed of a plurality of characters, for example, an English word, a German word, or a Russian word. The pronunciations of the characters in a word together constitute the pronunciation of the word.

Step 102, predicting the word to be tested through a pre-acquired first network to acquire the pronunciation of the word to be tested; wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words.

The words in the pronunciation dictionary and the pronunciations corresponding to the words can be acquired by manual labeling. The words in the pronunciation dictionary all have corresponding pronunciations. The first network may be determined based on a pronunciation dictionary.

Wherein the acquiring process of the first network comprises:

constructing a prediction network according to the sample character set and the sample pronunciation set, and optimizing the prediction network to obtain a first network; the arc of the prediction network comprises a first input value, a first output value and a pronunciation score value, wherein the first input value is a character in the sample character set, the first output value is a pronunciation in the sample pronunciation set, and the pronunciation score value is a likelihood that the first input value corresponds to the first output value.

When constructing the prediction network, a lattice network may be constructed. Each arc of the network has a start point and an end point and carries a first input value, a first output value, and a pronunciation score value. The first input value is a character in the sample character set, the first output value is a pronunciation in the sample pronunciation set, and the pronunciation score value is the likelihood that the first input value corresponds to the first output value. The greater the pronunciation score value, the more likely it is that the character on the arc is pronounced as the pronunciation on the arc; for example, if the first input value is a first character and the first output value is a first pronunciation, a greater pronunciation score value means the first character is more likely to be pronounced as the first pronunciation. The pronunciation score values in the prediction network may initially be preset values or random values, which is not limited herein. Each path in the prediction network corresponds to a word: a path comprises a plurality of arcs, the first input values of the arcs form the word, the first output values of the arcs form a pronunciation of the word, and the sum of the pronunciation score values of the arcs represents the likelihood that the word has that pronunciation.

For example, these relationships can be expressed by a lattice network in which each arc carries an input value, an output value, and a pronunciation score value, and may also include a start point and an end point. The input value is an element of set A (i.e., the sample character set), the output value is an element of set B (i.e., the sample pronunciation set), and the pronunciation score value represents the probability that the input value maps to the output value. The pronunciation score value may initially be a random value.
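The arc structure described above can be sketched as a small data type. This is a minimal illustration only: the class name `Arc`, the helper `read_path`, and the example pronunciations and scores are hypothetical and are not taken from the original text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Arc:
    start: int    # start point of the arc
    end: int      # end point of the arc
    char: str     # input value: an element of set A (sample character set)
    pron: str     # output value: an element of set B (sample pronunciation set)
    score: float  # pronunciation score value for char -> pron


def read_path(path: List[Arc]) -> Tuple[str, str, float]:
    """Return (word, pronunciation, path score) for a chain of connected arcs."""
    for prev, nxt in zip(path, path[1:]):
        if prev.end != nxt.start:
            raise ValueError("arcs do not form a connected path")
    word = "".join(arc.char for arc in path)    # input values spell the word
    pron = " ".join(arc.pron for arc in path)   # output values form the pronunciation
    total = sum(arc.score for arc in path)      # path score is the sum of arc scores
    return word, pron, total


# Hypothetical path for the word "alex"; the scores are arbitrary initial values.
example_path = [
    Arc(0, 1, "a", "ai", -0.5),
    Arc(1, 2, "le", "li", -1.0),
    Arc(2, 3, "x", "ke si", -0.25),
]
```

Calling `read_path(example_path)` recovers the word, its pronunciation, and the summed score from the arcs of one path.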

Because the pronunciation score values in the prediction network are initially preset or random values, they are not accurate enough. The prediction network therefore needs to be optimized: its pronunciation score values are adjusted to improve their accuracy. The first network can be regarded as the prediction network after its pronunciation score values have been repeatedly optimized, so its pronunciation score values are more accurate than those of the initial prediction network. An arc of the first network includes a first input value, a first output value, and an optimized pronunciation score value.

The pronunciation of the word to be detected is predicted by using the first network constructed from the sample character set and the sample pronunciation set. The words the first network can predict are the words that can be formed from the characters in the sample character set, and those characters can be obtained from the manually labeled words. Since the characters in the sample character set can be combined into more words than were labeled manually, the words the first network can predict are not limited to the manually labeled words; the number of words whose pronunciation can be predicted is increased, the labor cost is reduced, and the prediction efficiency is improved.

In an embodiment of the present invention, the electronic device may be a mobile phone, a tablet personal computer, a laptop computer, a personal digital assistant (PDA), a mobile Internet device (MID), a wearable device, or the like.

In the pronunciation prediction method of the embodiment of the invention, a word to be detected is acquired, and the word to be detected is predicted through a pre-acquired first network to acquire the pronunciation of the word to be detected, wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words. Because the first network is determined according to the pronunciation dictionary, the words the first network can predict are the words that can be formed from the characters determined from the pronunciation dictionary, and those characters can be obtained from the manually labeled words. These characters can be combined into more words than were labeled manually, so the words the first network can predict are not limited to the manually labeled words; the number of words whose pronunciation can be predicted is increased, the labor cost is reduced, and the prediction efficiency is improved.

Further, before the constructing a prediction network according to the sample character set and the sample pronunciation set, the obtaining process of the first network further includes:

and obtaining the sample character set and the sample pronunciation set according to a pronunciation dictionary.

Specifically, the words in the pronunciation dictionary and the pronunciations corresponding to the words can be obtained by manual labeling, and the sample character set and the sample pronunciation set are obtained from the pronunciation dictionary. The characters in the sample character set are obtained by splitting the words. The pronunciations in the sample pronunciation set are candidate pronunciations of the characters in the sample character set; a candidate is not necessarily correct, and how likely it is to be correct is represented by its pronunciation score value. One character in the sample character set may correspond to one pronunciation in the sample pronunciation set. Because the sample character set and the sample pronunciation set are derived from the manually labeled pronunciation dictionary, the finally obtained first network can predict the pronunciations of words that are not included in the pronunciation dictionary, in addition to the words that are.
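One possible way to derive the two sets from a pronunciation dictionary is sketched below. The original text does not say how words are split, so this sketch assumes that every substring of up to `max_chars` characters is a candidate character and every run of up to `max_units` pronunciation units is a candidate pronunciation. The function name, both parameters, and the toy dictionary entry are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_sample_sets(
    pron_dict: Dict[str, List[str]],
    max_chars: int = 3,
    max_units: int = 2,
) -> Tuple[Set[str], Set[str]]:
    """Collect candidate characters (set A) and pronunciations (set B)."""
    chars: Set[str] = set()
    prons: Set[str] = set()
    for word, units in pron_dict.items():
        # Assumption: any substring of limited length may act as a "character".
        for i in range(len(word)):
            for j in range(i + 1, min(i + max_chars, len(word)) + 1):
                chars.add(word[i:j])
        # Assumption: any short run of pronunciation units may be an output.
        for i in range(len(units)):
            for j in range(i + 1, min(i + max_units, len(units)) + 1):
                prons.add(" ".join(units[i:j]))
    return chars, prons


# Toy dictionary with one manually labeled entry (illustrative values only).
toy_dict = {"alex": ["ai", "li", "ke", "si"]}
set_a, set_b = build_sample_sets(toy_dict)
```

Which candidates actually survive would then depend on the scores learned during optimization; unlikely candidates can be dropped from the sets.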

Further, the optimizing the predicted network to obtain the first network includes:

obtaining an optimal path of the word according to the prediction network, wherein the optimal path comprises each character of the word, pronunciation corresponding to each character, and a first pronunciation score value of each character of the word;

inputting each character of the word and the pronunciation corresponding to each character into a Gaussian mixture model GMM to obtain a second pronunciation score value of each character of the word;

if the difference value between the first pronunciation score value and the second pronunciation score value of each character of the word is not larger than a preset threshold value, determining the prediction network as a first network;

and if the difference value between the first pronunciation score value and the second pronunciation score value of at least one character of the word is larger than a preset threshold value, updating the pronunciation score value of the arc of the prediction network according to the second pronunciation score value, and executing the step of obtaining the optimal path of the word according to the prediction network.

Specifically, the optimal path is the path with the highest pronunciation score value among the plurality of paths corresponding to the word in the prediction network. The pronunciation score value of a path is the sum of the pronunciation score values of the arcs that the path includes. The optimal path includes a plurality of arcs, each arc corresponding to a character of the word, a pronunciation of the character, and the pronunciation score value of that pronunciation for the character. The first input values of the arcs on the optimal path combine into the word, and the first output values of the arcs on the optimal path combine into the pronunciation corresponding to the word.
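The path score and the optimal path described above can be written compactly. The symbols are introduced here for illustration and do not appear in the original text: $\pi$ is a path, $s(a)$ is the pronunciation score value of arc $a$, and $\Pi(w)$ is the set of paths in the prediction network whose input values spell the word $w$.

```latex
\mathrm{score}(\pi) = \sum_{a \in \pi} s(a),
\qquad
\pi^{*}(w) = \operatorname*{arg\,max}_{\pi \in \Pi(w)} \mathrm{score}(\pi)
```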

Each character of the word on the optimal path and the pronunciation corresponding to each character are input into a Gaussian mixture model (GMM) to obtain a second pronunciation score value for each character of the word. For example, if the word includes 5 characters, 5 second pronunciation score values are obtained, one for each character.

If the difference value between the first pronunciation score value and the second pronunciation score value of each character of the word is not greater than a preset threshold value, the prediction network is determined as the first network. The difference value is non-negative, i.e., it is the absolute difference: if the first pronunciation score value is less than or equal to the second pronunciation score value, the first is subtracted from the second; otherwise the second is subtracted from the first. Preferably, the preset threshold is 0, i.e., the first pronunciation score value and the second pronunciation score value of each character of the word are equal.

For example, if the characters of the word include 5 characters, the first pronunciation score value and the second pronunciation score value corresponding to the 5 characters are respectively compared, and if the difference between the first pronunciation score value and the second pronunciation score value corresponding to each of the 5 characters is not greater than a preset threshold, the prediction network is determined as the first network.

If the difference value between the first pronunciation score value and the second pronunciation score value of at least one character of the word is greater than the preset threshold value, the pronunciation score values of the arcs of the prediction network are updated according to the second pronunciation score values, and the step of obtaining the optimal path of the word according to the prediction network is executed again. That is, the pronunciation score values on the arcs included in the optimal path of the word in the prediction network are updated to the second pronunciation score values, and the optimal path is obtained again, until the difference between the first pronunciation score value and the second pronunciation score value of each character of the word is not greater than the preset threshold; the prediction network obtained at that point is the first network.

By repeatedly inputting each character on the optimal path of the word and the pronunciation corresponding to each character into the Gaussian mixture model to obtain second pronunciation score values, and updating the prediction network according to the second pronunciation score values, the prediction network is trained until it converges; the converged prediction network can then be used to predict pronunciations. The prediction network has converged when the pronunciation score values of its arcs change little between iterations, which can be determined by checking that, for every character of the word, the difference between the first pronunciation score value and the second pronunciation score value is not greater than the preset threshold value.
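The optimization loop above can be sketched as follows, under strong simplifying assumptions. The GMM is replaced by a hypothetical `rescore` callable that returns a second pronunciation score for a (character, pronunciation) pair; a real GMM would be re-estimated from the data rather than looked up. The candidate paths of a word are given explicitly, and all names and values are illustrative.

```python
from typing import Callable, Dict, List, Tuple

ArcKey = Tuple[str, str]  # (character, pronunciation)


def best_path(paths: List[List[ArcKey]], scores: Dict[ArcKey, float]) -> List[ArcKey]:
    """Optimal path: the candidate with the highest sum of arc scores."""
    return max(paths, key=lambda p: sum(scores[k] for k in p))


def optimize(
    paths: List[List[ArcKey]],
    scores: Dict[ArcKey, float],
    rescore: Callable[[ArcKey], float],  # stand-in for the GMM (assumption)
    threshold: float = 0.0,
    max_iters: int = 100,
) -> Dict[ArcKey, float]:
    scores = dict(scores)  # do not modify the caller's initial scores
    for _ in range(max_iters):
        path = best_path(paths, scores)
        second = {k: rescore(k) for k in path}  # second pronunciation scores
        # Converged when every character's absolute difference is within the threshold.
        if all(abs(scores[k] - second[k]) <= threshold for k in path):
            break
        for k in path:  # otherwise update the arcs on the optimal path
            scores[k] = second[k]
    return scores


# Two candidate paths for the toy word "al"; every arc starts at the same score.
paths = [[("a", "ai"), ("l", "le")], [("al", "ao")]]
initial = {("a", "ai"): -1.0, ("l", "le"): -1.0, ("al", "ao"): -1.0}
target = {("a", "ai"): -0.5, ("l", "le"): -0.5, ("al", "ao"): -2.0}
trained = optimize(paths, initial, rescore=lambda k: target[k])
```

In this toy run the loop first follows the single-arc path, lowers its score, switches to the two-arc path, and stops once the scores on the optimal path no longer change.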

Further, as shown in fig. 2, step 102, predicting the word to be tested through a pre-acquired first network, and acquiring the pronunciation of the word to be tested, includes:

step 1021, splitting the word to be detected to obtain a character set corresponding to the word to be detected, wherein the sample character set comprises the character set corresponding to the word to be detected;

step 1022, constructing a character network of the word to be detected, wherein a second input value and a second output value corresponding to each arc on the character network are characters in the character set;

step 1023, synthesizing the character network and the first network through an algorithm to obtain a second network, wherein an arc of the second network comprises a third input value, a third output value and a third pronunciation score value, the third input value is a character in a character set corresponding to the word to be detected, the third output value is a pronunciation of the third input value, and the third pronunciation score value is a likelihood corresponding to the third input value and the third output value;

step 1024, acquiring N optimal paths of the word to be detected according to the second network, wherein N is a positive integer;

step 1025, acquiring N pronunciations of the word to be detected according to the N optimal paths.

Specifically, when the word to be detected is predicted, it is first split according to the characters included in the sample character set. If a given character is not in the sample character set, or its probability is low, the word to be detected is not split into that character. For example, if the sample character set includes a, l, e, x, al, ale and le but does not include ex, then the word alex may be split into a/l/e/x, al/e/x, ale/x and a/le/x, but not into a/l/ex.
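The splitting rule in the alex example can be reproduced with a short recursive sketch. The function name `split_word` is hypothetical, and the low-probability filter mentioned above is omitted: a piece is allowed only if it is a member of the sample character set.

```python
from typing import List, Set


def split_word(word: str, char_set: Set[str]) -> List[List[str]]:
    """Enumerate every way to split `word` into pieces drawn from `char_set`."""
    if not word:
        return [[]]  # exactly one way to split the empty remainder
    results: List[List[str]] = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in char_set:  # only pieces in the sample character set are allowed
            for tail in split_word(word[end:], char_set):
                results.append([head] + tail)
    return results


sample_chars = {"a", "l", "e", "x", "al", "ale", "le"}
splits = split_word("alex", sample_chars)
```

For this character set, `splits` contains the four segmentations listed in the text, and a/l/ex is absent because ex is not in the set.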

A character network of the word to be detected is constructed according to the character set of the word to be detected. The second input value and the second output value of each arc on the character network are both characters in the character set, so that each path of the character network spells the word to be detected and each way of splitting the word corresponds to one path in the character network.
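A character network of this kind can be sketched by using the positions between letters as nodes, so that every path from node 0 to node len(word) is one way of splitting the word. The tuple layout and the function name `build_char_network` are assumptions made for illustration.

```python
from typing import List, Set, Tuple

CharArc = Tuple[int, int, str, str]  # (start, end, input character, output character)


def build_char_network(word: str, char_set: Set[str]) -> List[CharArc]:
    """Build an unweighted network whose arcs copy characters from input to output."""
    arcs: List[CharArc] = []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            piece = word[i:j]
            if piece in char_set:
                # Input and output are the same character; the arc has no weight.
                arcs.append((i, j, piece, piece))
    return arcs


sample_chars = {"a", "l", "e", "x", "al", "ale", "le"}
char_network = build_char_network("alex", sample_chars)
```

Arcs that cannot be extended to the final node are harmless in this sketch because they never lie on a complete path.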

The character network and the first network are synthesized through an algorithm to obtain a second network; specifically, they may be combined by a composition algorithm. An arc of the second network includes a third input value, a third output value, and a third pronunciation score value, where the third input value is a character in the character set corresponding to the word to be detected, the third output value is the pronunciation of the third input value, and the third pronunciation score value is the likelihood that the third input value corresponds to the third output value. That is, the arcs of the second network carry the pronunciations corresponding to the characters obtained by the multiple ways of splitting the word to be detected, together with the corresponding likelihoods.
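The composition step can be illustrated with a deliberately simplified sketch. It assumes the first network behaves like a single-state transducer, represented here as a dictionary from a character to its candidate (pronunciation, score) pairs; general weighted finite-state transducer composition also tracks pairs of states, which this sketch does not do. All names and scores are hypothetical.

```python
from typing import Dict, List, Tuple

CharArc = Tuple[int, int, str, str]           # (start, end, char, char)
ScoredArc = Tuple[int, int, str, str, float]  # (start, end, char, pron, score)


def compose(
    char_arcs: List[CharArc],
    first_network: Dict[str, List[Tuple[str, float]]],
) -> List[ScoredArc]:
    """Match each character arc's output with the first network's input."""
    second: List[ScoredArc] = []
    for start, end, char, out_char in char_arcs:
        # The character network is unweighted, so the score comes from the first network.
        for pron, score in first_network.get(out_char, []):
            second.append((start, end, char, pron, score))
    return second


# Character network for the toy word "al" and a toy first network.
char_arcs = [(0, 1, "a", "a"), (1, 2, "l", "l"), (0, 2, "al", "al")]
first_network = {
    "a": [("ai", -0.5), ("a", -1.0)],
    "l": [("le", -0.5)],
    "al": [("ao", -2.0)],
}
network_d = compose(char_arcs, first_network)
```

Each resulting arc keeps the positions from the character network and gains a pronunciation and score from the first network.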

By obtaining the N optimal paths of the second network, the N pronunciations corresponding to the word to be detected can be obtained. The N optimal paths are the N paths with the highest pronunciation score values among the paths corresponding to the word to be detected in the second network, i.e., the top N paths when the paths are ranked by pronunciation score value from highest to lowest. The pronunciation score value of a path is the sum of the pronunciation score values of the arcs that the path includes.

Each optimal path of the N optimal paths comprises a plurality of arcs, and each arc corresponds to the character of the word to be tested, the pronunciation of the character and the pronunciation score value of the corresponding pronunciation of the character. And on each optimal path of the N optimal paths, the third input values of the arcs are combined into words, and the third output values of the arcs on the optimal paths are combined into pronunciations corresponding to the words. According to the N optimal paths, N pronunciations of the word to be detected can be obtained, and each optimal path corresponds to one pronunciation of the word to be detected.
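Selecting the N optimal paths can be sketched as a dynamic program that keeps the N best partial paths at each node. The sketch assumes node identifiers are character positions, so every arc runs from a lower to a higher node and ascending order is a valid processing order. The function name `n_best` and the toy network are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

ScoredArc = Tuple[int, int, str, str, float]  # (start, end, char, pron, score)
Hypothesis = Tuple[float, List[str]]          # (path score, pronunciations so far)


def n_best(arcs: List[ScoredArc], final: int, n: int) -> List[Hypothesis]:
    """Return up to n highest-scoring paths from node 0 to node `final`."""
    outgoing: Dict[int, List[ScoredArc]] = {}
    for arc in arcs:
        outgoing.setdefault(arc[0], []).append(arc)
    hyps: Dict[int, List[Hypothesis]] = {0: [(0.0, [])]}
    for node in range(final):
        # All arcs into `node` come from earlier nodes, so its list is complete here.
        kept = heapq.nlargest(n, hyps.get(node, []), key=lambda h: h[0])
        for score, prons in kept:
            for _, end, _, pron, arc_score in outgoing.get(node, []):
                hyps.setdefault(end, []).append((score + arc_score, prons + [pron]))
    return heapq.nlargest(n, hyps.get(final, []), key=lambda h: h[0])


# Toy second network for the word "al" (illustrative scores only).
network_d = [
    (0, 1, "a", "ai", -0.5),
    (0, 1, "a", "a", -1.0),
    (1, 2, "l", "le", -0.5),
    (0, 2, "al", "ao", -2.0),
]
top_two = n_best(network_d, final=2, n=2)
```

Keeping only N hypotheses per node is sufficient because path scores are sums of arc scores, so a partial path outside the top N at a node cannot be part of a top-N complete path.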

In this embodiment, because the N optimal paths of the word to be predicted can be obtained according to the second network, N pronunciations can be obtained to adapt to more scenes, and the prediction efficiency is improved.

The above prediction method will be described below by exemplifying specific examples.

First, a pronunciation dictionary is obtained by manual labeling as training material. The pronunciation dictionary is used only for training, so at prediction time pronunciations can be predicted for words that are not in the pronunciation dictionary as well as for words that are. The method comprises two stages, training and prediction:

Training: two sets are established according to the pronunciation dictionary. Set A (i.e., the sample character set) stores character combinations, and set B (i.e., the sample pronunciation set) stores pronunciation combinations. There is a mapping from set A to set B, namely the probability that a character combination corresponds to a given pronunciation combination, and a GMM is used to model this distribution.

The Chinese pronunciation of English words is taken as an example. A lattice network (i.e., the prediction network) is constructed, and each arc comprises five pieces of information: a start point, an end point, an input, an output, and a score. The input is an element of set A, the output is an element of set B, and the score represents the probability of mapping the input to the output. Each path of the network represents a word and a corresponding pronunciation, and the sum of the scores on the arcs of the path represents the likelihood that the word has that pronunciation.

The lattice network is optimized using the expectation-maximization (EM) algorithm:

The first step: the optimal path of each word is calculated through the lattice network, and the pronunciation sequence corresponding to the optimal path is used as the pronunciation combination of the word. The pronunciation sequences of all words are collected and used as observations to update the parameters of the GMM; that is, the characters on the optimal path and their pronunciations are input into the GMM, and the pronunciation score values of the characters are obtained from the GMM.

The second step: after the new GMM is obtained, a new lattice network is constructed from it, and the first step and the second step are repeated until convergence. The convergence condition is that the pronunciation score values of the characters on the optimal path become stable, with little or no change. At this point, a lattice network L is constructed from set A, set B and the corresponding GMM model parameters (i.e., the pronunciation score values) and is used to predict pronunciations.

Prediction: given a word (which may not be in the pronunciation dictionary), the word is split into all possible combinations according to set A; for example, alex may be split into a/l/e/x, al/e/x, ale/x, and so on. The specific splitting depends on the elements of set A: the word is not split into a character that is not in set A or whose probability of occurrence is very low. For example, if the sample character set includes a, l, e, x, al, ale and le but does not include ex, then alex may be split into a/l/e/x, al/e/x, ale/x and a/le/x, but not into a/l/ex. The characters obtained by splitting are used to construct a character network C for alex; the input and the output of each arc of the network are characters, and the arcs carry no weights.

The character network C and the trained network L are combined into a new network D through a composition algorithm. The network D expresses the pronunciation combinations corresponding to the various ways of splitting alex, together with their likelihoods. The optimal path in the network D is calculated using Dijkstra's algorithm, and the pronunciation sequence corresponding to that path is the Chinese pronunciation of the word.
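The single-best search over network D can be sketched with Dijkstra's algorithm. Dijkstra's algorithm requires non-negative edge costs, so this sketch assumes the arc scores are log-probabilities (at most 0) and uses their negation as the cost; that scoring convention, the function name, and the toy network are assumptions rather than details from the original text.

```python
import heapq
from typing import Dict, List, Optional, Tuple

ScoredArc = Tuple[int, int, str, str, float]  # (start, end, char, pron, score)


def dijkstra_best_path(
    arcs: List[ScoredArc], source: int, target: int
) -> Optional[Tuple[float, List[str]]]:
    """Return (best path score, pronunciation sequence), or None if unreachable."""
    outgoing: Dict[int, List[Tuple[int, str, float]]] = {}
    for start, end, _, pron, score in arcs:
        if score > 0:
            raise ValueError("scores must be <= 0 so that costs are non-negative")
        outgoing.setdefault(start, []).append((end, pron, -score))
    best: Dict[int, float] = {source: 0.0}    # lowest known cost per node
    back: Dict[int, Tuple[int, str]] = {}     # node -> (previous node, pron)
    queue: List[Tuple[float, int]] = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        if node == target:
            break
        for end, pron, arc_cost in outgoing.get(node, []):
            new_cost = cost + arc_cost
            if new_cost < best.get(end, float("inf")):
                best[end] = new_cost
                back[end] = (node, pron)
                heapq.heappush(queue, (new_cost, end))
    if target not in best:
        return None
    prons: List[str] = []
    node = target
    while node != source:  # walk the back-pointers to recover the pronunciations
        node, pron = back[node]
        prons.append(pron)
    prons.reverse()
    return -best[target], prons


# Toy network D for the word "al" (illustrative scores only).
network_d = [
    (0, 1, "a", "ai", -0.5),
    (0, 1, "a", "a", -1.0),
    (1, 2, "l", "le", -0.5),
    (0, 2, "al", "ao", -2.0),
]
best = dijkstra_best_path(network_d, source=0, target=2)
```

Because network D is acyclic, a simple dynamic program over positions would also work; Dijkstra's algorithm is shown because it is the method the text names.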

If the pronunciation of the word to be detected is ambiguous and a plurality of Chinese pronunciations are required, the N optimal paths (i.e., the N-best paths) are calculated from the network D, and the N-best results are used as the N Chinese pronunciations of the word.

Compared with manual labeling, the prediction method in this embodiment is more objective and its results are more stable. It can predict the pronunciation of any word: even for a rare word that is difficult to label manually, a pronunciation can still be obtained. It can also give several of the most likely pronunciations, which is very useful in command word recognition.

Referring to fig. 3, fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the electronic device 300 includes:

an obtaining module 301, configured to obtain a word to be detected;

the prediction module 302 is configured to predict the word to be detected through a pre-acquired first network, and acquire a pronunciation of the word to be detected;

Further, the acquiring process of the first network includes:

constructing a prediction network according to a sample character set and a sample pronunciation set, wherein an arc of the prediction network comprises a first input value, a first output value and a pronunciation score value, the first input value is a character in the sample character set, the first output value is a pronunciation in the sample pronunciation set, and the pronunciation score value is a likelihood corresponding to the first input value and the first output value;

and optimizing the prediction network to obtain a first network.

Further, as shown in FIG. 4, the prediction module 302 includes:

the splitting submodule 3021 is configured to split the word to be detected, to obtain a character set corresponding to the word to be detected, where the sample character set includes the character set corresponding to the word to be detected;

a constructing submodule 3022, configured to construct a character network of the word to be detected, where a second input value and a second output value corresponding to each arc on the character network are both characters in the character set;

a synthesis submodule 3023, configured to synthesize the character network and the first network through an algorithm to obtain a second network, where an arc of the second network includes a third input value, a third output value, and a third pronunciation score value, where the third input value is a character in a character set corresponding to the word to be tested, the third output value is a pronunciation of the third input value, and the third pronunciation score value is a likelihood that the third input value corresponds to the third output value;

a path obtaining sub-module 3024, configured to obtain N optimal paths of the word to be detected according to the second network, where N is a positive integer;

and the pronunciation obtaining submodule 3025 is configured to obtain N pronunciations of the word to be detected according to the N optimal paths.

The electronic device 300 can implement each process implemented by the electronic device in the embodiment of the method in fig. 1, and is not described herein again to avoid repetition.

The electronic device 300 of the embodiment of the invention acquires the word to be detected and predicts the word to be detected through a pre-acquired first network to acquire the pronunciation of the word to be detected, wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words. Because the characters determined from the pronunciation dictionary can be combined into more words than were labeled manually, the words the first network can predict are not limited to the manually labeled words. The number of words whose pronunciation can be predicted is therefore increased, the labor cost is reduced, and the prediction efficiency is improved.

Fig. 5 is a schematic diagram of a hardware structure of an electronic device for implementing various embodiments of the present invention, and as shown in fig. 5, the electronic device 400 includes, but is not limited to: radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, processor 410, and power supply 411. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 5 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.

The processor 410 is configured to obtain a word to be detected;

Further, the acquiring process of the first network includes:

and optimizing the prediction network to obtain a first network.

Further, the processor 410 is further configured to split the word to be detected, to obtain a character set corresponding to the word to be detected, where the sample character set includes the character set corresponding to the word to be detected;

constructing a character network of the word to be detected, wherein a second input value and a second output value corresponding to each arc on the character network are characters in the character set;

synthesizing the character network and the first network through an algorithm to obtain a second network, wherein each arc of the second network comprises a third input value, a third output value and a third pronunciation score value, the third input value is a character in the character set corresponding to the word to be detected, the third output value is a pronunciation of the third input value, and the third pronunciation score value is the likelihood that the third input value corresponds to the third output value;

acquiring N optimal paths of the word to be detected according to the second network, wherein N is a positive integer;

and acquiring N pronunciations of the word to be detected according to the N optimal paths.
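The last two steps, obtaining N optimal paths from the second network and reading N pronunciations off them, can be sketched as a best-first search. This is only an illustration under stated assumptions: the second network is given as an acyclic list of arcs, scores are likelihoods in (0, 1], and a path score is the product of its arc scores. The function `n_best_pronunciations`, the tuple layout `SecondArc`, and the toy lattice are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Tuple

# (source state, destination state, pronunciation, likelihood score)
SecondArc = Tuple[int, int, str, float]


def n_best_pronunciations(arcs: List[SecondArc], start: int, final: int,
                          n: int) -> List[Tuple[str, float]]:
    """Return up to n (pronunciation, path likelihood) pairs, best first.

    Likelihoods are turned into non-negative costs with -log, so a best-first
    search pops complete paths in order of decreasing likelihood.
    Assumes an acyclic network with strictly positive scores.
    """
    outgoing: Dict[int, List[SecondArc]] = {}
    for arc in arcs:
        outgoing.setdefault(arc[0], []).append(arc)

    # Heap entries: (cost so far, tie-breaker, state, pronunciations so far).
    heap: List[Tuple[float, int, int, Tuple[str, ...]]] = [(0.0, 0, start, ())]
    counter = 1
    results: List[Tuple[str, float]] = []
    while heap and len(results) < n:
        cost, _, state, prons = heapq.heappop(heap)
        if state == final:
            results.append((" ".join(prons), math.exp(-cost)))
            continue
        for _, dst, pron, score in outgoing.get(state, []):
            heapq.heappush(heap, (cost - math.log(score), counter, dst, prons + (pron,)))
            counter += 1
    return results


if __name__ == "__main__":
    # Toy second network for a three-character word (made-up scores).
    lattice = [
        (0, 1, "b", 0.9), (0, 1, "p", 0.1),
        (1, 2, "ao", 0.6), (1, 2, "ou", 0.4),
        (2, 3, "k s", 1.0),
    ]
    for pronunciation, likelihood in n_best_pronunciations(lattice, 0, 3, 2):
        print(pronunciation, round(likelihood, 4))
```

Working in the negative-log domain keeps every arc cost non-negative, which is what allows the search to stop as soon as N complete paths have been popped.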

The electronic device 400 can implement the processes implemented by the electronic device in the foregoing embodiments, and in order to avoid repetition, the detailed description is omitted here.

The electronic device 400 of the embodiment of the present invention acquires the word to be detected and predicts the word to be detected through a pre-acquired first network to obtain the pronunciation of the word to be detected, wherein the first network is determined from a pronunciation dictionary, the pronunciation dictionary including words and pronunciations corresponding to the words. Because the number of words that can be formed from the characters determined according to the pronunciation dictionary is larger than the number of manually labeled words, the words predictable by the first network are not limited to the manually labeled words. As a result, the number of words whose pronunciation can be predicted is increased, the labor cost is reduced, and the prediction efficiency is improved.

It should be understood that, in the embodiment of the present invention, the radio frequency unit 401 may be used for receiving and sending signals during a message sending and receiving process or a call process. Specifically, it receives downlink data from a base station and forwards the data to the processor 410 for processing; in addition, it transmits uplink data to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. Further, the radio frequency unit 401 can also communicate with a network and other devices through a wireless communication system.

The electronic device provides wireless broadband internet access to the user via the network module 402, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.

The audio output unit 403 may convert audio data received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into an audio signal and output as sound. Also, the audio output unit 403 may also provide audio output related to a specific function performed by the electronic apparatus 400 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 403 includes a speaker, a buzzer, a receiver, and the like.

The input unit 404 is used to receive audio or video signals. The input unit 404 may include a graphics processing unit (GPU) 4041 and a microphone 4042. The graphics processor 4041 processes image data of a still picture or video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 406. The image frames processed by the graphics processor 4041 may be stored in the memory 409 (or other storage medium) or transmitted via the radio frequency unit 401 or the network module 402. The microphone 4042 may receive sound and process it into audio data. In a phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 401, and then output.

The electronic device 400 also includes at least one sensor 405, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 4061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 4061 and/or the backlight when the electronic apparatus 400 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 405 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be described in detail herein.

The display unit 406 is used to display information input by the user or information provided to the user. The display unit 406 may include a display panel 4061, and the display panel 4061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 407 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. The touch panel 4071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near the touch panel 4071 using a finger, a stylus, or any suitable object or attachment). The touch panel 4071 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 410, and receives and executes commands sent by the processor 410. In addition, the touch panel 4071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 4071, the user input unit 407 may include other input devices 4072. Specifically, the other input devices 4072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a track ball, a mouse, and a joystick, which are not described herein again.

Further, the touch panel 4071 can be overlaid on the display panel 4061, and when the touch panel 4071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 410 to determine the type of the touch event, and then the processor 410 provides a corresponding visual output on the display panel 4061 according to the type of the touch event. Although in fig. 5, the touch panel 4071 and the display panel 4061 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 4071 and the display panel 4061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.

The interface unit 408 is an interface for connecting an external device to the electronic apparatus 400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 408 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 400 or may be used to transmit data between the electronic apparatus 400 and an external device.

The memory 409 may be used to store software programs as well as various data. The memory 409 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 409 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device.

The processor 410 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 409 and calling data stored in the memory 409, thereby performing overall monitoring of the electronic device. Processor 410 may include one or more processing units; preferably, the processor 410 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.

The electronic device 400 may further include a power supply 411 (e.g., a battery) for supplying power to various components, and preferably, the power supply 411 may be logically connected to the processor 410 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.

In addition, the electronic device 400 includes some functional modules that are not shown, and are not described in detail herein.

Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 410, a memory 409, and a computer program that is stored in the memory 409 and can be run on the processor 410, and when being executed by the processor 410, the computer program implements each process of the foregoing pronunciation prediction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not described here again.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the embodiment of the pronunciation prediction method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A pronunciation prediction method applied to electronic equipment is characterized by comprising the following steps:

acquiring a word to be detected;

2. The method of claim 1, wherein the acquiring of the first network comprises:

and optimizing the prediction network to obtain a first network.

3. The method of claim 2, wherein before constructing the predictive network from the sample character set and the sample pronunciation set, the obtaining of the first network further comprises:

4. The method of claim 2, wherein optimizing the predicted network to obtain a first network comprises:

5. The method according to claim 2, wherein the predicting the word to be tested through the pre-obtained first network to obtain the pronunciation of the word to be tested comprises:

splitting the word to be detected to obtain a character set corresponding to the word to be detected, wherein the sample character set comprises the character set corresponding to the word to be detected;

6. An electronic device, comprising:

the acquisition module is used for acquiring the word to be detected;

7. The electronic device of claim 6, wherein the obtaining of the first network comprises:

and optimizing the prediction network to obtain a first network.

8. The electronic device of claim 7, wherein prior to constructing the predictive network from the sample character set and the sample pronunciation set, the obtaining of the first network further comprises:

9. The electronic device of claim 7, wherein optimizing the predicted network to obtain a first network comprises:

10. The electronic device of claim 7, wherein the prediction module comprises:

the splitting submodule is used for splitting the word to be detected to obtain a character set corresponding to the word to be detected, and the sample character set comprises the character set corresponding to the word to be detected;

the building submodule is used for building a character network of the word to be tested, and a second input value and a second output value corresponding to each arc on the character network are characters in the character set;

a synthesis submodule, configured to synthesize the character network and the first network through an algorithm to obtain a second network, where an arc of the second network includes a third input value, a third output value, and a third pronunciation score value, the third input value is a character in a character set corresponding to the word to be tested, the third output value is a pronunciation of the third input value, and the third pronunciation score value is a likelihood that the third input value corresponds to the third output value;

the path obtaining submodule is used for obtaining N optimal paths of the word to be detected according to the second network, wherein N is a positive integer;

and the pronunciation acquisition submodule is used for acquiring N pronunciations of the word to be detected according to the N optimal paths.

11. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the pronunciation prediction method as claimed in any one of claims 1 to 5.

12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the pronunciation prediction method as claimed in any one of claims 1 to 5.