CN113157852A - Voice processing method, system, electronic equipment and storage medium


Info

Publication number: CN113157852A
Application number: CN202110452219.5A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: entity, phonetic notation, information, vocabulary, phonetic
Inventor: 黄日星
Applicant and current assignee: Shenzhen Ubtech Technology Co., Ltd.
Priority: CN202110452219.5A

Classifications

    • G06F (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING), in particular:
    • G06F16/322 Information retrieval of unstructured textual data; indexing structures; trees
    • G06F16/3343 Query processing; query execution using phonetics
    • G06F40/242 Handling natural language data; lexical tools; dictionaries
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Machine Translation (AREA)

Abstract

The application belongs to the technical field of speech recognition and provides a speech processing method, apparatus, electronic device, and storage medium. The method comprises the following steps: acquiring first text information, the first text information being a recognition result of speech information; extracting the phonetic notation feature of each vocabulary in the first text information, the phonetic notation feature being a feature of the vocabulary's phonetic notation information; if the phonetic notation feature of any vocabulary in the first text information is determined to be in the entity dictionary, taking that vocabulary as a target vocabulary and its phonetic notation feature as the target phonetic notation feature; taking each entity in the entity dictionary whose phonetic notation feature is the same as the target phonetic notation feature as a candidate entity; and selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information. The method detects and corrects entities in the speech recognition result and improves the accuracy of that result.

Description

Voice processing method, system, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, a system, an electronic device, and a storage medium for speech processing.
Background
In speech recognition technology, entity recognition is an important branch. An entity is also called a named entity, and refers to an entity with a specific meaning, such as a proper noun like a name of a person, a name of an organization, a name of a place, and a meaningful time.
Because entity words appear only rarely in the training data of a speech recognition model, the model's recognition rate for entity words is low. The entities in a speech recognition result therefore need to be checked and corrected, so as to solve the problem of inaccurate entity recognition in speech processing.
Disclosure of Invention
Embodiments of the present application provide a method, a system, an electronic device, and a storage medium for speech processing, which can solve at least part of the above problems.
In a first aspect, an embodiment of the present application provides a method for speech processing, including:
acquiring first text information, wherein the first text information is a recognition result of voice information;
extracting phonetic notation characteristics of each vocabulary in the first text information, wherein the phonetic notation characteristics are characteristics of vocabulary phonetic notation information;
if the phonetic notation feature of any vocabulary in the first text information is determined to be in the entity dictionary, taking the vocabulary as a target vocabulary, wherein the phonetic notation feature of the target vocabulary is a target phonetic notation feature;
taking an entity with phonetic notation characteristics identical to the target phonetic notation characteristics in the entity dictionary as a candidate entity;
and selecting a candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information.
It should be understood that, by extracting the phonetic notation features of words in the speech recognition result, querying the entity dictionary to determine the target vocabulary and the candidate entities, and selecting the candidate entity with the highest matching degree to replace the target vocabulary, the present application checks and corrects the entities in the speech recognition result and improves its accuracy.
In a second aspect, an embodiment of the present application provides an apparatus for speech processing, including:
the first text information acquisition module is used for acquiring first text information, wherein the first text information is a recognition result of the voice information;
the phonetic notation feature extraction module is used for extracting phonetic notation features of all vocabularies in the first text information, wherein the phonetic notation features are the features of the vocabulary phonetic notation information;
the target vocabulary determining module is used for taking any vocabulary in the first text information as a target vocabulary if the phonetic notation characteristics of the vocabulary are determined to be in the entity dictionary, and the phonetic notation characteristics of the target vocabulary are target phonetic notation characteristics;
the candidate entity determining module is used for taking an entity with the phonetic notation characteristics same as the target phonetic notation characteristics in the entity dictionary as a candidate entity;
and the target vocabulary replacement module is used for selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information.
In a third aspect, an embodiment of the present application provides a speech processing system, where the speech processing system is configured to implement the steps of the method in the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, implementing the speech processing system of the third aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, including: the computer readable storage medium stores a computer program which, when executed by a processor, performs the method steps of the first aspect described above.
In a sixth aspect, embodiments of the present application provide a computer program product, which, when run on an electronic device, causes the electronic device to perform the method steps of the first aspect.
It is understood that the beneficial effects of the second to sixth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments or the prior-art description are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a speech processing system according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method of speech processing according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for generating an entity dictionary according to another embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for generating an entity dictionary according to another embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for generating an entity dictionary according to another embodiment of the present application;
FIG. 6 is a flow chart illustrating a method of speech processing according to another embodiment of the present application;
FIG. 7 is a flow chart illustrating a method of speech processing according to another embodiment of the present application;
FIG. 8 is a flow chart illustrating a method of speech processing according to another embodiment of the present application;
FIG. 9a is a schematic diagram of calculating a degree of matching according to an embodiment of the present application;
FIG. 9b is a schematic diagram of calculating a degree of matching according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a method of speech processing according to another embodiment of the present application;
FIG. 11 is a flow chart illustrating a method of speech processing according to another embodiment of the present application;
FIG. 12 is a schematic structural diagram of an apparatus for speech processing according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Before explaining the method of speech processing provided by the embodiment of the present application, for convenience of understanding of the embodiment of the present application, the principle of the method of speech processing provided by the embodiment of the present application and related concepts involved in the embodiment of the present application are explained below.
An entity, also called a named entity (NE), is a term referring to an object with a specific meaning, such as a proper noun: a person's name, an organization name, a place name, or a meaningful time.
A dictionary, in the ordinary sense, is a reference book used to explain the meaning, concept, and usage of words. A digitized dictionary can be regarded as a collection of correspondences between words and their phonetic notations, paraphrases, or examples. In the field of information technology, a dictionary can be understood as a mutable container model; in some embodiments of the present application, its elements form key-value correspondences. A key-value correspondence may be one-to-one or one-to-many; that is, one key corresponds to one value or to a plurality of values.
An entity dictionary is a dictionary of entities. In a computer implementation, it may be a dictionary in which the entity is the key and the entity's phonetic notation and paraphrase are the value; a dictionary in which the phonetic notation is the key and the entity is the value; or a dictionary in which the phonetic notation feature is the key and the entity is the value. In some embodiments, one phonetic notation feature key may correspond to a plurality of entity values.
A general dictionary is a dictionary containing words of various parts of speech, including verbs, nouns, adjectives, and so on. In a computer implementation, it may be a dictionary in which the word is the key and the word's phonetic notation and paraphrase are the value, or a dictionary in which the phonetic notation is the key and the word is the value.
A corpus is a large-scale electronic text library, scientifically sampled and processed, that stores language material actually occurring in real language use.
An entity corpus is a corpus in which entities are stored.
Text information is a carrier of text, such as codes, characters, or images. In data processed by a computer, text information is usually expressed in encoded form, such as ASCII.
ZhuYin information is a carrier of the phonetic notation of text, such as a phonetic code, phonetic symbols, or an image.
Feature extraction is a method of transforming a set of measurements of a pattern to emphasize its representative features. In the embodiments of the present application, extracting the phonetic notation features of each word in text information can be understood as extracting the representative features of the text's phonetic notation information, i.e., what it has in common with the phonetic notation information of similarly pronounced words.
A phonetic notation feature is a feature of phonetic notation information, obtained by performing a feature extraction operation on it, that is shared with the phonetic notation information of similarly pronounced words.
Speech information is a carrier of human speech. In the embodiments of the present application, it may be a speech signal recorded in real time, or an analog or digital speech signal stored in a storage medium.
Speech information recognition is the process of converting speech information into computer-processable data using a speech recognition model.
A speech recognition model is an algorithmic model that finds the most likely sequence of words given a speech input.
The matching degree is the degree of similarity between two comparison objects; normally, a higher matching degree indicates higher similarity.
The comprehensive matching degree is the result of applying a preset operation to the matching degrees of multiple groups of comparison objects. In some embodiments, the preset operation may be a weighted sum or the like.
An index refers to a storage structure that orders data.
A dictionary tree, also called a Trie or word-lookup tree, is a tree-structured index. Tries are commonly used for character lookup. In the embodiments of the present application, the Trie is used to look up phonetic notation feature information; that is, each node is a sound unit of a phonetic notation feature.
A sound unit is a basic unit of phonetic notation information. Taking the phonetic notation of Chinese as an example, it comprises at least one of an initial, a final, or a whole syllable.
Fuzzy sound units are sound units that are easily confused because their pronunciations are the same or similar, for instance because the same meaning is pronounced differently in different dialects. In some embodiments of the present application, the fuzzy sound units are predetermined sound units that are easily confused across dialects or pronunciations; see Table 1 for details.
A normalized sound unit is a predetermined sound unit to which fuzzy sound units are mapped; one normalized sound unit may correspond to a plurality of fuzzy sound units.
The normalization operation converts the fuzzy sound units in phonetic notation information into their corresponding normalized sound units according to a preset rule.
Word segmentation means segmenting the words in text information; as a non-limiting example, boundary markers such as spaces, slashes, or bars are added between words.
In speech information recognition, the corpus typically used to train speech recognition models contains few entities, so the model's recognition rate for entities is lower than for non-entity words. For example, the entity word "Shaoguan City" rarely appears in the training corpus, so a trained speech recognition model has difficulty recognizing it. By contrast, non-entity words such as "speech," "recognition," and "Chinese," and other ordinary words, appear throughout the training corpus, so the trained model recognizes them at a high rate.
The embodiments of the present application provide a method for correcting erroneous entities in a speech recognition result after speech recognition. In some embodiments, a speech processing system implementing the method is provided. In some embodiments, an entity error correction unit may be added to the speech processing system to correct errors in the recognition result. The method is based on the applicant's following observations about speech information recognition:
First, spoken Chinese rarely swallows or slurs syllables, so it can be assumed that the number of characters output by speech recognition equals the number of characters actually spoken.
Second, the pronunciation of the correct entity is similar to that of its misrecognized counterpart, so the erroneous word produced by speech recognition sounds close to the correct entity and cannot be too far from it.
Third, it can be assumed that English, numbers, symbols, and the like are all recognized correctly; this observation can be used to limit the processing range.
Based on the above observations, the present application provides a speech processing method applied in a speech processing system. Per the third observation, the English, numbers, and symbols in a speech recognition result can be considered correct, so only the remaining information, for example only the Chinese-character information, needs to be processed. Per the first and second observations, an erroneous entity and its correct counterpart should have the same number of characters and similar pronunciations, so they share a common part: the same phonetic notation feature. Extracting phonetic notation features from phonetic notation information therefore makes it easy to look up similarly pronounced words. A word can be flagged as a possibly erroneous target vocabulary by querying whether its phonetic notation feature is in the entity dictionary, and the entities in the entity dictionary with the same phonetic notation feature serve as candidate entities. Per the second observation, the candidate entity with the highest matching degree with the target vocabulary is used as the replacement entity; per the first observation, the replacement entity has the same number of characters as the target vocabulary it replaces. A minimal sketch of this flow is given below.
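To make the flow concrete, the following Python sketch strings the steps together. It is a minimal illustration under stated assumptions: the helpers segment(), extract_feature(), entity_dict, and match_score() are hypothetical placeholders introduced here, not names from the patent; the embodiments below refine each of them.

```python
# Minimal sketch of the overall error-correction flow (S110-S150). All helper
# names (segment, extract_feature, match_score) are illustrative placeholders,
# not identifiers defined by the patent.

def correct_entities(first_text, entity_dict, segment, extract_feature, match_score):
    """Detect and replace likely-misrecognized entities in a recognition result."""
    corrected = first_text
    for word in segment(first_text):           # segment the recognition result
        feature = extract_feature(word)        # S120: phonetic notation feature
        candidates = entity_dict.get(feature)  # S130/S140: same-feature entities
        if not candidates:
            continue                           # feature absent: not a target word
        best = max(candidates, key=lambda e: match_score(word, e))  # S150
        corrected = corrected.replace(word, best)
    return corrected
```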
Embodiments of the speech processing method provided in the present application are described below with reference to the drawings.
Fig. 1 illustrates a speech processing system 10 according to an embodiment of the present application. In some embodiments, the speech processing system 10 includes an entity error correction unit 110.
The entity error correction unit 110 is configured to search for an error entity in the first text information by using the speech processing method provided in the embodiment of the present application, correct the error entity, and output corrected text information after error correction.
In some embodiments, the speech processing system 10 also includes a speech recognition unit 120.
The speech recognition unit 120 is configured to convert the speech information into first text information. The first text information is text information that may contain an erroneous entity.
In some embodiments, the speech processing system 10 further includes an entity dictionary generation unit 130.
The entity dictionary generating unit 130 is configured to generate an entity dictionary. In some embodiments, the entity dictionary includes a correspondence between an entity and a phonetic notation feature of the entity.
In some embodiments, the speech recognition unit 120 and the entity error correction unit 110 may also be two different functional modules in the same computing device.
As a non-limiting example, the functional units of the speech processing system 10, such as the entity error correction unit 110, the speech recognition unit 120, and the entity dictionary generation unit 130, may be independent hardware entities, coupled via a bus or a storage medium, or via a wired or wireless network connection.
In some embodiments, the functional units of the speech processing system 10 may also be virtual functional units running on hardware with computing capability, coupled through virtual channels.
In some embodiments, the functional units of the speech processing system 10 may be partly independent hardware entities and partly virtual functional units: the hardware units are coupled through a bus, a storage medium, or a wired or wireless network connection, and the virtual units are coupled through virtual channels.
As a non-limiting example, the speech recognition unit 120 may be a separate speech recognition device and the entity error correction unit 110 a separate entity error correction device. For example, the speech recognition unit 120 may be a computing device that includes a speech input module and a text output module, and the entity error correction unit 110 a computing device that includes a text input module and a text output module. The two units may communicate over a short distance, wired or wireless, or remotely via a network.
Fig. 2 illustrates a method of speech processing provided by an embodiment of the present application, which is applied to the speech processing system 10 illustrated in fig. 1, and in some embodiments, may be implemented by software and/or hardware of the entity error correction unit 110 of the speech processing system 10.
As shown in fig. 2, the method includes steps S110 to S150. The specific realization principle of each step is as follows:
s110, acquiring first text information, wherein the first text information is a recognition result of the voice information.
In some embodiments, the first text information may be a real-time recognition result of the voice recognition unit 120 for the voice information; or may be a recognition result of the voice information by the voice recognition unit 120 stored in a storage medium.
As a non-limiting example, the speech recognition unit 120 of the speech processing system 10 acquires the user's speech information in real time and converts the speech information into first text information, which the entity error correction unit 110 acquires by means of wired or wireless communication.
As a non-limiting example, the speech recognition unit 120 of the speech processing system 10 acquires the speech information of the user, converts the speech information into the first text information, and stores the first text information in the storage medium. The entity error correction unit 110 acquires the first text information stored in the storage medium by means of wired or wireless communication.
Wherein the first text information is text information that may contain an erroneously identified entity.
And S120, extracting the phonetic notation characteristics of each vocabulary in the first text information, wherein the phonetic notation characteristics are the characteristics of the vocabulary phonetic notation information.
In some embodiments, the entity error correction unit 110 of the speech processing system 10 converts the first text information into ZhuYin information.
As a non-limiting example, the entity correction unit 110 converts the first text information into the ZhuYin information by referring to the text-ZhuYin look-up table.
It is understood that if the text in the text information is Chinese, the ZhuYin information is the corresponding Chinese Pinyin or ZhuYin symbol.
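As one way to realize this conversion in code, the sketch below uses the third-party pypinyin package as a stand-in for the text-ZhuYin look-up table; the package choice and the sample sentence are assumptions for illustration, not part of the patent.

```python
# Text -> ZhuYin (pinyin) conversion sketch. pypinyin is an assumed stand-in
# for the text-ZhuYin look-up table the embodiment consults.
from pypinyin import lazy_pinyin  # pip install pypinyin

first_text = "韶关市被评为优秀旅游城市"  # "Shaoguan City was rated an excellent tourist city"
zhuyin_info = lazy_pinyin(first_text)    # tone-free syllables
print(" ".join(zhuyin_info))             # shao guan shi bei ping wei you xiu lv you cheng shi
```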
In some embodiments, based on the second observation above, the phonetic notation features of the words in the first text information may be extracted using a preset mapping rule: similar sound units are uniformly mapped to the same sound unit. That is, the phonetic notation features of each vocabulary in the first text information are extracted through the normalization operation.
In some embodiments, a pre-trained neural network is employed to extract the features of the ZhuYin information. The input of the network is phonetic notation information and the output is phonetic notation features; the training samples are a set containing multiple groups of ZhuYin information and their corresponding ZhuYin features. The network may be a classification network or a deep learning network; when implementing the embodiments of the present application, the best-performing existing network can be selected through limited tests.
And S130, if the phonetic notation feature of any vocabulary in the first text information is determined to be in the entity dictionary, taking the vocabulary as a target vocabulary, wherein the phonetic notation feature of the target vocabulary is the target phonetic notation feature.
The target vocabulary is a vocabulary in the first text information that the speech processing method provided by the embodiments of the present application determines may be an erroneous entity.
It is to be understood that the first text information may contain one or more sentences and thus one or more misrecognized entities; the target vocabulary should therefore not be understood as a single erroneous word. Rather, any possibly erroneous word in the first text information can be a target vocabulary.
Due to the limited accuracy of the speech recognition unit 120, the target vocabulary may appear in the first text information as either an entity or a non-entity word.
For example, the first text information is "The Heaven is a place where ancient emperors performed sacrifices," where "Heaven" is a misrecognized word that should be "Temple of Heaven"; here the target vocabulary is an entity, but the wrong entity.
For another example, the first text information is "'Few matters' was rated an excellent tourist city," where "few matters" (ZhuYin "shao guan shi") is a misrecognized word that should be "Shaoguan City" (also "shao guan shi"); here the misrecognized target vocabulary is a non-entity word.
In some embodiments, the words in the first text information are segmented into one or more combinations of words. The phonetic notation feature of each segmented word is extracted, and the entity dictionary is searched for an entity with the same phonetic notation feature. If such an entity exists in the entity dictionary, the word is, with high probability, an entity, and is taken as the target vocabulary.
In some embodiments, the first text information may be segmented in multiple ways; the phonetic notation features of the words produced by each segmentation are extracted and queried against the entity dictionary.
As a non-limiting example, the sentence "Xiaoming saw the flowers and grass on the lake bank; an unknown little flower attracted his attention" can be segmented in different ways. Words such as "lake bank," "flowers and grass," and "unknown" can be delimited differently, for example:
1. Xiaoming / saw / lake bank / on / flowers / grass , one / unknown / little flower / attracted / his / attention .
2. Xiaoming / saw / lake / bank / on / flower / grass , one / un / known / flower / attracted / his / attention .
3. Xiaoming / saw / lakeshore / on / flower / grass , one / unknown / floret / attracted / his / attention .
In some embodiments, the phonetic notation features of each entity in the entity dictionary are extracted and compared with the phonetic notation features of the vocabulary to be compared; if they are the same, the vocabulary is a target vocabulary. If the entity dictionary contains only entities, or only entities and their paraphrases, the phonetic notation information of each entity can first be obtained and its features extracted before comparison. If the entity dictionary contains the entities' phonetic notation information, the phonetic notation features can be extracted from it directly for comparison. If the entity dictionary contains the entities' phonetic notation features, whether the feature of the vocabulary to be compared exists can be queried directly.
It is understood that the words included in the entity dictionary are entities; if a corresponding entity can be found in the entity dictionary according to a word's phonetic notation, the word is, with high probability, an entity.
And S140, taking the entity with the phonetic notation characteristics same as the target phonetic notation characteristics in the entity dictionary as a candidate entity.
In some embodiments, the entity error correction unit 110 of the speech processing system 10 traverses each entity in the entity dictionary, compares each entity's ZhuYin feature with the target ZhuYin feature, and takes the entities whose ZhuYin feature equals the target ZhuYin feature as candidate entities.
In some embodiments, the entities and their phonetic notation features are stored in the entity dictionary in one-to-one correspondence, and candidates are determined by querying, entry by entry, whether a phonetic notation feature identical to the feature being compared exists.
In some embodiments, the entity dictionary stores the correspondence between phonetic notation features and entities, with one or more entities per phonetic notation feature entry. The dictionary is searched for the phonetic notation feature identical to the target phonetic notation feature, and the entities corresponding to that feature are taken as candidate entities.
As a non-limiting example, the index information in the entity dictionary is phonetic notation features, each corresponding to one or more entities; one possible computer implementation uses the phonetic notation feature as key and the entities as value. For instance, the entity values for the key "nanjin" may include "Nanjing," "blue whale," "blue crystal," and "blue scene," words whose pronunciations all normalize to "nanjin." If the first text information is "'Nanjin' is the ancient capital of six dynasties," the phonetic notation feature of the misrecognized word is "nanjin"; by querying the entity dictionary indexed by phonetic notation features, the same feature and its corresponding entities can be found quickly, and these entities are taken as candidate entities.
It will be appreciated that, since all entities are included in the entity dictionary, candidate entities can be determined by any of the comparison methods described above; however, an entity dictionary indexed by phonetic notation features, storing the correspondence between each feature and one or more entities, greatly improves query efficiency. A sketch follows.
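The following sketch shows such a feature-indexed entity dictionary and the S130/S140 lookup against it. The entries mirror the "nanjin" example above; the Chinese strings are illustrative reconstructions (Nanjing 南京, blue whale 蓝鲸, blue crystal 蓝晶), and the helper names are assumptions, not the patent's identifiers.

```python
# Feature-indexed entity dictionary sketch: key = phonetic notation feature,
# value = list of entities sharing that feature. Entries are illustrative.
entity_dict = {
    "nanjin": ["南京", "蓝鲸", "蓝晶"],  # Nanjing, blue whale, blue crystal
}

def candidates_for(feature, entity_dict):
    """S130/S140: a word is a target vocabulary if its feature has an entry;
    the entities stored under that feature are the candidate entities."""
    return entity_dict.get(feature, [])

print(candidates_for("nanjin", entity_dict))  # all three homophones are candidates
```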
S150, selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information.
In some embodiments, the entity error correction unit 110 of the speech processing system 10 calculates the matching degree of the target vocabulary with each candidate entity one by one, and uses the candidate entity with the highest matching degree as the replacement entity for the target vocabulary in the first text information.
As non-limiting examples, the degree of match between the target vocabulary and a candidate entity may be determined by computing the edit distance between them, by a deep neural network model, or by computing word vectors for the vocabulary. An edit-distance sketch is given below.
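Of these options, edit distance is the simplest to write down. The sketch below is one standard Levenshtein formulation, given as an assumption of how such a matcher could look rather than as the patent's prescribed formula; a smaller distance means a higher matching degree.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (fewer edits = closer match)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]

# The replacement entity would be the candidate minimizing the distance:
# best = min(candidates, key=lambda e: edit_distance(target_word, e))
```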
In some embodiments, after determining that the phonetic notation feature of any vocabulary in the first text information is in the entity dictionary, the method further includes: extracting the target position of the target vocabulary in the first text information. Correspondingly, selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information includes: selecting the candidate entity with the highest matching degree and replacing the target vocabulary at the target position in the first text information.
It should be understood that, by extracting the phonetic notation features of the entities in the speech recognition result, determining the target vocabulary and the candidate entities by querying the entity dictionary, and selecting the candidate entity with the highest matching degree to replace the target vocabulary, the embodiments check and correct the entities in the speech recognition result and improve its accuracy.
In the speech processing method shown in fig. 2, using an entity dictionary that stores the correspondence between entities and their phonetic notation features greatly improves query efficiency. Fig. 3 illustrates a method, provided by an embodiment of the present application, for generating such an entity dictionary; it may be implemented by software and/or hardware of the entity dictionary generating unit 130 in the speech processing system 10. As shown in fig. 3, the method includes steps S310 to S330, whose implementation principles are as follows:
s310, an entity corpus is obtained.
In some embodiments, the entity corpus may be a vocabulary set including only entities, or may be an electronic entity dictionary, but the entity dictionary does not include correspondence between entities and phonetic notation features of the entities. The dictionary generating unit 130 of the speech processing system 10 may acquire the entity corpus from a server that provides the entity corpus, or may acquire the entity corpus from a storage medium that stores the entity corpus.
And S320, extracting the phonetic notation characteristics of each entity in the entity corpus.
In some embodiments, the dictionary generating unit 130 of the speech processing system 10 obtains the phonetic notation information of each entity in the corpus, extracts the phonetic notation feature of each piece of phonetic notation information, and establishes the correspondence between entity and feature. Note that the features should be extracted in the same manner as in the embodiments above: if a neural network model is used to extract the phonetic notation features in step S120, a neural network model should also be used here; if normalization is used in S120, normalization should also be used here, so that the features can be queried and compared.
And S330, merging the entities in the entity corpus according to the phonetic notation features of the entities to generate the entity dictionary, wherein the merging operation comprises merging the entities with the same phonetic notation features into entries corresponding to the phonetic notation features.
As a non-limiting example, the entity dictionary takes a phonetic notation feature as key and one or more entities as value; that is, one phonetic-feature key corresponds to one or more entity values. A generation sketch follows.
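A minimal generation sketch for steps S310 to S330 follows; extract_feature is assumed to be the same normalization-based feature extractor used at query time (a hypothetical name, not the patent's).

```python
from collections import defaultdict

def build_entity_dict(entity_corpus, extract_feature):
    """S310-S330: merge entities with the same phonetic notation feature into
    one entry keyed by that feature."""
    entity_dict = defaultdict(list)
    for entity in entity_corpus:             # S310: iterate the entity corpus
        feature = extract_feature(entity)    # S320: per-entity ZhuYin feature
        entity_dict[feature].append(entity)  # S330: merge under one key
    return dict(entity_dict)
```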
On the basis of the method for generating an entity dictionary provided in fig. 3, before the entity dictionary is generated from the entity corpus, the method further includes, as shown in fig. 4, steps S301 to S303:
s301, acquiring a general dictionary.
In some embodiments, the dictionary generation unit 130 of the speech processing system 10 may retrieve the universal dictionary from a storage medium or server that stores the universal dictionary.
S302, determining basic words in the general dictionary according to the parts of speech of the entries in the general dictionary.
S303, removing from the entity corpus the entities composed of basic words.
In some implementations, the entities in the entity corpus can be filtered according to a general dictionary. Filtering the entities in the entity corpus refers to removing redundant entities in the entity corpus.
The general dictionary contains words of various parts of speech, such as "n" noun, "v" verb, "i" idiom, "l" common expression, "vn" verbal noun, "t" time word, "m" numeral, "d" adverb, "z" status word, "ad" adverbial adjective, "a" adjective, and so on. The nouns are further subdivided into: nr, person name; ns, place name; nt, organization name; nz, other proper noun; and so on.
Words of certain parts of speech can be selected as basic words, namely parts of speech whose words can combine into entities. For example, verbs, auxiliary words, and the subdivided nouns (nr, ns, nt, nz) are taken as basic words, and the entities in the entity corpus composed of these basic words are deleted (i.e., filtered out).
As a non-limiting example, a song entity such as "drawn baby" is composed of common words ("draw," "baby"), so in theory the speech recognition unit can recognize it correctly; such entities can therefore be filtered out of the entity corpus.
In some embodiments, the base word may be a vocabulary of more than two words.
It should be understood that the number of entities in an entity corpus may be very large; some entity corpora contain as many as 300,000 entities. With the speech processing method shown in fig. 2, an oversized entity corpus slows error correction, whereas an entity dictionary obtained after filtering out part of the entities speeds it up. A filtering sketch follows.
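A filtering sketch for steps S301 to S303, under the assumption of a base_words set drawn from the general dictionary and a segment() helper (both hypothetical names):

```python
def filter_entity_corpus(entity_corpus, base_words, segment):
    """S301-S303: drop entities composed entirely of basic words, since the
    recognizer should already recognize those correctly."""
    kept = []
    for entity in entity_corpus:
        parts = segment(entity)                 # split the entity into words
        if parts and all(p in base_words for p in parts):
            continue                            # all basic words: filter out
        kept.append(entity)
    return kept
```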
On the basis of the entity dictionary generating method provided in fig. 4, as shown in fig. 5, the method further includes step S340:
s340, generating a phonetic feature index according to phonetic features in each entry in the entity dictionary.
In some embodiments, the index may be a Trie, each node of the Trie being a sound unit, the Trie being used to index phonetic features in entries in the entity dictionary.
It should be understood that, by the method for generating an entity dictionary provided in the embodiments of the present application, an entity dictionary is generated in advance, and the entity dictionary includes a correspondence between an entity and a phonetic notation feature of the entity, so that the speed of determining a target vocabulary can be increased, and in addition, the speed of searching for a candidate entity can be increased.
In addition, querying whether the phonetic notation feature of the target vocabulary is in the entity dictionary through the Trie trades space for time and shortens query time. A Trie sketch follows.
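A minimal sketch of such a Trie, with sound units as nodes, covering both index generation (S340) and membership lookup; the class and method names are assumptions for illustration.

```python
class ZhuyinTrie:
    """Trie whose nodes are the sound units of phonetic notation features."""

    def __init__(self):
        self.children = {}     # sound unit -> child node
        self.is_entry = False  # True if a feature ends at this node

    def insert(self, units):
        """S340: index one feature, e.g. ['sao', 'guan', 'si']."""
        node = self
        for unit in units:
            node = node.children.setdefault(unit, ZhuyinTrie())
        node.is_entry = True

    def contains(self, units):
        """Return True if the feature was indexed (used for the S130 query)."""
        node = self
        for unit in units:
            node = node.children.get(unit)
            if node is None:
                return False
        return node.is_entry

trie = ZhuyinTrie()
trie.insert(["nan", "jin"])
print(trie.contains(["nan", "jin"]))  # True: the feature is in the entity dictionary
```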
On the basis of the speech processing method provided in fig. 2, step S120, extracting the phonetic notation features of each vocabulary in the first text information, includes, as shown in fig. 6, steps S121 and S122:
s121, obtaining second text information based on the first text information; the second text information comprises phonetic notation information of each vocabulary in the first text information.
In some embodiments, the entity error correction unit 110 of the speech processing system 10 converts the one or more sentences encoded as Chinese characters into one or more sentences encoded as Chinese pinyin by consulting a Hanzi-pinyin look-up table.
As a non-limiting example, the first text information output by the speech recognition unit 120 is "'Few matters' was rated an excellent tourist city"; the entity error correction unit 110 converts this sentence into "shao guan shi bei ping wei you xiu lv you cheng shi".
In some embodiments, the ZhuYin information may be the Chinese pinyin of the first text information, an encoding of that pinyin, or phonetic symbols such as those used in Taiwan, China. It is understood that the embodiments of the present application apply equally to other ideographic scripts with phonetic symbols, such as Japanese and Korean Chinese characters and their corresponding phonetic notations.
And S122, converting the fuzzy sound unit in each phonetic notation information into a normalized sound unit to obtain the phonetic notation characteristics of each phonetic notation information.
In some embodiments, the entity error correction unit 110 of the speech processing system 10 converts the fuzzy sound units into normalized sound units through a mapping network, which maps fuzzy sound units to normalized sound units. The mapping network may be a neural network model trained on a sample set of fuzzy sound units and normalized sound units.
In some embodiments, converting the fuzzy sound units in each piece of phonetic notation information into normalized sound units comprises: querying a preset comparison table that stores the correspondence between predetermined fuzzy sound units and predetermined normalized sound units, converting the fuzzy sound units into their corresponding normalized sound units, and thereby obtaining the phonetic notation feature of each piece of phonetic notation information.
For a better understanding of the embodiments of the present application, Table 1 provides the preset comparison table of one implementation. When implementing the embodiments, the table can be adjusted and extended according to language or dialect, and it can be stored in any form a computer can easily process, such as a data table or a database. Table 1 is exemplary only, not limiting; a cell marked "-" is empty.
Table 1:

| Normalized sound unit | Fuzzy sound 1 | Fuzzy sound 2 | Fuzzy sound 3 |
|----|----|-----|---|
| n  | n  | l   | r |
| s  | sh | s   | - |
| z  | zh | z   | - |
| c  | ch | c   | x |
| in | in | ing | - |
| an | an | ang | - |
| en | en | eng | - |
| h  | h  | f   | - |
As a non-limiting example, by referring to Table 1, the fuzzy sound units in the second text information "shao guan shi bei ping wei you xiu lv you cheng shi" are converted into normalized sound units ("shao" to "sao", "shi" to "si", "ping" to "pin", "cheng" to "cen", and so on), and the result is taken as the ZhuYin feature of the first text information.
It will be appreciated that, per the applicant's second observation about speech recognition above, the correct entity and its misrecognized counterpart have similar pronunciations, so the erroneous word can be assumed to sound close to the correct entity. Extracting the phonetic notation features of words, i.e., their pronunciation commonalities, by normalizing fuzzy sounds into normalized sound units therefore reduces the complexity of feature extraction, increases its speed, and improves the efficiency of entity error correction. Moreover, performing the conversion through a preset comparison table trades space for time, further improving efficiency. A minimal sketch is given below.
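A minimal normalization sketch driven by Table 1. The fuzzy-to-normalized mapping is the table's; the split into initial and final substitutions and the longest-match heuristic are implementation assumptions.

```python
# Table 1 as two maps: fuzzy initials and fuzzy finals -> normalized units.
FUZZY_INITIALS = {"sh": "s", "zh": "z", "ch": "c", "l": "n", "r": "n", "x": "c", "f": "h"}
FUZZY_FINALS = {"ing": "in", "ang": "an", "eng": "en"}

def normalize_syllable(syl: str) -> str:
    """Convert the fuzzy sound units in one tone-free pinyin syllable (S122)."""
    for fuzzy in sorted(FUZZY_INITIALS, key=len, reverse=True):  # try "sh" before "s..."
        if syl.startswith(fuzzy):
            syl = FUZZY_INITIALS[fuzzy] + syl[len(fuzzy):]
            break
    for fuzzy in FUZZY_FINALS:
        if syl.endswith(fuzzy):
            syl = syl[: -len(fuzzy)] + FUZZY_FINALS[fuzzy]
            break
    return syl

print([normalize_syllable(s) for s in "shao guan shi bei ping".split()])
# -> ['sao', 'guan', 'si', 'bei', 'pin']
```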
On the basis of the speech processing method provided in fig. 2, a phonetic feature index is generated from the phonetic notation features of the entries in the entity dictionary. As shown in fig. 7, step S130, determining that the phonetic notation feature of any vocabulary in the first text information is in the entity dictionary, may be replaced with step S130':
s130', determining the phonetic feature of any vocabulary in the first text information in an entity dictionary through the phonetic feature index.
In some examples, the ZhuYin feature index includes a ZhuYin feature dictionary tree (Trie).
In some embodiments, the ZhuYin feature index is generated by the method of step S340. As a non-limiting example, the entity error correction unit 110 of the speech processing system 10 queries, one by one, whether the phonetic notation feature of each word in the first text information is included in the Trie; if a word's feature is found in the Trie, the word is in the entity dictionary.
It should be appreciated that using an index to query whether a ZhuYin feature is in the entity dictionary improves query speed, and that querying through a Trie trades space for time. Moreover, whereas a Trie usually stores characters at its nodes, a tree storage structure whose nodes are the sound units of phonetic notation features is better suited to querying those features, further improving query efficiency.
On the basis of the speech processing method provided in fig. 2, step S150, selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information, includes, as shown in fig. 8, steps S151 to S152:
s151, aiming at each candidate entity, obtaining the number of the same words between the target word and the candidate entity, obtaining the number of the same phonetic notation information between the target word and the candidate entity, calculating the weighted sum of the number of the same words and the number of the same phonetic notation information, and taking the weighted sum as the matching degree between the candidate entity and the target word.
As a non-limiting example, the target vocabulary is "Heaven" and one candidate entity is "Temple of Heaven"; the number of identical words between the two is 1.
As a non-limiting example, the target vocabulary is "few matters" and the candidate entity is "Shaoguan City"; the number of identical words between the two is 0.
As a non-limiting example, the phonetic notation information of the target vocabulary "few matters" is "shao guan shi" and that of the candidate entity "Shaoguan City" is "shao guan shi"; the number of identical phonetic notation units is 3.
As a non-limiting example, the phonetic notation information of the target vocabulary "few matters" is "shao guan shi" and that of the candidate entity "Shaoguang City" is "shao guang shi"; the number of identical phonetic notation units is 2.
It should be understood that separately counting the identical words and the identical phonetic notation units between the target vocabulary and a candidate entity measures the matching degree along multiple dimensions and avoids the deviation caused by any single dimension, thereby improving accuracy.
In some embodiments, the matching degree may be calculated by calculating a weighted sum of the number of identical words and the number of identical ZhuYin information.
As a non-limiting example, let the weight a of the number of identical words be 0.6 and the weight b of the number of identical phonetic notation units be 0.4, and let the first text information be "'Few matters' was rated an excellent tourist city." The candidate entities are determined to be "Shaoguan City" and "Shaoguang City".
The target vocabulary is "few matters", with phonetic notation "shao guan shi"; the candidate entity is "Shaoguan City", with phonetic notation "shao guan shi". The number of identical words is 0 and the number of identical phonetic notation units is 3, so the matching degree is a × 0 + b × 3 = 0.6 × 0 + 0.4 × 3 = 1.2.
The target vocabulary is "few matters", with phonetic notation "shao guan shi"; the candidate entity is "Shaoguang City", with phonetic notation "shao guang shi". The number of identical words is 0 and the number of identical phonetic notation units is 2, so the matching degree is a × 0 + b × 2 = 0.6 × 0 + 0.4 × 2 = 0.8.
Fig. 9a and 9b are schematic diagrams illustrating a method for calculating the matching degree between a target vocabulary and a candidate entity according to an embodiment of the present application. In this example, both the weight a of the number of identical words and the weight b of the number of identical phonetic notation information are 1. The example is given as computer code: as shown in fig. 9a and 9b, SCORE1 and SCORE2 are the matching degrees of the target vocabulary with the two candidate entities respectively, len() returns the number of elements in a set, set() constructs a set from the elements in its parentheses, and & is the set intersection operator. To make the example more intuitive, solid arrows identify identical elements and dashed arrows identify differing elements.
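Along the same lines, a minimal Python sketch of this computation is given below. It uses the set()/len()/& operations described for fig. 9 but the weights a = 0.6 and b = 0.4 from the worked example above, and placeholder character lists stand in for the actual Chinese characters, which are not reproduced here; all of these are illustrative assumptions.

```python
# Matching degree as a weighted sum of shared words (characters) and shared
# phonetic notation units, mirroring the set()/len()/& computation of fig. 9.

def matching_degree(target_chars, entity_chars, target_units, entity_units,
                    a=0.6, b=0.4):
    same_words = len(set(target_chars) & set(entity_chars))
    same_units = len(set(target_units) & set(entity_units))
    return a * same_words + b * same_units

# Placeholder characters: the target vocabulary shares no written character
# with either candidate entity, only (part of) the pronunciation.
target_chars    = ["t1", "t2", "t3"]  # mis-recognized vocabulary
shaoguan_chars  = ["s1", "s2", "s3"]  # "Shaoguan City"
shaoguang_chars = ["g1", "g2", "g3"]  # "Shaoguang City"

score1 = matching_degree(target_chars, shaoguan_chars,
                         ["shao", "guan", "shi"], ["shao", "guan", "shi"])
score2 = matching_degree(target_chars, shaoguang_chars,
                         ["shao", "guan", "shi"], ["shao", "guang", "shi"])
print(score1, score2)  # 1.2 0.8
```

The intersection of the two phonetic unit sets contains three elements for "Shaoguan City" and two ("shao" and "shi") for "Shaoguang City", reproducing the matching degrees 1.2 and 0.8 of the worked example.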
It should be understood that the weights may be adjusted through a limited number of experiments according to the actual situation; the example weights given in the embodiments of the present application do not limit the speech processing method provided by the present application.
It should be appreciated that, based on the applicant's second insight into speech recognition technology described above, the pronunciation of a vocabulary mis-recognized by speech recognition is close to, and does not deviate far from, that of the correct entity. Calculating the matching degree from the number of identical words and the number of identical phonetic notation information between the target vocabulary and the candidate entity therefore keeps the computation simple, reduces the occupation of computing resources, increases the processing speed, and improves the processing efficiency.
S152, selecting the candidate entity with the highest matching degree to replace the target vocabulary.
In some embodiments, the matching degrees of the candidate entities with the target vocabulary are ranked, and the candidate entity with the highest matching degree is selected to replace the target vocabulary.
As a non-limiting example, following the matching degree calculation above, the matching degree between "few-management-affairs" and "Shaoguan City" is 1.2, and that between "few-management-affairs" and "Shaoguang City" is 0.8. The candidate entity "Shaoguan City" has the highest matching degree and is therefore selected to replace the target vocabulary "few-management-affairs".
In some embodiments, if several candidate entities have the same matching degree with the target vocabulary, a prompt message may be generated that presents the selectable entities to the user; in response to the user's selection operation, the target vocabulary is replaced with the candidate entity selected by the user. Alternatively, a further matching algorithm may be applied to recalculate the matching degrees of these candidate entities with the target vocabulary, and the candidate entity with the highest recalculated matching degree is used to replace the target vocabulary.
It will be appreciated that, based on the applicant's second insight into speech recognition technology mentioned above, the matching degree can be determined quickly from the number of identical words and the number of identical phonetic notation information; compared with other matching degree calculation methods, the computation is faster and requires fewer hardware computing resources.
It should be understood that the various implementations in the embodiments shown in figs. 3, 4, and 5 can be reasonably combined and applied. To illustrate how they may be combined, fig. 10 shows, as a non-limiting example, a specific entity dictionary generation method, so as to better understand and implement the embodiments of the present application.
The speech processing system 10 according to the embodiment of the present application obtains an entity corpus, which may be an electronic entity dictionary that does not yet contain the correspondence between entities and their phonetic notation features. The entity corpus is filtered to remove entities that can be composed of the basic words of a general dictionary, yielding a filtered entity corpus. The phonetic notation features of each entity are then extracted from the filtered entity corpus and merged, yielding an entity dictionary that contains the correspondence between entities and their phonetic notation features. Finally, an index is built for this entity dictionary, producing a Trie tree of phonetic notation features. In this way, both the entity dictionary containing the correspondence between entities and their phonetic notation features and a Trie tree composed of those phonetic notation features are obtained.
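As a non-limiting sketch of this generation flow, the Python fragment below reuses the SoundUnitTrie class from the earlier sketch; the get_phonetic_units() helper, the particular normalizations it shows, and the basic-word filter are hypothetical stand-ins for real grapheme-to-phoneme conversion with fuzzy-sound normalization and general-dictionary filtering.

```python
# Illustrative entity dictionary generation following fig. 10.

def get_phonetic_units(entity):
    # Hypothetical stand-in: returns normalized sound units per entity.
    # The normalizations shown (e.g. "shi" -> "si", "guang" -> "guan")
    # are assumptions introduced for illustration.
    lookup = {
        "Shaoguan City":  ["shao", "guan", "si"],
        "Shaoguang City": ["shao", "guan", "si"],
    }
    return lookup[entity]

def build_entity_dictionary(entity_corpus, basic_words):
    # 1. Filter: drop entities that can be composed entirely of basic words.
    filtered = [e for e in entity_corpus
                if not all(w in basic_words for w in e.split())]
    # 2. Extract features and merge entities sharing the same feature
    #    into one entry keyed by that feature.
    dictionary = {}
    for entity in filtered:
        feature = tuple(get_phonetic_units(entity))
        dictionary.setdefault(feature, []).append(entity)
    # 3. Index every feature with a Trie for fast querying.
    trie = SoundUnitTrie()
    for feature in dictionary:
        trie.insert(list(feature))
    return dictionary, trie

dictionary, trie = build_entity_dictionary(
    ["Shaoguan City", "Shaoguang City"], basic_words={"good", "morning"})
```

Because both example entities normalize to the same feature, they are merged into a single entry, which is exactly why both later surface as candidate entities for a target vocabulary with that feature.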
It should be understood that the various embodiments of the speech processing method provided above can be reasonably combined and applied. To illustrate how, fig. 11 shows, as a non-limiting example, a specific speech processing method, so as to better understand and implement the embodiments of the present application.
As shown in fig. 11, the first text information may be a Chinese sentence, for example "Few-management-affairs is rated as an excellent tourist city." After word segmentation of the first text information, each segmented vocabulary is normalized to obtain its phonetic notation features. The pre-established Trie tree is then queried; if the phonetic notation feature of any vocabulary is found in the Trie tree, that vocabulary is determined to be a target vocabulary. The entities corresponding to the phonetic notation feature of the target vocabulary are looked up in the entity dictionary as candidate entities. For each candidate entity, the number of identical words and the number of identical phonetic notation information between the target vocabulary and the candidate entity are obtained, their weighted sum is calculated as the matching degree, and the entity with the highest matching degree is taken as the replacement entity. The replacement entity then replaces the target vocabulary in the first text information, yielding the corrected text information. It should be understood that if the first text information contains multiple target vocabularies, the above steps are repeated until all target vocabularies are corrected.
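Pulling the pieces together, a compact, non-limiting sketch of the flow in fig. 11 might read as follows; segment() and annotate() are hypothetical stand-ins for a word segmenter and a phonetic annotator with fuzzy-sound normalization, while the dictionary, trie, and matching_degree() come from the earlier sketches.

```python
# Illustrative end-to-end correction following fig. 11.

def correct(first_text, dictionary, trie, segment, annotate):
    corrected = first_text
    for vocab in segment(first_text):
        units, feature = annotate(vocab)  # raw units, normalized feature
        if not trie.contains(feature):
            continue                       # not a target vocabulary
        candidates = dictionary[tuple(feature)]
        best = max(candidates,
                   key=lambda e: matching_degree(list(vocab), list(e),
                                                 units, annotate(e)[0]))
        corrected = corrected.replace(vocab, best)
    return corrected
```

Each target vocabulary is handled in turn, so a first text information containing several target vocabularies is corrected by the same loop.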
It should be understood that the specific entities shown in fig. 10 and 11 are for the purpose of explaining how to combine and apply the various embodiments of the present application, and are not intended to limit the present application in any way.
To facilitate understanding and implementation of the embodiments of the present application, several non-limiting example application scenarios are provided for the speech processing methods offered by the embodiments of the present application.
The embodiment of the present application provides an application scenario of the speech processing method in a robot. The robot includes the speech processing system 10 described above; after acquiring the user's speech, the robot corrects the recognition result of the user's speech information by the speech processing method through the speech processing system, obtaining a corrected speech recognition result. The robot can then interact with the user according to the corrected result, performing actions including but not limited to answering the user's questions and executing the user's instructions. The robot may be any device that executes human-machine commands and completes the corresponding operations; by way of example and not limitation, it may be a walking robot or a fixed conversational robot. It is understood that a smart car can be regarded as a special robot.
The embodiment of the present application provides an application scenario in a translator. The translator includes the speech processing system 10 described above; after acquiring the user's speech, the translator corrects the recognition result of the user's speech information by the speech processing method through the speech processing system, obtaining a corrected speech recognition result. The translator translates the corrected result into a language specified by the user or preset in advance, and presents the translation either as displayed text information or as synthesized speech played through a sound production device.
The embodiment of the present application provides an application scenario of a voice input device. The voice input device includes the speech processing system 10 described above; after acquiring the user's speech, it corrects the recognition result of the user's speech information by the speech processing method through the speech processing system, obtaining a corrected speech recognition result. The voice input device may perform operations on the corrected result including but not limited to storing it, displaying it, and transmitting it to other devices. By way of example and not limitation, the voice input device may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or another terminal device.
The embodiment of the present application provides an application scenario of a wearable device. A wearable device is a portable device worn directly on the user or integrated into the user's clothing or accessories; it is not merely a hardware device, but can realize further functions through software support, data interaction, and cloud interaction. As a non-limiting example, the wearable device may be an earphone. The wearable device includes the speech processing system 10 described above; after acquiring the user's speech, it corrects the recognition result of the user's speech information by the speech processing method through the speech processing system, obtaining a corrected speech recognition result. The wearable device may perform operations on the corrected result including but not limited to storing it, displaying it, and transmitting it to other devices.
By way of example and not limitation, when the terminal device is a wearable device, the term may also be understood broadly as any everyday wearable item intelligently designed and developed with wearable technology, such as glasses, gloves, watches, clothing, and shoes. Generalized wearable smart devices either offer complete functionality in a larger form factor and can realize all or part of their functions without relying on a smart phone, such as a smart watch or smart glasses, or focus on a single application function and must be used together with another device such as a smart phone, such as various smart bracelets for monitoring physical signs, smart jewelry, and the like.
Those skilled in the art will appreciate that the robots, translators, voice input devices, and wearable devices in the above application scenarios may contain more or fewer components than described, combine certain components, or arrange the components differently.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Corresponding to the method of speech processing shown in fig. 2, fig. 12 shows a speech processing apparatus M100 provided in an embodiment of the present application, including:
a first text information obtaining module M110, configured to obtain first text information, where the first text information is a recognition result of voice information;
a phonetic notation feature extraction module M120, configured to extract phonetic notation features of each vocabulary in the first text information, where the phonetic notation features are features of vocabulary phonetic notation information;
a target vocabulary determining module M130, configured to take any vocabulary in the first text information as a target vocabulary if the phonetic notation feature of the vocabulary is determined to be in the entity dictionary, where the phonetic notation feature of the target vocabulary is the target phonetic notation feature;
a candidate entity determination module M140, configured to take an entity in the entity dictionary whose phonetic notation feature is identical to the target phonetic notation feature as a candidate entity;
and a target vocabulary replacing module M150, configured to select the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information.
It is understood that various embodiments and combinations of the embodiments in the above embodiments and their advantages are also applicable to this embodiment, and are not described herein again.
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device is used to implement the speech processing system described above, and may be the robot, translator, voice input device, or wearable device of the exemplary application scenarios above. It should be understood that in those scenarios the electronic device may further include corresponding devices or components such as a power supply unit, a power unit, an input unit, an output unit, and a communication unit, which are not listed here one by one.
As shown in fig. 13, the electronic device D10 of this embodiment includes: at least one processor D100 (only one is shown in fig. 13), a memory D101, and a computer program D102 stored in the memory D101 and operable on the at least one processor D100, wherein the processor D100 implements the steps of any of the method embodiments described above when executing the computer program D102.
The electronic device D10 may be a robot, a translator, a voice input device, a wearable device, or the like. The electronic device may include, but is not limited to, a processor D100, a memory D101. Those skilled in the art will appreciate that fig. 13 is merely an example of the electronic device D10 and does not constitute a limitation of the electronic device D10, and may include more or fewer components than those shown, or some components in combination, or different components, such as input output devices, network access devices, etc.
The processor D100 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory D101 may be an internal storage unit of the electronic device D10 in some embodiments, such as a hard disk or memory of the electronic device D10. In other embodiments, the memory D101 may also be an external storage device of the electronic device D10, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device D10. Further, the memory D101 may include both an internal storage unit and an external storage device of the electronic device D10. The memory D101 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program; it may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments may be implemented.
Embodiments of the present application provide a computer program product, which when executed on an electronic device, enables the electronic device to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, a computer-readable medium may not be an electrical carrier signal or a telecommunication signal.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of speech processing, comprising:
acquiring first text information, wherein the first text information is a recognition result of voice information;
extracting phonetic notation features of each vocabulary in the first text information, wherein the phonetic notation features are features of the vocabulary's phonetic notation information;
if the phonetic notation feature of any vocabulary in the first text information is determined to be in the entity dictionary, taking the vocabulary as a target vocabulary, wherein the phonetic notation feature of the target vocabulary is a target phonetic notation feature;
taking an entity in the entity dictionary whose phonetic notation feature is identical to the target phonetic notation feature as a candidate entity;
and selecting a candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information.
2. The method of claim 1, prior to obtaining the first text information, further comprising:
acquiring an entity corpus;
extracting the phonetic notation features of each entity in the entity corpus;
and according to the phonetic notation features of the entities, merging the entities in the entity corpus to generate the entity dictionary, wherein the merging operation comprises merging the entities with the same phonetic notation features into entries corresponding to the phonetic notation features.
3. The method of claim 2, further comprising:
acquiring a general dictionary;
determining a basic word in the general dictionary according to the part of speech of the entry in the general dictionary;
and removing the entities formed by the basic words in the entity corpus.
4. The method of claim 1, wherein extracting phonetic notation features of each vocabulary in the first text information comprises:
obtaining second text information based on the first text information; the second text information comprises phonetic notation information of each vocabulary in the first text information;
and converting the fuzzy sound units in each phonetic notation information into normalized sound units to obtain the phonetic notation features of each phonetic notation information.
5. The method of claim 4, wherein converting the fuzzy sound units in each phonetic notation information into normalized sound units comprises:
querying a preset comparison table and converting the fuzzy sound units in each phonetic notation information into the corresponding normalized sound units to obtain the phonetic notation features of each phonetic notation information, wherein the preset comparison table comprises the correspondence between preset fuzzy sound units and preset normalized sound units.
6. The method of any of claims 1 to 5, further comprising:
generating a phonetic notation feature index according to the phonetic notation features in each entry of the entity dictionary;
correspondingly, the determining that the phonetic notation feature of any vocabulary in the first text information is in the entity dictionary comprises:
determining, through the phonetic notation feature index, that the phonetic notation feature of any vocabulary in the first text information is in the entity dictionary.
7. The method of claim 1, wherein selecting the candidate entity with the highest matching degree with the target vocabulary to replace the target vocabulary in the first text information comprises:
for each candidate entity, acquiring the number of the same words between the target vocabulary and the candidate entity, acquiring the number of the same phonetic notation information between the target vocabulary and the candidate entity, calculating the weighted sum of the number of the same words and the number of the same phonetic notation information, and taking the weighted sum as the matching degree between the candidate entity and the target vocabulary;
and selecting the candidate entity with the highest matching degree to replace the target vocabulary.
8. A speech processing system for implementing the method according to any of claims 1 to 7.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech processing method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.