CN111128130A

CN111128130A - Voice data processing method and device and electronic device

Info

Publication number: CN111128130A
Application number: CN201911425436.4A
Authority: CN
Inventors: 刘洋; 梁志婷
Original assignee: Miaozhen Information Technology Co Ltd
Current assignee: Miaozhen Information Technology Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-08
Anticipated expiration: 2039-12-31
Also published as: CN111128130B

Abstract

The invention discloses a voice data processing method and device and an electronic device. Wherein, the method comprises the following steps: extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice; acquiring a target phoneme library corresponding to the phonemes through the phonemes, wherein the target phoneme library comprises N tone phonemes corresponding to the phonemes, and N is a natural number greater than 1; the method and the device have the advantages that the phoneme corresponding to the voice information is determined by matching the tonal characteristics with the phonemes in the target phoneme library, the text corresponding to the voice information is recognized according to the text phonemes, the purpose of recognizing the voice information according to the phonemes and the tonal information is achieved, the technical effects that the target phoneme library is determined according to the phoneme characteristics, the text phonemes are determined according to the tonal characteristics are achieved, and the technical problem that in the prior art, the recognition accuracy of the voice information is low is solved.

Description

Voice data processing method and device and electronic device

Technical Field

The invention relates to the field of voice processing, in particular to a voice data processing method and device and an electronic device.

Background

The world has a plurality of tone languages, and the tone languages are characterized in that when only one voice is sent, tones with different lengths and different heights are used to form words and meanings with different meanings (namely, semantic meanings or semantics). In a tone language, a voice usually has a plurality of tones, so that many near-sound words are generated, and the meanings of the near-sound words are different. In the process of recognizing the Chinese characters by voices, according to the characteristics of the Chinese characters, the initials and finals of some near-sound characters are the same but the tones are different, the same Chinese character represents different meanings under different tones, and the same pronunciation may represent completely different Chinese characters, so that the accuracy rate of recognizing the near-sound characters by a voice recognition device is low.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a voice data processing method and device and an electronic device, and at least solves the technical problem that in the prior art, the voice information recognition accuracy is low.

According to an aspect of an embodiment of the present invention, there is provided a voice data processing method, including: extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice; acquiring a target phoneme library corresponding to the phonemes through the phonemes, wherein the target phoneme library comprises N tone phonemes corresponding to the phonemes, and N is a natural number greater than 1; and matching the voice tone features with phonemes in the target phoneme library to determine text phonemes corresponding to the voice information, and identifying texts corresponding to the voice information according to the text phonemes.

According to another aspect of the embodiments of the present invention, there is also provided a voice data processing apparatus, including: the extraction unit is used for extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice; a first obtaining unit, configured to obtain a target phoneme library corresponding to the phonemes through the phonemes, where the target phoneme library includes N tonal phonemes corresponding to the phonemes, and N is a natural number greater than 1; and the determining unit is used for matching the phoneme in the target phoneme library according to the tonal characteristics, determining a text phoneme corresponding to the voice information, and identifying a text corresponding to the voice information according to the text phoneme.

According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned voice data processing method when running.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the voice data processing method through the computer program.

In the embodiment of the invention, phonemes with no tone features and phonemes with tone features in the voice information of the target object are extracted, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice; acquiring a target phoneme library corresponding to phonemes through the phonemes, wherein the target phoneme library comprises N tone phonemes corresponding to the phonemes, and N is a natural number greater than 1; the method and the device have the advantages that the phoneme corresponding to the voice information is determined by matching the tonal characteristics with the phonemes in the target phoneme library, the text corresponding to the voice information is recognized according to the text phonemes, the purpose of recognizing the voice information according to the phonemes and the tonal information is achieved, the technical effects that the target phoneme library is determined according to the phoneme characteristics, the text phonemes are determined according to the tonal characteristics are achieved, and the technical problem that in the prior art, the recognition accuracy of the voice information is low is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of an application environment of an alternative voice data processing method according to an embodiment of the present invention;

FIG. 2 is a flow diagram of an alternative method of processing voice data in accordance with an embodiment of the present invention;

FIG. 3 is a schematic diagram of an alternative voice data processing apparatus according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an electronic device according to an alternative voice data processing method in an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of the embodiments of the present invention, a voice data processing method is provided, and optionally, as an optional implementation manner, the voice data processing method may be applied to, but is not limited to, a voice data processing system in a hardware environment as shown in fig. 1, where the voice data processing system may include, but is not limited to, a terminal device 102, a network 110, and a server 112.

The terminal device 102 may include, but is not limited to: a human-computer interaction screen 104, a processor 106 and a memory 108. The human-computer interaction screen 104 is used for acquiring a human-computer interaction instruction through a human-computer interaction interface; the processor 106 is configured to respond to the human-machine interaction instructions. The memory 108 is used for storing the voice information of the target object, the attribute information of the target phoneme library, and the like. Here, the server may include but is not limited to: the database 114 and the processing engine 116, the processing engine 116 is configured to call a target phoneme library stored in the database 114, compare the target phoneme library with phonemes extracted from the speech information, determine the target phoneme library, perform matching with phonemes in the target phoneme library according to the tonal features, determine text phonemes corresponding to the speech information, and identify a text corresponding to the speech information according to the text phonemes. The aim of recognizing the voice information according to the phonemes and the tone information is achieved, the technical effects that the target phoneme base is determined according to the phoneme characteristics and the text phonemes are determined according to the tone characteristics are achieved, and the technical problem that in the prior art, the voice information recognition accuracy is low is solved.

The specific process comprises the following steps: in the terminal apparatus 102, as in steps S102 to S110, the voice information of the target object is acquired, and the voice information is transmitted to the server 112 via the network 110. Acquiring a target phoneme library corresponding to phonemes through the phonemes at the server 112 according to the phonemes with the non-tonal characteristics and the phonemes with the tonal characteristics in the extracted voice information of the target object, wherein the phonemes correspond to the phonemes of the characters to be recognized in the voice, and the tones correspond to the tones of the characters to be recognized in the voice, wherein the target phoneme library comprises N kinds of tonal phonemes corresponding to the phonemes, and N is a natural number greater than 1; and matching the voice characteristics with phonemes in the target phoneme library to determine text phonemes corresponding to the voice information, and identifying texts corresponding to the voice information according to the text phonemes. And then returns the determined result to the terminal device 102.

Optionally, in this embodiment, the voice data processing method may be, but is not limited to, applied in the server 104 for assisting the voice data processing. The terminal device 102 may be, but not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, and other terminal devices that support running an application client. The server 114 and the terminal device 102 may implement data interaction through a network, which may include but is not limited to a wireless network or a wired network. Wherein, this wireless network includes: bluetooth, WIFI, and other networks that enable wireless communication. Such wired networks may include, but are not limited to: wide area networks, metropolitan area networks, and local area networks. The above is merely an example, and this is not limited in this embodiment.

Optionally, as an optional implementation manner, as shown in fig. 2, the voice data processing method includes:

step S202, extracting phonemes without tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice.

Step S204, a target phoneme library corresponding to the phonemes is obtained through the phonemes, wherein the target phoneme library comprises N tonal phonemes of the corresponding phonemes, and N is a natural number greater than 1.

And step S206, matching the phoneme in the target phoneme library according to the tonal characteristics, determining a text phoneme corresponding to the voice information, and identifying a text corresponding to the voice information according to the text phoneme.

Optionally, in this embodiment, the target object may include, but is not limited to: an attendant who has a conversation with the customer, a shopping guide at the store, etc. The target phone library may include, but is not limited to, phones corresponding to 4 different tones. For example, 4 tones, i.e., one, two, three, and four, are typically included in Chinese.

Specifically, in practical application, homophones (ai) are provided with two submodels, one is a submodel with no tone characteristic, and the other is a submodel for respectively distinguishing tones into first tone, second tone, third tone and fourth tone; during recognition, the target phoneme library is found through the phoneme (ai), and then the homonym (he, wei, ei, he, she, ei, , hich, stechii, ) with one tone in the second submodel is found through tone features (e.g. one tone).

Optionally, in this embodiment, the obtaining a target phoneme library of corresponding phonemes through phonemes includes:

s1, acquiring voice information of a target object in the target industry;

and S2, extracting a reference word bank corresponding to the target voice information from the voice information, and creating a target phoneme bank according to the reference word bank, wherein the target phoneme bank stores the relationship between the phonemes and the corresponding different tones.

It is noted that the target industry may include, but is not limited to, the catering industry, the hospitality industry, etc.

Wherein, creating the target phoneme library according to the reference lexicon may include:

identifying a phoneme library corresponding to the near-sound characters and/or a phoneme library corresponding to the homophone characters in the reference word library;

the phoneme library of the phonogram and/or the phoneme library of the homophone is determined as a target phoneme library.

It should be noted that, recognizing the text corresponding to the speech information according to the text phonemes includes:

and recognizing the text corresponding to the voice information according to the text phoneme and the semantic meaning corresponding to the voice information.

According to the embodiment provided by the application, phonemes with no tone features and phonemes with tone features in the voice information of the target object are extracted, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice; acquiring a target phoneme library corresponding to phonemes through the phonemes, wherein the target phoneme library comprises N tone phonemes corresponding to the phonemes, and N is a natural number greater than 1; the method and the device have the advantages that the phoneme corresponding to the voice information is determined by matching the tonal characteristics with the phonemes in the target phoneme library, the text corresponding to the voice information is recognized according to the text phonemes, the purpose of recognizing the voice information according to the phonemes and the tonal information is achieved, the technical effects that the target phoneme library is determined according to the phoneme characteristics, the text phonemes are determined according to the tonal characteristics are achieved, and the technical problem that in the prior art, the recognition accuracy of the voice information is low is solved.

As an optional scheme, before extracting the phonemes with the unvoiced features and the phonemes with the voiced features in the speech information of the target object, the method further includes:

and acquiring voice information of the target object through the target voice acquisition equipment, wherein the target voice acquisition equipment is arranged at the target position.

As an alternative embodiment, the present application provides a method of speech recognition based on tonal features. The method comprises the following specific steps:

step 1, constructing an industry word bank of the catering service based on big data, and extracting a reference word bank containing the near-sound characters.

Taking the catering industry as an example, the industry word bank of the catering service industry can be established through repeated denoising and regular processing modes of removing the weight, denoising, screening and screening the sensitive words.

And performing data arrangement on the industry word stock of the catering service to obtain a reference word stock, wherein the reference word stock comprises a plurality of reference phrases, the reference phrases are based on pinyin as grouping basis, and words with homophones and near-syllable characters in the industry word stock of the catering service are divided into a group.

Generating a reference lexicon phoneme table (corresponding to a target phoneme table) according to the reference lexicon, wherein tone phonemes are added in the reference lexicon phoneme table; the reference word stock phoneme table comprises a plurality of phoneme submodels (each reference word group corresponds to one phoneme submodel); each phoneme submodel comprises a plurality of phoneme units of words with homophones and near-phonetic characters.

Step 2, feature extraction: the method comprises the steps of obtaining conversation voice signals between service personnel and customers in the catering service process, and processing the conversation voice signals to obtain an acoustic unit without the tone characteristics; and obtaining acoustic units (including phonemes, tones and the like) with tone characteristics by adopting the existing fundamental frequency extraction algorithm.

And 3, matching the acoustic unit without the tonal characteristic with a reference word stock phoneme table for the first time, and finding a corresponding phoneme sub-model (equivalent to a target phoneme stock).

And 4, matching the acoustic unit with the tone characteristic with the phoneme submodel for the second time, distinguishing according to the tone characteristic, finding out a corresponding phoneme unit, and converting the phoneme unit into a specific text.

It should be noted that a "phoneme-tone table" may be constructed, where a tone unit is added to a phoneme in the phoneme submodel, and tone information is added after each phoneme in each phoneme submodel, and the initial consonant is not changed. And then, training a voice recognition model by using the prior art, firstly training the phoneme unit containing the tone in the step 3, and then recognizing the near-sound character by using the voice recognition engine added with the tone.

In this embodiment, step 3 and step 4 are exemplified as follows: setting homophone (ai) into two submodels, one is a submodel with no tone characteristic, and the other is a submodel for respectively distinguishing tones into first tone, second tone, third tone and fourth tone; during recognition, the first sub-model (target phoneme library) is found through phonemes (ai), and then homonyms (seyi, hayi, seyi, angyi, , belch, heyi, ) with tone in the second sub-model are found through tone features (e.g., one tone).

The embodiment provided by the application can achieve the following beneficial effects: the tone features are added, so that the accuracy of voice recognition can be improved; the method of twice recognition is adopted, so that the corresponding phoneme unit can be quickly and accurately locked; the phoneme model and the tone model are distinguished, so that the data volume during model matching each time is reduced, and the recognition speed is improved.

That is, in the present embodiment, by adding a tone feature to a dictionary phoneme corresponding to a near-phonetic character, the accuracy of recognizing the near-phonetic character by the speech recognition apparatus is improved.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

According to another aspect of the embodiment of the present invention, there is also provided a voice data processing apparatus for implementing the above-mentioned voice data processing method. As shown in fig. 3, the voice data processing apparatus includes: an extraction unit 31, a first acquisition unit 33, and a determination unit 35.

An extracting unit 31, configured to extract phonemes with no tonal features and phonemes with tonal features in the speech information of the target object, where the phonemes correspond to the phonemes of characters to be recognized in the speech, and the tones correspond to tones of the characters to be recognized in the speech;

a first obtaining unit 33, configured to obtain a target phoneme library of corresponding phonemes through the phonemes, where the target phoneme library includes N tonal phonemes of the corresponding phonemes, and N is a natural number greater than 1;

the determining unit 35 is configured to match the phoneme in the target phoneme library according to the tonal features, determine a text phoneme corresponding to the speech information, and recognize a text corresponding to the speech information according to the text phoneme.

Wherein, the first obtaining unit 33 includes:

the acquisition module is used for acquiring voice information of a target object in a target industry;

and the creating module is used for extracting a reference word bank corresponding to the target voice information from the voice information and creating a target phoneme bank according to the reference word bank, wherein the target phoneme bank stores the relationship between the phonemes and the corresponding different tones.

Wherein, the creating module comprises:

the recognition submodule is used for recognizing the phoneme library corresponding to the near-sound words and/or the phoneme library corresponding to the homophone words in the reference word library;

and the determining submodule is used for determining the phoneme base of the near-sound words and/or the phoneme base of the homophone words as the target phoneme base.

With the embodiment provided in the present application, the extracting unit 31 extracts phonemes with no tonal features and phonemes with tonal features from the speech information of the target object, where the phonemes correspond to the phonemes of the characters to be recognized in the speech and the tones correspond to the tones of the characters to be recognized in the speech; the first obtaining unit 33 obtains a target phoneme library of corresponding phonemes through the phonemes, wherein the target phoneme library includes N tonal phonemes of the corresponding phonemes, and N is a natural number greater than 1; the determining unit 35 matches the phonemes in the target phoneme library according to the tonal features, determines text phonemes corresponding to the speech information, and identifies a text corresponding to the speech information according to the text phonemes.

As an alternative embodiment, the above apparatus may further comprise:

and a second acquisition unit, configured to acquire the speech information of the target object through a target speech acquisition device before extracting the phonemes of the unvoiced features and the phonemes of the voiced features in the speech information of the target object, where the target speech acquisition device is set at a target position.

According to a further aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the above-mentioned voice data processing method, as shown in fig. 4, the electronic device includes a memory 402 and a processor 404, the memory 402 stores a computer program, and the processor 404 is configured to execute the steps in any one of the above-mentioned method embodiments through the computer program.

Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

s1, extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice;

s2, acquiring a target phoneme library corresponding to the phonemes through the phonemes, wherein the target phoneme library comprises N tonal phonemes of the corresponding phonemes, and N is a natural number greater than 1;

and S3, matching the phoneme in the target phoneme library according to the tonal characteristics, determining a text phoneme corresponding to the voice information, and identifying a text corresponding to the voice information according to the text phoneme.

Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 4 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, and a Mobile Internet Device (MID), a PAD, and the like. Fig. 4 is a diagram illustrating the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 4, or have a different configuration than shown in FIG. 4.

The memory 402 may be used to store software programs and modules, such as program instructions/modules corresponding to the voice data processing method and apparatus in the embodiment of the present invention, and the processor 404 executes various functional applications and data processing by running the software programs and modules stored in the memory 402, that is, implements the voice data processing method. The memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 402 may further include memory located remotely from the processor 404, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 402 may be, but not limited to, specifically configured to store information such as speech information of a target object, a target phoneme library, and the like. As an example, as shown in fig. 4, the memory 402 may include, but is not limited to, the extracting unit 31, the first obtaining unit 33, and the determining unit 35 in the voice data processing apparatus. In addition, the present invention may also include, but is not limited to, other module units in the voice data processing apparatus, which are not described in this example again.

Optionally, the transmission device 406 is used for receiving or sending data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 406 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmission device 406 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

In addition, the electronic device further includes: a display 408 for displaying the determined text; and a connection bus 410 for connecting the respective module parts in the above-described electronic apparatus.

According to a further aspect of an embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:

Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A method for processing voice data, comprising:

extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice;

acquiring a target phoneme library corresponding to the phonemes through the phonemes, wherein the target phoneme library comprises N tone phonemes corresponding to the phonemes, and N is a natural number greater than 1;

and matching the voice tone features with phonemes in the target phoneme library to determine text phonemes corresponding to the voice information, and identifying texts corresponding to the voice information according to the text phonemes.

2. The method of claim 1, wherein obtaining a target phone library corresponding to the phone from the phone comprises:

acquiring voice information of a target object in a target industry;

extracting a reference word bank corresponding to target voice information from the voice information, and creating the target phoneme bank according to the reference word bank, wherein the target phoneme bank stores relations between phonemes and corresponding different tones.

3. The method of claim 2, wherein creating the target phone library from the reference thesaurus comprises:

identifying a phoneme library corresponding to the near-sound words and/or a phoneme library corresponding to the homophone words in the reference word library;

and determining the phoneme base of the phonogram and/or the phoneme base of the homophone as the target phoneme base.

4. The method of claim 1, wherein before extracting the phonemes with unvoiced features and the phonemes with voiced features in the speech information of the target object, the method further comprises:

and acquiring voice information of the target object through target voice acquisition equipment, wherein the target voice acquisition equipment is arranged at a target position.

5. The method of claim 1, wherein identifying the text corresponding to the speech information according to the text phoneme comprises:

6. A speech data processing apparatus, comprising:

the extraction unit is used for extracting phonemes with no tone features and phonemes with tone features in the voice information of the target object, wherein the phonemes correspond to the phonemes of characters to be recognized in the voice, and the tones correspond to tones of the characters to be recognized in the voice;

a first obtaining unit, configured to obtain a target phoneme library corresponding to the phonemes through the phonemes, where the target phoneme library includes N tonal phonemes corresponding to the phonemes, and N is a natural number greater than 1;

and the determining unit is used for matching the phoneme in the target phoneme library according to the tonal characteristics, determining a text phoneme corresponding to the voice information, and identifying a text corresponding to the voice information according to the text phoneme.

7. The apparatus of claim 6, wherein the first obtaining unit comprises:

and the creating module is used for extracting a reference word bank corresponding to the target voice information from the voice information and creating the target phoneme bank according to the reference word bank, wherein the target phoneme bank stores the relationship between the phonemes and the corresponding different tones.

8. The apparatus of claim 7, wherein the creation module comprises:

the recognition submodule is used for recognizing a phoneme library corresponding to the near-sound words and/or a phoneme library corresponding to the homophone words in the reference word library;

9. The apparatus of claim 6, further comprising:

the second acquisition unit is used for acquiring the voice information of the target object through a target voice acquisition device before extracting the phonemes of the unvoiced features and the phonemes of the voiced features in the voice information of the target object, wherein the target voice acquisition device is arranged at a target position.

10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 5 by means of the computer program.