CN111402864A - Voice processing method and electronic equipment - Google Patents

Info

Publication number: CN111402864A
Authority: CN (China)
Prior art keywords: language model, target, voice, processed, client
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202010196188.7A
Other languages: Chinese (zh)
Inventors: 卢露露, 冯大航, 陈孝良
Current Assignee: Beijing SoundAI Technology Co Ltd (the listed assignee may be inaccurate)
Original Assignee: Beijing SoundAI Technology Co Ltd
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202010196188.7A
Publication of CN111402864A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling

Abstract

The invention provides a voice processing method and electronic equipment. The method comprises: acquiring to-be-processed voice sent by a first client; and acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is obtained by fusing a general language model and a target special language model, the general language model is trained on general corpus data, and the target special language model is trained on corpus data sent by the first client. The embodiment of the invention can improve the voice recognition effect.

Description

Voice processing method and electronic equipment
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a speech processing method and an electronic device.
Background
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics that focuses on the interactions between computers and human (natural) language.
Natural language processing technology can process speech based on a language model. In the prior art, when speech recognition services are provided for enterprise-level customers, the vocabulary in the speech to be processed often involves a professional field, so the effect of performing speech recognition with a universal language model is poor.
Disclosure of Invention
The embodiment of the invention provides a voice processing method and electronic equipment, aiming to solve the prior-art problem that, because the vocabulary in the speech to be processed involves a professional field, speech recognition with a universal language model performs poorly.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a speech processing method, where the method includes:
acquiring to-be-processed voice sent by a first client;
and acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained based on fusion of a general language model and a target special language model, the general language model is obtained based on training of general corpus data, and the target special language model is obtained based on training of corpus data sent by the first client.
In a second aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
the first acquisition module is used for acquiring the voice to be processed sent by the first client;
and the second obtaining module is used for obtaining the text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained based on fusion of a general language model and a target special language model, the general language model is obtained based on training of general corpus data, and the target special language model is obtained based on training of corpus data sent by the first client.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, which program, when executed by the processor, performs the steps in the speech processing method according to the first aspect.
In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the speech processing method according to the first aspect.
In the embodiment of the invention, to-be-processed voice sent by a first client side is obtained; and acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained based on fusion of a general language model and a target special language model, the general language model is obtained based on training of general corpus data, and the target special language model is obtained based on training of corpus data sent by the first client. Therefore, when the service related to the voice recognition is provided for the enterprise-level client, the processing can be carried out based on the target special language model corresponding to the enterprise, and the voice recognition effect can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a method for processing speech according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a language model provided by an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 4 is a second schematic structural diagram of an electronic device according to an embodiment of the invention;
fig. 5 is a third schematic structural diagram of an electronic apparatus according to an embodiment of the invention;
fig. 6 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiment of the present invention, the electronic device includes, but is not limited to, a server, a platform device, a tablet computer, a notebook computer, a palm computer, a mobile terminal, and the like.
Referring to fig. 1, fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
step 101, obtaining a voice to be processed sent by a first client.
The voice processing method can be applied to electronic equipment, the first client can be an enterprise client, and a target enterprise can be determined based on the first client. For example, when a target enterprise has a voice parsing requirement, the to-be-processed voice can be sent to the electronic device through the first client. The electronic equipment can be connected with a plurality of clients and can carry out voice processing on the voice to be processed sent by the clients.
102, acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained by fusing a general language model and a target special language model, the general language model is obtained by training based on general corpus data, and the target special language model is obtained by training based on corpus data sent by the first client.
The electronic device may store a general language model and a plurality of special language models, the plurality of special language models may be obtained by training based on corpus data sent by different clients, and the target special language model may be a language model obtained by training based on corpus data sent by the first client among the plurality of special language models. The speech to be processed may be processed based on the target language model to obtain a probability estimate for each predicted text in a plurality of predicted texts corresponding to the speech to be processed, and the decoder may determine a text corresponding to the speech to be processed in the plurality of predicted texts according to the probability estimate for each predicted text. In the process of speech recognition, the acoustic score and the language score of the speech to be processed can be obtained through the acoustic model and the target language model, and the decoder can determine the text corresponding to the speech to be processed as the speech recognition result based on the acoustic score and the language score.
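The decoding step described above, in which the decoder combines an acoustic score and a language score to choose among candidate texts, can be sketched as follows (an illustrative sketch, not the patent's implementation; the candidate format and weighting scheme are assumptions):

```python
# Illustrative sketch: a decoder picks the recognition result by
# combining each candidate's acoustic score with its language score
# (both assumed to be in the log domain). The tuple format and
# lm_weight are assumptions, not the patent's actual interface.
def pick_best_text(candidates, lm_weight=0.5):
    """candidates: list of (text, acoustic_score, language_score)."""
    best_text, best_score = None, float("-inf")
    for text, acoustic, language in candidates:
        combined = acoustic + lm_weight * language
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text
```

With a fused language model that assigns the professional term a higher language score, the domain-correct candidate can win even when its acoustic score is slightly lower.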
In addition, the target language model may be a language model obtained by performing interpolation fusion on the general language model and the target specific language model; or the speech to be processed may be processed by the general language model and the target specific language model separately, with the mean of the language score obtained by the general language model and the language score obtained by the target specific language model taken as the language score of the target language model; alternatively, other fusion methods may be adopted to obtain the target language model, which is not limited in the embodiment of the present invention.
It should be noted that the corpus data sent by the first client may be corpus data including professional vocabularies, the general corpus data may be common corpus data in life, different clients correspond to different categories of corpus data, and the recognition accuracy of the professional vocabularies of the enterprise may be improved through the target-specific language model. Taking a bank client as an example, the corpus data sent by the bank client may include professional vocabularies related to banking business, for example, a business "swiftlet card" in the bank is similar to an "experience card" in the general corpus, and currently, in the process of voice recognition, the "swiftlet card" is recognized as an "experience card" with a high probability. In the embodiment, when the target special language model corresponding to the bank client is trained, the professional vocabulary related to the banking business is added into the training corpus, and the recognition accuracy of the professional vocabulary related to the banking business can be improved in the voice recognition process.
Further, before the obtaining the text corresponding to the speech to be processed based on the target language model, the method may further include: receiving corpus data sent by the first client; training the target specific language model based on the corpus data sent by the first client; and fusing the target special language model and the general language model obtained by training to obtain the target language model.
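The training step above, building the target specific language model from the corpus data a client uploads, can be sketched minimally; here a plain unigram probability table stands in for whatever model is actually trained, and the function name and data format are illustrative assumptions:

```python
from collections import Counter

# Minimal illustrative sketch of training a client-specific model from
# uploaded corpus data: a unigram probability table built from word
# counts. A real system would train a smoothed n-gram or neural model;
# this is only a stand-in for the training step described in the text.
def train_specific_model(corpus_sentences):
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(sentence.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```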
It should be noted that, in a speech recognition application scenario, the text corresponding to the speech to be processed is obtained based on the target language model, and the text corresponding to the speech to be processed is obtained by performing speech recognition on the speech to be processed based on the target language model; in a machine translation application scenario, machine translation may be performed on the speech to be processed based on a target language model to obtain a text corresponding to the speech to be processed, and the text corresponding to the speech to be processed may be obtained based on the target language model in different application scenarios.
Further, obtaining the to-be-processed voice sent by the first client may comprise obtaining a voice recognition request sent by the first client, where the voice recognition request includes the to-be-processed voice and a target identifier used to identify the first client; before obtaining the text corresponding to the speech to be processed based on the target language model, the method further includes determining the target language model based on the target identifier. Alternatively, the speech to be processed may be processed by a plurality of language models, the highest language score obtained across the language models determined, and the text corresponding to that highest language score used as the text corresponding to the speech to be processed.
In practical application, taking the electronic equipment as a server as an example, each enterprise can upload corpus data to the server through its own client; the server acquires the corpus data uploaded by each enterprise and, after cleaning it, uses it to train each enterprise's special language model separately. When uploading corpus data, each enterprise should generalize it as much as possible, which can improve the voice recognition effect.
As shown in fig. 2, a general language model and a plurality of special language models may be stored in the electronic device, where the special language models may be trained on corpus data sent by different clients. Each enterprise can therefore customize its own special language model, so that speech containing the enterprise's professional vocabulary can be recognized and the recognition accuracy of the enterprise's private words improved. The electronic equipment can also perform voice recognition on other categories of voice, such as chat, music, video and weather, so that it supports multiple functions and recognizes well in each of them. Each category may be provided with a corresponding first language model for judging that category, and the probability that the speech to be processed belongs to each category can be obtained based on that first language model. When the to-be-processed voice is received, the first language model of each category can be used to judge it and obtain the probability corresponding to each category, and the voice recognition branch corresponding to the category with the highest probability can be used to recognize the to-be-processed voice.
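The category-routing logic above (score the utterance with each category's first language model, then take the highest-probability branch) can be sketched as follows; the category names and scoring functions are illustrative assumptions:

```python
# Illustrative sketch of category routing: each category's "first
# language model" is represented here by a plain scoring function
# returning the probability that the utterance belongs to that
# category; the branch with the highest probability then handles it.
def route_to_category(utterance, category_scorers):
    """category_scorers: dict of category name -> scorer(utterance) -> probability."""
    return max(category_scorers, key=lambda cat: category_scorers[cat](utterance))
```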
In the embodiment of the invention, to-be-processed voice sent by a first client side is obtained; and acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained based on fusion of a general language model and a target special language model, the general language model is obtained based on training of general corpus data, and the target special language model is obtained based on training of corpus data sent by the first client. Therefore, when the service related to the voice recognition is provided for the enterprise-level client, the processing can be carried out based on the target special language model corresponding to the enterprise, and the voice recognition effect can be improved.
Optionally, the target language model is a language model obtained by performing interpolation fusion on the general language model and the target specific language model.
Interpolation fusion of the general language model and the target specific language model allows low-frequency words to be estimated, so that the target language model retains universality. Linear interpolation between the two models can also exploit information of different orders simultaneously: in an n-gram language model, a larger n uses more context, but the probability estimates become sparse. Because of this sparseness, a high-order n-gram may never appear in the corpus, and simply treating its probability as zero is clearly inappropriate; a better method is to fall back to the (n-1)-gram, which linear interpolation realizes. Taking the target language model as a bi-gram language model as an example, the probability the target language model assigns to the occurrence of a text may be: P3 = a·P1 + (1 - a)·P2, where a is a weight value between 0 and 1, P1 is the probability calculated by the general language model, and P2 is the lower-order probability provided by the target specific language model.
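The interpolation formula can be sketched directly; the probability tables, the weight a, and the floor probability used for unseen words are illustrative assumptions:

```python
# Illustrative sketch of the interpolation fusion P3 = a*P1 + (1-a)*P2:
# P1 comes from the general language model, P2 from the client-specific
# model. A small floor probability stands in for unseen words, echoing
# the back-off idea for sparse n-grams; all tables here are assumptions.
def fused_prob(word, general_lm, specific_lm, a=0.7, floor=1e-8):
    p1 = general_lm.get(word, floor)   # general-model probability
    p2 = specific_lm.get(word, floor)  # specific-model probability
    return a * p1 + (1 - a) * p2
```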
In this embodiment, the target language model is obtained by performing interpolation fusion on the general language model and the target specific language model, and through interpolation fusion, the data processing capability of the general language model can be retained, and meanwhile, the occurrence probability is increased when the target language model is processed for professional words that the general language model is not good at processing, so that the target language model can be better processed for non-professional words and professional words, and the voice processing effect can be further improved.
Optionally, the method further includes:
and under the condition that the target special language model is updated, fusing the updated target special language model with the general language model to obtain an updated target language model.
After the first client sends corpus data, the target specific language model may be retrained on that corpus data, thereby updating it. The electronic device may store a first identifier for identifying the target specific language model, which may be an MD5 value of the file storing the target specific language model. When the target specific language model is updated, the first identifier changes; the decoder used for speech recognition can thus determine from the first identifier that the target specific language model has been updated, fuse the updated target specific language model with the general language model to obtain an updated target language model, and load the updated target language model into memory.
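The MD5-based update check described above can be sketched as follows; the file layout and reload flow are illustrative assumptions:

```python
import hashlib

# Illustrative sketch of the MD5-based update check: recompute the MD5
# of the file storing the specific model and compare it with the stored
# first identifier; a mismatch signals that the model was retrained and
# the fused model should be rebuilt and reloaded. Paths are assumptions.
def file_md5(path):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def model_updated(path, stored_md5):
    """Return (changed, current_md5)."""
    current = file_md5(path)
    return current != stored_md5, current
```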
In this embodiment, when the target specific language model is updated, the updated target specific language model and the general language model are fused to obtain an updated target language model, so that the target language model can be updated in real time, the target language model is iterated without human intervention, the degree of intelligence is high, and when a plurality of enterprise clients exist, the language models corresponding to the plurality of enterprise clients can be simultaneously updated.
Optionally, the obtaining of the to-be-processed voice sent by the first client includes:
acquiring a voice recognition request sent by a first client, wherein the voice recognition request comprises a voice to be processed and a target identifier, and the target identifier is used for identifying the first client;
before the obtaining of the text corresponding to the speech to be processed based on the target language model, the method further includes:
determining the target language model based on the target identification.
The target identification can be set by the electronic device in a default mode, the electronic device can send the set target identification to the first client, and the first client can store the target identification; or, the enterprise user may set a target identifier on the first client, and when the first client sends corpus data to the electronic device to train the target specific language model, the target identifier may be carried, and the electronic device may store the target identifier, the target specific language model, and a corresponding relationship between the target identifier and the target language model.
In this embodiment, a voice recognition request sent by a first client is obtained, where the voice recognition request includes a to-be-processed voice and a target identifier, the target identifier is used to identify the first client, and the target language model is determined based on the target identifier, so that determining the target language model through the target identifier avoids using an incorrect language model for the voice recognition request of the first client, and can improve the efficiency of voice processing.
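Determining the target language model from the target identifier amounts to a lookup keyed by client identity; the request shape, registry, and default below are illustrative assumptions, not the patent's API:

```python
# Illustrative sketch of determining the target language model from the
# target identifier carried in a speech recognition request: the stored
# correspondence between identifiers and fused models is modeled as a
# plain dict, with a general model as the fallback.
def resolve_model(request, model_registry, default="general"):
    """request: dict with 'audio' and a 'client_id' target identifier."""
    return model_registry.get(request.get("client_id"), default)
```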
Optionally, the obtaining the text corresponding to the speech to be processed based on the target language model includes:
and performing voice recognition on the voice to be processed based on a target language model to obtain a text corresponding to the voice to be processed.
The electronic device includes a decoder capable of being used for speech recognition, a target identifier may be added to the decoder, the target identifier is used to identify the first client, a correspondence between the target identifier and a target language model may be stored in the decoder, and when a plurality of language models are stored in the electronic device, the decoder may determine the target language model from the plurality of language models according to the target identifier corresponding to the first client.
In the embodiment, the speech to be processed is subjected to speech recognition based on the target language model to obtain the text corresponding to the speech to be processed, so that the accuracy of speech recognition on the speech of the enterprise customer can be improved, and the interactive experience of the enterprise customer is improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the electronic device 200 includes:
a first obtaining module 201, configured to obtain a to-be-processed voice sent by a first client;
a second obtaining module 202, configured to obtain a text corresponding to the speech to be processed based on a target language model, where the target language model is a language model obtained based on a fusion of a general language model and a target-specific language model, the general language model is obtained based on a training of general corpus data, and the target-specific language model is obtained based on a training of corpus data sent by the first client.
Optionally, the target language model is a language model obtained by performing interpolation fusion on the general language model and the target specific language model.
Optionally, as shown in fig. 4, the electronic device 200 further includes:
and the updating module 203 is configured to fuse the updated target specific language model and the general language model to obtain an updated target language model when the target specific language model is updated.
Optionally, the first obtaining module 201 is specifically configured to:
acquiring a voice recognition request sent by a first client, wherein the voice recognition request comprises a voice to be processed and a target identifier, and the target identifier is used for identifying the first client;
as shown in fig. 5, the electronic device 200 further includes:
a determining module 204, configured to determine the target language model based on the target identifier.
Optionally, the second obtaining module 202 is specifically configured to:
and performing voice recognition on the voice to be processed based on a target language model to obtain a text corresponding to the voice to be processed.
The electronic device can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Referring to fig. 6, fig. 6 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 6, the electronic device 300 includes: a memory 302, a processor 301, and a program stored on the memory 302 and executable on the processor 301, wherein:
the processor 301 reads the program in the memory 302 for executing:
acquiring to-be-processed voice sent by a first client;
and acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is a language model obtained based on fusion of a general language model and a target special language model, the general language model is obtained based on training of general corpus data, and the target special language model is obtained based on training of corpus data sent by the first client.
Optionally, the target language model is a language model obtained by performing interpolation fusion on the general language model and the target specific language model.
Optionally, the processor 301 is further configured to perform:
and under the condition that the target special language model is updated, fusing the updated target special language model with the general language model to obtain an updated target language model.
Optionally, the obtaining, by the processor 301, to-be-processed voice sent by the first client includes:
acquiring a voice recognition request sent by a first client, wherein the voice recognition request comprises a voice to be processed and a target identifier, and the target identifier is used for identifying the first client;
before the obtaining of the text corresponding to the speech to be processed based on the target language model, the method further includes:
determining the target language model based on the target identification.
Optionally, the obtaining, by the processor 301, a text corresponding to the speech to be processed based on the target language model includes:
and performing voice recognition on the voice to be processed based on a target language model to obtain a text corresponding to the voice to be processed.
In fig. 6, the bus architecture may include any number of interconnected buses and bridges, with one or more processors represented by processor 301 and various circuits of memory represented by memory 302 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.
The processor 301 is responsible for managing the bus architecture and general processing, and the memory 302 may store data used by the processor 301 in performing operations.
It should be noted that any implementation manner in the method embodiment of the present invention may be implemented by the electronic device in this embodiment, and achieve the same beneficial effects, and details are not described here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the foregoing speech processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A method of speech processing, the method comprising:
acquiring a voice to be processed sent by a first client;
acquiring a text corresponding to the voice to be processed based on a target language model, wherein the target language model is obtained by fusing a general language model and a target special language model, the general language model is obtained by training on general corpus data, and the target special language model is obtained by training on corpus data sent by the first client.
2. The method according to claim 1, wherein the target language model is a language model obtained by interpolation fusion of the general language model and the target specific language model.
3. The method of claim 1, further comprising:
in a case that the target special language model is updated, fusing the updated target special language model with the general language model to obtain an updated target language model.
4. The method of claim 1, wherein the acquiring the voice to be processed sent by the first client comprises:
acquiring a voice recognition request sent by the first client, wherein the voice recognition request comprises the voice to be processed and a target identifier, and the target identifier is used for identifying the first client;
before the acquiring the text corresponding to the voice to be processed based on the target language model, the method further comprises:
determining the target language model based on the target identifier.
5. The method according to claim 1, wherein the acquiring the text corresponding to the voice to be processed based on the target language model comprises:
performing voice recognition on the voice to be processed based on the target language model to obtain the text corresponding to the voice to be processed.
6. An electronic device, characterized in that the electronic device comprises:
a first obtaining module, configured to acquire a voice to be processed sent by a first client; and
a second obtaining module, configured to acquire a text corresponding to the voice to be processed based on a target language model, wherein the target language model is obtained by fusing a general language model and a target special language model, the general language model is obtained by training on general corpus data, and the target special language model is obtained by training on corpus data sent by the first client.
7. The electronic device of claim 6, wherein the target language model is a language model obtained by interpolation fusion of the general language model and the target specific language model.
8. The electronic device of claim 6, further comprising:
an updating module, configured to fuse, in a case that the target special language model is updated, the updated target special language model with the general language model to obtain an updated target language model.
9. The electronic device of claim 6, wherein the first obtaining module is specifically configured to:
acquire a voice recognition request sent by the first client, wherein the voice recognition request comprises the voice to be processed and a target identifier, and the target identifier is used for identifying the first client;
the electronic device further comprises:
a determining module, configured to determine the target language model based on the target identifier.
10. The electronic device of claim 6, wherein the second obtaining module is specifically configured to:
perform voice recognition on the voice to be processed based on the target language model to obtain the text corresponding to the voice to be processed.
11. An electronic device, comprising: memory, processor and program stored on the memory and executable on the processor, which when executed by the processor implements the steps in the speech processing method according to any of claims 1 to 5.
CN202010196188.7A 2020-03-19 2020-03-19 Voice processing method and electronic equipment Pending CN111402864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010196188.7A CN111402864A (en) 2020-03-19 2020-03-19 Voice processing method and electronic equipment

Publications (1)

Publication Number Publication Date
CN111402864A true CN111402864A (en) 2020-07-10

Family

ID=71428892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010196188.7A Pending CN111402864A (en) 2020-03-19 2020-03-19 Voice processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN111402864A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment
CN112669851A (en) * 2021-03-17 2021-04-16 北京远鉴信息技术有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2125200A1 (en) * 1993-10-28 1995-04-29 Adam L. Berger Language Translation Apparatus and Method Using Context-Based Translation Models
CN103514230A (en) * 2012-06-29 2014-01-15 北京百度网讯科技有限公司 Method and device used for training language model according to corpus sequence
US20170125013A1 (en) * 2015-10-29 2017-05-04 Le Holdings (Beijing) Co., Ltd. Language model training method and device
CN107945792A (en) * 2017-11-06 2018-04-20 百度在线网络技术(北京)有限公司 Method of speech processing and device
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment
CN110610700A (en) * 2019-10-16 2019-12-24 科大讯飞股份有限公司 Decoding network construction method, voice recognition method, device, equipment and storage medium



Similar Documents

Publication Publication Date Title
WO2022095380A1 (en) Ai-based virtual interaction model generation method and apparatus, computer device and storage medium
JP6677419B2 (en) Voice interaction method and apparatus
CN111402864A (en) Voice processing method and electronic equipment
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN106649739B (en) Multi-round interactive information inheritance identification method and device and interactive system
CN111191450A (en) Corpus cleaning method, corpus entry device and computer-readable storage medium
CN111310440A (en) Text error correction method, device and system
CN109857865B (en) Text classification method and system
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN111858854A (en) Question-answer matching method based on historical dialogue information and related device
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN111177307A (en) Test scheme and system based on semantic understanding similarity threshold configuration
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN116401354A (en) Text processing method, device, storage medium and equipment
CN114974253A (en) Natural language interpretation method and device based on character image and storage medium
CN111639162A (en) Information interaction method and device, electronic equipment and storage medium
CN107967304A (en) Session interaction processing method, device and electronic equipment
CN113569017A (en) Model processing method and device, electronic equipment and storage medium
CN111507114B (en) Reverse translation-based spoken language text enhancement method and system
KR20190074508A (en) Method for crowdsourcing data of chat model for chatbot
CN115617974B (en) Dialogue processing method, device, equipment and storage medium
CN116186219A (en) Man-machine dialogue interaction method, system and storage medium
CN111739518B (en) Audio identification method and device, storage medium and electronic equipment
CN114969295A (en) Dialog interaction data processing method, device and equipment based on artificial intelligence
CN111091011B (en) Domain prediction method, domain prediction device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination