CN110914898B

CN110914898B - System and method for speech recognition

Info

Publication number: CN110914898B
Application number: CN201880047060.5A
Authority: CN
Inventors: 李秀林
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2024-05-24
Anticipated expiration: 2038-05-28
Also published as: CN110914898A; WO2019227290A1

Abstract

The present application relates to a system and method for speech recognition. The system may perform the method to obtain a target audio signal comprising speech of a speaker and determine one or more speech characteristics of the target audio signal. The system may also perform the method to obtain an accent vector of the speaker. The system may further perform the method of inputting speech features of the target audio signal and accent vectors of the speaker into a trained speech recognition neural network model to translate speech into a target content form, and generating an interface through an output device to present the speech in the target content form.

Description

System and method for speech recognition

Technical Field

The present application relates generally to systems and methods for speech recognition, and more particularly to systems and methods for recognizing accent speech using Artificial Intelligence (AI).

Background

Automatic speech recognition is an important technology that can recognize and translate spoken language into computer-recognized text synonyms. In general, automatic speech recognition attempts provide accurate recognition results for speech in different languages and accents. However, recognizing and translating accented speech is challenging because accented speech utterances can lead to false recognition and word recognition failures. It is therefore desirable to provide AI systems and methods that are capable of recognizing accented speech and translating the accented speech into a desired form, such as text content or audio speech having another predetermined accent or language.

Disclosure of Invention

According to one aspect of the present application, a system is provided. The system may include at least one audio signal input device, at least one storage medium, and at least one processor. The at least one audio signal input device may be configured to receive speech of a speaker. The at least one storage medium may contain a set of instructions for speech recognition. The at least one processor may be in communication with the at least one storage medium. The at least one processor, when executing the set of instructions, may be configured to perform one or more of the following operations. The at least one processor may obtain a target audio signal comprising a speaker's speech from the audio signal input device and determine one or more speech characteristics of the target audio signal. The at least one processor may also obtain at least one accent vector of the speaker and input one or more speech features of the target audio signal and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form. The at least one processor may further generate an interface through an output device to present the speech in the target content form.

In some embodiments, to input the one or more speech features of the input audio signal and the accent of the speaker to a trained neural network model for speech recognition, the at least one processor may be configured to obtain a local accent of the speech origin region and input the one or more speech features of the target audio signal, the at least one accent vector of the speaker, and a local accent to the trained speech recognition neural network model.

In some embodiments, the accent vector comprises at least two elements. Each element may correspond to a region accent and include likelihood values associated with the region accent.

In some embodiments, at least one accent vector of the speaker may be determined based on an accent determination process. The accent determination process may include obtaining a historical audio signal including one or more historical voices of the speaker, and determining, for each of the one or more historical voices, one or more historical voice characteristics of the respective historical audio signal. The accent determination process may further include obtaining one or more regional accent models, and for each of the one or more historical voices, inputting one or more respective historical voice features into each of the one or more regional accent models. The accent determination process may further include determining at least two elements of the at least one accent vector of the speaker based on at least one output of the one or more regional accent models.

In some embodiments, the trained speech recognition neural network model may be generated by at least one computing device based on a training process. The training process may include obtaining a sample audio signal comprising at least two sample voices of at least two sample speakers, and for each of the at least two sample voices, determining one or more sample voice characteristics of the respective sample audio signal. The training process may further include, for each of the at least two sample voices, obtaining at least one sample accent vector for a respective sample speaker, and obtaining an initial neural network model. The training process may further include determining the trained speech recognition neural network model by inputting one or more sample speech features of a sample speaker corresponding to each of the at least two sample voices and the at least one sample accent vector of the sample speaker to the initial neural network model.

In some embodiments, the determining the trained speech recognition neural network model may include, for each of the at least two sample voices, obtaining a sample local accent of the sample audio signal origin region. The determining a trained speech recognition neural network model may further include determining the trained speech recognition neural network model by inputting the one or more sample speech features corresponding to each of the at least two sample voices, the at least one sample accent vector of the sample speaker, and the sample local accent to the initial neural network model.

In some embodiments, the target content form may include at least one of phonemes, syllables, letters.

In some embodiments, to input the one or more speech features of the target audio signal and the at least one accent vector of the speaker to a trained speech recognition neural network model to translate speech into a target content form, the at least one processor may be configured to input the one or more speech features and the at least one accent vector of the speaker to a trained speech recognition neural network model and translate the speech into the target content form based on at least one output of the trained neural network model.

According to another aspect of the application, a method is provided. The method may be implemented on a computing device having at least one processor, at least one memory. The method may include obtaining a target audio signal including a speaker's voice from an audio signal input device and determining one or more voice characteristics of the target audio signal. The method may further include obtaining at least one accent vector of the speaker and inputting the one or more speech features of the target audio signal and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form. The method may further include generating, by the output device, an interface to present the speech in the target content form.

According to yet another aspect of the present application, a non-transitory computer readable medium is provided. The non-transitory computer-readable medium may include executable instructions that, when executed by at least one processor, cause the at least one processor to implement a method. The method may include obtaining a target audio signal including a speaker's voice from an audio signal input device and determining one or more voice characteristics of the target audio signal. The method may further include obtaining at least one accent vector of the speaker and inputting the one or more speech features of the target audio signal and the at least one accent vector of the speaker to a trained speech recognition neural network to translate the speech into a target content form. The method may further include generating, by the output device, an interface to present the speech in the target content form.

Additional features of the application will be set forth in part in the description which follows. Additional features of the application will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following description and the accompanying drawings or may be learned from production or operation of the embodiments. The features of the present application may be implemented and realized in the practice or use of the methods, instrumentalities and combinations of various aspects of the specific embodiments described below.

Drawings

The application will be further described by way of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:

FIG. 1 is a schematic diagram of an exemplary speech recognition system shown in accordance with some embodiments of the present application;

FIG. 2 is a block diagram of exemplary hardware and/or software components of a computing device shown according to some embodiments of the application;

FIG. 3 is a block diagram of exemplary hardware and/or software components of an exemplary mobile device shown in accordance with some embodiments of the present application;

FIG. 4A is a schematic diagram of an exemplary processing engine shown in accordance with some embodiments of the present application;

FIG. 4B is a schematic diagram of an exemplary processing engine shown in accordance with some embodiments of the application;

FIG. 5 is a flowchart illustrating an exemplary process of speech recognition according to some embodiments of the application;

FIG. 6 is a flowchart illustrating an exemplary process for determining an accent vector for a speaker based on one or more regional accent models, according to some embodiments of the application;

FIG. 7 is a flowchart illustrating an exemplary process for determining an accent vector for a speaker based on an accent classification model, according to some embodiments of the application;

FIG. 8 is a flowchart illustrating an exemplary process for generating a trained voice recognition neural network, according to some embodiments of the application; and

FIG. 9 is a schematic diagram of an exemplary trained speech recognition neural network model, shown according to some embodiments of the application.

Detailed Description

The following description is presented to enable one of ordinary skill in the art to make and use the application and is provided in the context of a particular application and its requirements. It will be apparent to those having ordinary skill in the art that various changes can be made to the disclosed embodiments and that the general principles defined herein may be applied to other embodiments and applications without departing from the principles and scope of the application. Therefore, the present application is not limited to the described embodiments, but is to be accorded the widest scope consistent with the claims.

The terminology used in the present application is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting of the scope of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

These and other features, characteristics, and functions of related elements of structure and methods of operation, as well as combinations of parts and economies of manufacture, of the present application will become more apparent upon consideration of the following description of the drawings, all of which form a part of this specification. It is to be understood, however, that the drawings are designed solely for the purposes of illustration and description and are not intended as a definition of the limits of the application. It should be understood that the figures are not drawn to scale.

A flowchart is used in the present application to illustrate the operations performed by a system according to some embodiments of the present application. It should be understood that the operations in the flow diagrams may be performed out of order. Rather, the various steps may be processed in reverse order or simultaneously. Also, one or more other operations may be added to these flowcharts. One or more operations may also be deleted from the flowchart.

In the present application, positioning technologies based on Global Positioning System (GPS), global navigation satellite system (GLONASS), COMPASS navigation system (COMPASS), galileo positioning system, quasi Zenith Satellite System (QZSS), wireless fidelity (WiFi) positioning technology, etc. or any combination thereof are used. One or more of the above positioning techniques may be used interchangeably in the present application.

One aspect of the present application relates to a system and method for speech recognition. In speech recognition, the accent of the speaker may affect the recognition accuracy, which needs to be considered. According to the present application, when the system recognizes the speech of a speaker, the system may determine one or more speech features of the audio signal of the speech. The system may also obtain an accent vector for the speaker. The accent vector may include at least two elements, each element corresponding to a region accent and indicating a similarity between the speaker's accent and a particular region accent. The speech features may be input to a trained speech recognition neural network model along with accent vectors to translate speech into target content forms, such as phonemes, words, and/or sounds. Because the recognition system and method takes into account the speech characteristics of the speech itself and the accent characteristics of the speaker, the system and method improves the accuracy of the recognition results. In addition, with higher accuracy recognition rates, the systems and methods may also use artificial intelligence to translate recognized speech into other forms of expression, such as translating and displaying the original speech as text content, an audio track of another accent, and/or an audio track of another language, etc.

FIG. 1 is a schematic diagram of an exemplary artificial intelligence speech recognition system 100 shown in accordance with some embodiments of the present application. As shown in fig. 1, an artificial intelligence speech recognition system 100 (referred to as speech recognition system 100 for brevity) may include a server 110, a network 120, an input device 130, an output device 140, and a storage device 150.

In the speech recognition system 100, a speaker 160 may input spoken speech 170 into an input device 130, which may generate an audio signal that includes or encodes the speech 170. Input device 130 may provide audio signals and optional information about speaker 160 and/or input device 130 to server 110 via network 120. Information related to speaker 160 and/or input device 130 may include, for example, a user profile of speaker 160, location information of speaker 160 and/or input device 130, and the like, or any combination thereof. The server 110 can process the audio signal and optional information related to the speaker 160 and/or the input device 130 to translate the speech 170 into a target content form, such as phonemes, words, and/or sounds. The translation of the speech 170 may also be sent to the output device 140 for presentation via the network 120.

In some embodiments, the server 110 may be a single server or a group of servers. The server farm may be centralized or distributed (e.g., server 110 may be a distributed system). In some embodiments, server 110 may be local or remote. For example, server 110 may access information and/or data stored in input device 130, output device 140, and/or storage device 150 via network 120. As another example, server 110 may be directly connected to input device 130, output device 140, and/or storage device 150 to access stored information and/or data. In some embodiments, server 110 may be implemented on a cloud platform. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof. In some embodiments, server 110 may execute on a computing device 200 described in FIG. 2 that includes one or more components in the present application.

In some embodiments, server 110 may include a processing engine 112. The processing engine 112 may process information and/or data to perform one or more of the functions described in the present disclosure. For example, based on the trained speech recognition neural network model, the speech features of the speech 170 and/or the accent vector of the speaker 160, the processing engine 112 may translate the speech 170 into a target content form. For another example, the processing engine 112 may train a trained speech recognition neural network model using training samples. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., a single chip processing engine or a multi-chip processing engine). By way of example only, the processing engine 112 may include one or more hardware processors, such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an application specific instruction set processor (ASIP), an image processing unit (GPU), a physical arithmetic processing unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, and the like, or any combination thereof.

In some embodiments, at least a portion of server 110 may be integrated into input device 130 and/or output device 140. For example only, the processing engine 112 may be integrated into the input device 130. For example, the input device 130 may be a smart recorder with artificial intelligence, which may include a microprocessor and memory (e.g., hard disk) to translate recorded speech directly into text content and to store the text content in memory. When the processing engine 112 and the input device 130 are integrated together into a single device, the network 120 of FIG. 1 between the processing engine 112 and the input device 130 may become unnecessary because all communications therebetween become local. For illustrative purposes only, the present application refers to server 110 and input device 130 as separate devices for the example of speech recognition system 100.

The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components in speech recognition system 100 (e.g., server 110, input device 130, output device 140, storage device 150) may send information and/or data to other components in speech recognition system 100 via network 120. For example, server 110 may obtain/acquire a request to translate voice 170 from input device 130 via network 120. In some embodiments, the network 120 may be a wired network, a wireless network, or the like, or any combination thereof. By way of example only, the network 120 may include a cable network, a wired network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, or the like, or any combination thereof. In some embodiments, network 120 may include one or more network access points. For example, network 120 may include wired or wireless network access points, such as base stations and/or Internet switching points 120-1, 120-2, … …, through which one or more components of speech recognition system 100 may connect to network 120 to exchange data and/or information.

The input device 130 may be configured to receive sound input from a user and generate audio signals and/or encode sound input. For example, as shown in FIG. 1, input device 130 may receive speech 170 from speaker 160. The input device 130 may be a sound input device or a device including an acoustic input component (e.g., a microphone). Exemplary input devices 130 may include a mobile device 130-1, an earphone 130-2, a microphone 130-3, a music player, a recorder, an electronic book reader, a navigation device, a tablet, a laptop, a motor vehicle built-in device, a stylus, and the like, or any combination thereof. The mobile device 130-1 may include a smart home device, a wearable device, a mobile device, a virtual reality device, an augmented reality device, etc., or any combination thereof. In some embodiments, the smart home devices may include smart lighting devices, smart appliance control devices, smart monitoring devices, smart televisions, smart cameras, interphones, and the like, or any combination thereof. In some embodiments, the wearable device may include a wristband, footwear, glasses, helmet, watch, clothing, backpack, smart accessory, etc., or any combination thereof. In some embodiments, the mobile device may include a mobile phone, a Personal Digital Assistant (PDA), a gaming device, a navigation device, a point of sale (POS), a laptop, a desktop, etc., or any combination thereof. In some embodiments, the virtual reality device and/or augmented virtual reality device may include a virtual reality helmet, virtual reality glasses, virtual reality eyepieces, augmented reality helmet, augmented reality glasses, augmented reality eyepieces, and the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include Google Glass ^TM、RiftCon^TM、Fragments^TM、Gear VR^TM, or the like. In some embodiments, the built-in devices in the motor vehicle may include an on-board computer, an on-board television, and the like. In some embodiments, the input device 130 may be a device with positioning technology for positioning the user and/or the location of the input device 130.

In operation, input device 130 may activate a voice recording session to record an audio signal that includes voice 170 of speaker 160. In some embodiments, the voice recording session may be initiated automatically upon the input device 130 detecting a sound or sounds meeting a condition. For example, the condition may be that the sound from a particular speaker 160 is speech (in a human language), speech or sound (e.g., the timbre of the sound), and/or the loudness of the speech or sound is greater than a threshold, etc. Additionally or alternatively, the voice recording session may be initiated by a particular action taken by speaker 160. For example, the speaker 160 may press a button or area on the interface of the input device 130 prior to speaking, speak the utterance, and then release the button or area upon completion of the speaking. For another example, speaker 160 may initiate a voice recording session by making a predetermined gesture or sound. In some embodiments, the input device 130 may also send an audio signal including the speech 170 to the server 110 (e.g., the processing engine 112) for speech recognition.

In some embodiments, speaker 160 may be a user of input device 130. Alternatively, speaker 160 may be a person other than the user of input device 130. For example, user A of input device 130 may use input device 130 to input user B's speech. In some embodiments, "user" and "speaker" may be used interchangeably. For ease of description, the "user" and "speaker" are collectively referred to as "speaker".

After the processing engine "translates" the recorded speech into a desired form (e.g., text content, audio content, etc.), the desired form may be sent to another device for presentation. For example, the desired form may be stored in the storage device 150. The desired form may also be presented by the output device 140. The output device 140 may be configured to output and/or display voice information in a desired form. In some embodiments, the output device 140 may output and/or display information in a human-viewable manner. For example, the information output and/or displayed by the output device 140 may be in a format such as text, images, video content, audio content, graphics, and the like. In some embodiments, the output device 140 may output and/or display machine-readable information in a manner that is not visible to a person. For example, the output device 140 may store and/or cache information via a computer-readable storage interface.

In some embodiments, output device 140 may be an information output device or a device that includes an information output component. Exemplary output devices 140 may include a mobile device 140-1, a display device 140-2, a speaker 130-3, an in-vehicle device 140-4, headphones, a microphone, a music player, an electronic book reader, a navigation device, a tablet computer, a laptop computer, a stylus, a printer, a projector, a storage device, and the like, or a combination thereof. Exemplary display device 140-1 may include a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) based display, a flat panel display, a curved screen, a television device, a Cathode Ray Tube (CRT), or the like, or a combination thereof.

In some embodiments, input device 130 and output device 140 may be two separate devices connected by network 120. Or the input device 130 and the output device 140 may be integrated into a single device. Thus, the device can either receive sound input from the user or output translation information in a desired form from the sound input. For example, the integrated device may be a mobile phone. The mobile phone may receive speech 170 from speaker 160. The speech may be sent to the server 110 over the network 120 and translated by the server 110 into text words in the desired language (chinese or english words) and further sent back to the mobile phone for display. Or the local microprocessor of the mobile phone may translate the speech 170 into text words in the desired language (chinese or english words) and then displayed locally by the mobile phone. When input device 130 and output device 140 are integrated together into a single device, network 120 in fig. 1 may become unnecessary because all communications between input device 130 and output device 140 become local. For illustrative purposes only, the present application treats input device 130 and output device 140 as separate devices as examples of speech recognition system 100.

The storage device 150 may store data and/or instructions. In some embodiments, storage device 150 may store data obtained from input device 130, output device 140, and/or server 110. In some embodiments, the storage device 150 may store data and/or instructions that are used by the server 110 to perform or use to accomplish the exemplary methods described herein. In some embodiments, storage device 150 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof. Exemplary mass storage may include magnetic disks, optical disks, solid state drives, and the like. Exemplary removable memory may include flash drives, floppy disks, optical disks, memory cards, zip drives, magnetic tape, and the like. Exemplary volatile read-write memory can include Random Access Memory (RAM). Exemplary RAM may include Dynamic Random Access Memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), static Random Access Memory (SRAM), thyristor random access memory (T-RAM), zero capacitance random access memory (Z-RAM), and the like. Exemplary read-only memory may include mask read-only memory (MROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disk read-only memory, and the like. In some embodiments, the storage device 150 may be implemented on a cloud platform. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof.

In some embodiments, the storage device 150 may be connected to the network 120 to communicate with one or more components (e.g., the server 110, the input device 130, the output device 140, etc.) in the speech recognition system 100. One or more components in speech recognition system 100 may access data or instructions stored in storage device 150 via network 120. In some embodiments, the storage device 150 may be directly connected to or in communication with one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc.). In some embodiments, the storage device 150 may be part of the server 110.

In some embodiments, one or more components in speech recognition system 100 (e.g., server 110, input device 130, output device 140, etc.) may have permission to access storage device 150. In some embodiments, one or more components in speech recognition system 100 may read and/or modify information related to speaker 160 and/or the public when one or more conditions are met. For example, after speech recognition is complete, server 110 may read and/or modify information of one or more speakers.

Those of ordinary skill in the art will understand that elements of the speech recognition system 100 may be performed by electrical and/or electromagnetic signals when the elements are performed. For example, when the input device 130 processes a task, such as making a determination, identifying, or selecting an object, the input device 130 may operate logic circuitry in its processor to process such a task. When the input device 130 makes a request to the server 110, the processor of the input device 130 may generate an electrical signal encoding the request. The processor of the input device 130 may then send the electrical signal to the output port. If the input device 130 communicates with the server 110 via a wired network, the output port may be physically connected to a cable that may also transmit electrical signals to the input port of the server 110. If the input device 130 communicates with the server 110 via a wireless network, the output port of the input device 130 may be one or more antennas that may convert electrical signals to electromagnetic signals. Similarly, the output device 140 may process tasks through operation of logic circuitry in its processor and receive instructions and/or requests from the server 110 via electrical or electromagnetic signals. In an electronic device, such as input device 130, output device 140, and/or server 110, instructions are issued and/or actions are performed by electrical signals when the instructions are processed by its processor. For example, when the processor retrieves or saves data from a storage medium (e.g., storage device 150), it may send an electrical signal to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. An electrical signal may refer to an electrical signal, a series of electrical signals, and/or at least two discrete electrical signals.

FIG. 2 is a schematic diagram of exemplary hardware and software components of a computing device 200 on which server 110, input device 130, and/or output device 140 may be implemented, shown in accordance with some embodiments of the application. For example, the processing engine 112 may be implemented on the computing device 200 and perform the functions of the processing engine 112 disclosed herein.

Computing device 200 may be used to implement any of the components of speech recognition system 100 as described herein. For example, the processing engine 112 may be implemented on a computing device by way of its hardware, software programs, firmware, or a combination thereof. Only one computer is depicted for convenience, but the relevant computer functions described in this embodiment to provide the information needed for on-demand services may be implemented in a distributed manner by a similar set of platforms to distribute the processing load of the system.

For example, computing device 200 may include a communication port 250 connected to and/or a network to which it is connected to facilitate data communications. Computing device 200 may also include a processor (e.g., processor 220) in the form of one or more processors (e.g., logic circuitry) for executing program instructions. For example, the processor includes interface circuitry and processing circuitry therein. The interface circuitry may be configured to receive electrical signals from bus 210, wherein the electrical signals encode structured data and/or instructions for the processing circuitry. The processing circuitry may perform logic calculations and then encode the determined conclusions, results, and/or instructions into an electrical signal. The interface circuit may then send out electrical signals from the processing circuit via bus 210.

An exemplary computer platform may include an internal communication bus 210, different forms of program memory and data storage, such as magnetic disk 270, read Only Memory (ROM) 230, or Random Access Memory (RAM) 240, for storing a variety of data files for processing and/or transmission by a computer. An exemplary computer platform also includes program instructions stored in ROM 230, RAM 240, and/or other forms of non-transitory storage media that can be executed by processor 220. The methods and/or processes of the present application may be implemented as program instructions. The computer device 200 also includes input/output components 260 that support input/output between the computer and other components. Computing device 200 may also receive programming and data over a network communication.

For purposes of illustration only, computing device 200 depicts only one central processing unit and/or processor. It should be noted, however, that computing device 200 of the present application may include multiple CPUs and/or processors, and thus, operations and/or methods described herein as being implemented by one CPU and/or processor may also be implemented by multiple CPUs and/or processors, either collectively or independently. For example, if the CPU and/or processor of computing device 200 performs both step A and step B in the present application, it should be understood that step A and step B may also be performed jointly or independently by two different CPUs and/or processors of computing device 200 (e.g., a first processor performs step A and a second processor performs step B, or the first and second processors jointly perform steps A and B).

Fig. 3 is a schematic diagram of exemplary hardware and/or software components of an exemplary mobile device 300, upon which input device 130 and/or output device 140 may be implemented, shown in accordance with some embodiments of the present application. As shown in fig. 3, mobile device 300 may include a communication platform 310, a display 320, an image processing unit (GPU) 330, a Central Processing Unit (CPU) 340, input/output (I/O) 350, memory 360, storage 390, voice input 305, and voice output 315. In some embodiments, any other suitable component, including but not limited to a system bus or controller (not shown), may also be included within mobile device 300.

In some embodiments, an operating system 370 (e.g., iOS ^TM、Android^TM、Windows Phone^TM, etc.) and one or more application programs 380 may be downloaded from storage 390 to memory 360 and executed by CPU 340. Application 380 may include a browser or any other suitable mobile application for receiving and presenting information related to image processing or other information in processing engine 112. User interaction with the information stream may be accomplished through I/O350 and provided to processing engine 112 and/or other components of on-demand service system 100 through network 120. The voice input 305 may include an acoustic input component (e.g., a microphone). The speech output 315 may include a sound generator (e.g., speaker) that produces sound.

To implement the various modules, units, and functions thereof described herein, a computer hardware platform may be used as a hardware platform for one or more of the components described herein. A computer with a user interface component may be used to implement a Personal Computer (PC) or any other type of workstation or terminal device. If properly programmed, the computer can also act as a server.

Fig. 4A and 4B are block diagrams of exemplary processing engines 112A and 112B, shown in accordance with some embodiments of the present application. In some embodiments, the processing engine 112A may be configured to translate speech based on a trained speech recognition neural network model. The processing engine 112B may be configured to train the initial neural network model to generate a trained speech recognition neural network model. In some embodiments, processing engines 112A and 112B may be implemented on computing device 200 (e.g., processor 210) shown in FIG. 2 or CPU 340 shown in FIG. 3, respectively. For example only, the processing engine 112A may be implemented on the mobile device's CPU 340 and the processing engine 112B may be implemented on the computing device 200 alone. Or processing engines 112A and 112B may be implemented on the same computing device 200 or the same CPU 340.

The processing engine 112A may include an acquisition module 411, a determination module 412, a translation module 413, and a generation module 414.

The acquisition module 411 may be configured to obtain information related to the speech recognition system 100. For example, the acquisition module 411 may acquire a target audio signal that includes the speech 170 of the speaker 160, a historical audio signal that includes one or more historical speech of the speaker 160, an accent vector of the speaker 160, one or more regional accent models, a trained speech recognition neural network model, or the like, or any combination thereof. The acquisition module 411 may obtain information related to the speech recognition system 100 from an external data source via the network 120 and/or from one or more components of the speech recognition system 100, such as a storage device, the server 110 (e.g., the processing engine 112B).

The determination module 412 may determine one or more speech characteristics of the target audio signal, such as pitch, speech rate, linear Prediction Coefficients (LPC), mel-level frequency cepstral coefficients (MFCC), linear Prediction Cepstral Coefficients (LPCC) of the target audio signal. In some embodiments, determination module 412 may determine an accent vector for speaker 160. The accent vector may describe the accent of the speaker 160. For example, an accent vector may include one or more elements, each of which may correspond to a regional accent and include likelihood values associated with the regional accent. Details regarding the determination of accent vectors may be found elsewhere in the present application (e.g., operation 530, fig. 6 and 7, and their associated descriptions).

The translation module 413 can translate the speech 170 of the speaker 160. For example, translation module 413 can input one or more speech features of a target audio signal comprising speech 170, an accent vector of speaker 160, and/or a local accent corresponding to speech 170 to a trained speech recognition neural network to translate speech 170 into a target content form. The target content form may be phonemes, syllables, letters, etc., or any combination thereof. Details regarding the translation of the speech 170 may be found elsewhere in the present application (e.g., operation 550 and its associated description).

The generation module 414 may generate an interface through the output device 140 to present the speech 170 in the target content form. The interface generated by the output device 140 may be configured to present the voice 170 in the target content form. In some embodiments, to present the speech 170 in different target content forms, the generation module 414 may generate different interfaces through the output device 140. For example, to present the speech 170 in the form of sound, the generation module 414 may generate a play interface through the output device 140; to present the speech 170 in text form, the generation module 414 may generate a display interface through the output device 140.

The processing engine 112B may include an acquisition module 421, a determination module 422, and a training module 423.

The acquisition module 421 may obtain information for generating a trained speech recognition neural network model. For example, the acquisition module 421 may acquire a sample audio signal comprising at least two sample voices of at least two sample speakers, a sample accent vector for each sample speaker, a sample local accent corresponding to each sample voice, an initial neural network model for training, and the like, or any combination thereof. The acquisition module 421 may obtain information for generating a trained speech recognition neural network model from an external data source via the network 120 and/or from one or more components of the speech recognition system 100 (e.g., storage device, server 110).

The determination module 422 may determine one or more sample speech features of a sample audio signal that includes sample speech. For sample speech, the sample speech features of the corresponding sample audio signal may include pitch, speech rate, LPC, MFCC, LPCC, etc., or any combination thereof. Details regarding the determination of sample speech features may be found elsewhere in the present application (e.g., operation 820 and its associated description).

The training module 423 may train the initial neural network model to generate a trained speech recognition neural network model. For example, training module 423 may train an initial neural network model using input data (e.g., information related to at least two audio signals including at least two sample voices of at least two sample speakers). Details regarding the generation of the trained speech recognition neural network model may be found elsewhere in the present application (e.g., operation 860 and its associated descriptions).

The modules in the processing engines 112A and 112B may be connected or communicate with each other via wired or wireless connections. The wired connection may include a metal cable, fiber optic cable, hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN), a Wide Area Network (WAN), bluetooth, zigBee network, near Field Communication (NFC), or the like, or any combination thereof. Two or more modules may be combined into one module, and any one module may be split into two or more units. For example, the processing engines 112A and 112B may be integrated into the processing engine 112 for generating a trained speech recognition neural network model and applying the trained speech recognition neural network model in speech recognition. In some embodiments, processing engine 112 (processing engine 112A and/or processing engine 112B) may include one or more additional modules. For example, processing engine 112A may include a storage module (not shown) configured to store data.

FIG. 5 is a flowchart illustrating an exemplary process of speech recognition according to some embodiments of the application. Process 500 may be performed by speech recognition system 100. For example, process 500 may be implemented as a set of instructions (e.g., an application program) stored in storage device 150. Processing engine 112 (e.g., processing engine 112A) may execute a set of instructions and thus be directed to perform process 500.

At 510, the processing engine 112A (e.g., the acquisition module 411) may acquire a target audio signal that includes the speech 170 of the speaker 160.

The target audio signal may represent the voice 170 and characteristic information (e.g., frequency) of the voice 170 is recorded. In some embodiments, the target audio signal may be obtained by one or more components of the speech recognition system 100. For example, the target audio signal may be obtained from an input device 130 (e.g., smart phone 130-1, headset 130-2, microphone 130-3, navigation device). The target audio signal may be input by speaker 160 via input device 130. As another example, the target audio signal may be retrieved from a storage device (e.g., storage device 150 and/or memory 390) in the speech recognition system 100. In some embodiments, the target audio signal may be obtained from an external data source, such as a voice library, via the network 120.

In some embodiments, voice 170 may include a request for on-demand services, such as taxi services, driver services, express services, carpool services, bus services, driver rental services, and airliner services. Additionally or alternatively, the voice 170 may include information related to a request for on-demand service. For example only, voice 170 may include a start point and/or a destination related to requesting taxi services. In some embodiments, the voice 170 may include commands that instruct the input device 130 and/or the output device 140 to perform a particular action. For example only, the voice 170 may include a command that instructs the input device 130 to call someone.

At 520, the processing engine 112A (e.g., the determination module 412) may determine one or more speech features of the target audio signal.

Voice characteristics may refer to acoustic properties of the voice 170 that may be recorded and analyzed. Exemplary speech features may include pitch, speech rate, linear Prediction Coefficients (LPCs), mel-level frequency cepstral coefficients (MFCCs), linear Prediction Cepstral Coefficients (LPCCs), and the like, or any combination thereof.

In some embodiments, the speech features may be represented or recorded in the form of feature vectors. In some embodiments, the feature vector may be represented as a carrier having a column or row. For example, the feature vector may be a row vector represented as a1×n determinant (e.g., a1×108 determinant). In some embodiments, the feature vector may correspond to an N-dimensional coordinate system. An N-dimensional coordinate system may be associated with the N-phonetic feature. In some embodiments, the determination module 412 may immediately process one or more feature vectors. For example, the m first feature vectors (e.g., three rows of vectors) may be integrated into a1×mn vector or an m×n matrix, where m is an integer.

In some embodiments, the determined feature vector may be indicative of one or more speech features of the target audio signal during a time window, e.g., 10 milliseconds (ms), 25ms, or 50ms. For example, the determination module 412 may segment the target audio signal into at least two time windows. The different time windows may have the same duration or different durations. Two consecutive time windows may or may not overlap each other. The determination module 412 may then perform a Fast Fourier Transform (FFT) on the audio signal in each time window. From the FFT data of the time window, the determination module 412 may extract speech features represented as feature vectors of the time window.

At 530, processing engine 112A (e.g., acquisition module 411) may obtain an accent vector for speaker 160.

An accent vector may refer to a vector describing the accent of speaker 160. The accent of the speaker 160 is a specific pronunciation pattern for the speaker 160. In speech recognition, accent utterances may cause false recognition and word recognition failure. If the accent of the speaker 160 is considered, the accuracy of speech recognition may be improved. Typically, a person's accent is associated with one or more locations where the person lives. In the case of a person living in at least two locations, his/her accent may be a mixed accent. By way of example only, for a person born in the Shandong, learning in the Shanghai, working now in Beijing, his/her accent may be a mix and/or combination of Shandong accent, shanghai accent and Beijing accent.

To describe the accent of speaker 160, an accent vector is provided. In some embodiments, the accent vector may include one or more elements, each of which may correspond to a regional accent and include likelihood values associated with the regional accent. Region accents may represent the manner of pronunciation specific to one or more particular regions. As used herein, regional accents are virtually identical to the geographic region from which the accent came, since the accent is almost always localized, i.e., regional accents may also be referred to as one or more geographic regions to which the accent pertains. For example, the Shandong accent may also be referred to as a Shandong accent, since the Shandong accent is mainly spoken by people growing in or living in Shandong. For another example, the northeast accent of China may also be referred to as Liaoning province, jilin province, heilongjiang province, and/or elsewhere where the northeast accent of China belongs. In some embodiments, a relatively large region may be divided into at least two sub-regions according to the respective region accents. For example, china may be divided into at least two regions, each region having a unique region accent. Because the accents of the three provinces are similar, liaoning provinces, jilin provinces and Heilongjiang provinces can be divided into one region. In some embodiments, the correspondence between the region and the region accent may be stored in a storage device of the speech recognition system 100, such as the storage device 150, ROM 230, RAM240, and/or memory 390.

The likelihood value associated with the regional accent may represent a likelihood value that the accent of speaker 160 is the regional accent. In other words, likelihood values associated with regional accents may measure the similarity or difference between the accents of speaker 160 and the regional accents. For example, if the likelihood value associated with a Beijing accent is at or near 100%, speaker 160 may speak a pure Beijing dialect. As another example, if the likelihood values associated with mandarin and taiwan accents are 90% and 10%, respectively, speaker 160 may speak mandarin and slight taiwan accents.

In some embodiments, the accent vector of speaker 160 may be a single thermal vector or an embedded vector. In a single heat vector, likelihood values associated with regional accents that speaker 160 has may be represented as 1, and likelihood values of regional accents that speaker 160 does not have may be represented as 0. The regional accent that speaker 160 has may refer to a regional accent having a corresponding likelihood value equal to or greater than a certain value (e.g., 0%, 10%, etc.). In the embedded vector, the value of the regional accent that the speaker 160 has may be represented as a real number, and the likelihood value of the regional accent that the speaker 160 does not have may be represented as 0. For example only, the accent vector may be represented as the following equation (1):

X＝{X(1)，X(2)，…，X(i)} (1)

Wherein X refers to the accent vector of speaker 160; i refers to the i-th region accent; x (i) refers to likelihood values associated with the i-th region accent. The likelihood value L _i may be any positive value. For example, if mandarin is the first element in vector X, shandong accent is the second element in vector X, and shanghai accent is the 3 rd element in vector X, then the vector of x= {0.7,0.2,0.1} may mean that the speaker's accent includes or consists essentially of 80% mandarin, 20% shandong accent, and 10% shanghai accent.

In some embodiments, the acquisition module 411 may obtain the accent vector of the speaker 160 from an external source and/or one or more components of the speech recognition system 100. Further, the speaker's accent profile (e.g., accent vector X) may be predetermined by the speech recognition system 100. For example, the accent vector of speaker 160 may be recorded and/or included in his/her user profile and stored in a memory device (e.g., memory device 150, ROM 230, RAM 240, and/or memory 390) of speech recognition system 100. The acquisition module 411 may access the storage device and retrieve the accent vector of the speaker 160. In some embodiments, the accent vector of speaker 160 may be input by speaker 160 via input device 130. Additionally or alternatively, the accent vector may be determined by the processing engine 112A (e.g., the determination module 412) and sent to a storage device for storage.

In some embodiments, based on the user profile of speaker 160, determination module 412 may determine one or more regional accents and corresponding likelihood values. The user profile of speaker 160 may include hometown, telephone number, educational experience, work experience, log information (e.g., weblogs, software logs), historical service orders, etc., or any combination thereof. Since the accent vector determined under the user profile is not dependent on his/her actual accent, the value (or likelihood value) of the vector element may be a binary value. For example, the determination module 412 may determine one or more zone accents that the speaker 160 has based on the speaker 160's birth, learning, work, long-lived zone (e.g., length of residence longer than a threshold), etc., or any combination thereof. Thus, based on where the speaker is born, the likelihood value of the regional accent associated with the geographic region where the speaker 160 is born may be designated as 1 and the regional accent associated with the geographic region other than where the speaker is born may be designated as 0. Based on the length of time that speaker 160 remains, the zone accent associated with the geographic zone where the speaker resides and/or has a lifetime longer than the threshold may be designated as 1 and the zone accent associated with the geographic zone where the speaker resides and/or has a lifetime shorter than the threshold may be designated as 0.

In some embodiments, the user profile of speaker 160 may include a historical audio signal that includes one or more historical voices of speaker 160. The determination module 412 may determine the accent vector of the speaker 160 by analyzing the historical audio signal. Details regarding determining accent vectors based on historical audio signals may be found elsewhere in the present application (e.g., fig. 6 and 7 and their associated descriptions).

In 540, the collection module 411 may obtain a local accent of the region of origin of the speech 170. The manner in which a person speaks may be affected by the language environment. For example, for a speaker with a mixed accent of shandong and beijing, he/she may have a more pronounced shandong accent in shandong and a more pronounced beijing accent in beijing. For another example, a speaker may not have a taiwan accent, and he/she may speak with a slight taiwan accent while in taiwan. Thus, local accents in the region of origin of the speech 170 may need to be considered in speech recognition.

In some embodiments, input device 130 may be a device with positioning technology for positioning speaker 160 and/or input device 130. Or the input device 130 may communicate with another pointing device to determine the location of the speaker 160 and/or the input device 130. As or after speaker 160 inputs speech 170 via input device 130, input device 130 may send location information of speaker 160 to server 110 (e.g., processing engine 112A). Based on the location information, the server 110 (e.g., the processing engine 112A) may determine a local accent of the region of origin of the speech 170. For example, the determination module 412 may determine a region accent for a region that generates the speech 170 based on a correspondence between the region and the region accent. The determination module 412 may also designate a regional accent of the region where the speech 170 is generated as a local accent.

In 550, the processing engine 112A (e.g., translation module 413) may input one or more speech features of the target audio signal, the accent vector of the speaker 160, and the local accent into the trained speech recognition neural network to translate the speech 170 into the target content form.

The target content form may be phonemes, syllables, letters, etc., or any combination thereof. Phonemes refer to perceptually different units of speech in a given language that distinguish one word from another. Syllables refer to pronunciation units with a vowel, with or without surrounding consonants, forming all or part of a word, and translated speech in syllable form may be the sound or speech of a word. The phonemes and/or syllables may be speech at different accents. For example, the target content form of the audio speech under the Shandong accent may include phonemes of the same speech used by Mandarin or Cantonese. The phonemes and/or syllables may also be another language than the original speech. For example, audio speech may be spoken in chinese, and the target content of the audio speech may include phonemes of the same speech spoken in english. Letters refer to units of written language and translated speech in the form of letters may be one or more words. For example, the processing engine 112A may recognize phonemes and/or syllables in the audio speech and translate the phonemes and/or syllables into letters of a written word in a particular language that is the same or different than the language of the audio speech. In some embodiments, the target content form may be a target language or accent. For example, the translated speech may be an English word.

The trained speech recognition neural network model may be configured to generate a neural network output based on the input. For example, the trained speech recognition neural network model may output likelihood values for particular phones corresponding to inputs (e.g., speech features, accent vectors, and local accents). In some embodiments, the neural network output may be the same form of content as the target content form. For example, the neural network output may be in the form of phonemes, which are identical to the target content form. There is no need to convert the neural network output into the target content form. In some embodiments, the neural network output may be a different form of content than the target form of content. For example, the neural network output may be in the form of phonemes, while the target content form is syllables or letters. The translation module 413 may further transform the neural network output into a target content form. For example only, the translation module 413 may input the neural network output in the form of phonemes into a set of Weighted Finite State Transducers (WFST) to generate a word lattice. The translation module 413 can also derive a transcription of the speech 170 from the word lattice. The transcription of the speech 170 may be a sound including spoken words or text including words. In some embodiments, the target content form may be a target language or accent. Transcription of the speech 170 may be further translated into a target language or accent. Translation of the transcription may be performed by translation module 413 based on translation techniques or an external translation platform (e.g., translation website, translation software).

In some embodiments, the trained speech recognition neural network model may be retrieved by translation module 413 from a storage device (e.g., storage device 150) in speech recognition system 100 and/or an external data source (not shown) via network 120. In some embodiments, the processing engine 112B may generate and store the trained speech recognition neural network model in a storage device. Translation module 413 may access the storage device and retrieve the trained speech recognition neural network model.

In some embodiments, the processing engine 112B may train the trained speech recognition neural network model based on a machine learning method. The machine learning method may include, but is not limited to, artificial neural network algorithms, deep learning algorithms, decision tree algorithms, association rule algorithms, inductive logic programming algorithms, support vector machine algorithms, clustering algorithms, bayesian network algorithms, reinforcement learning algorithms, representation learning algorithms, similarity and metric learning algorithms, sparse dictionary learning algorithms, genetic algorithms, rule-based machine learning algorithms, and the like, or any combination thereof. In some embodiments, the processing engine 112B may train the trained speech recognition neural network model by performing the process 800 shown in fig. 8.

At 560, the generation module 414 may generate an interface through the output device 140 to present the speech 170 in the target content form.

The interface generated by the output device 140 may be configured to present the voice 170 in the target content form. To present the speech 170 in different target content forms, the generation module 414 may generate different interfaces through the output device 140. In some embodiments, the interface for presenting the voice 170 may be human-visible. For example, the speech 170 in the target content form may be one or more sounds and/or words, as described in connection with operation 550. To present the speech 170 in the form of sound, the generation module 414 may generate a playback interface through the output device 140. To present the speech 170 in text form, the generation module 414 may generate a display interface through the output device 140. In some embodiments, the interface for presenting the voice 170 may be invisible to humans but machine readable. For example, the speech 170 in the target content form may be phonemes. The generation module 414 may generate a storage interface through the output device 140 to store and/or buffer the speech 170 in the form of phonemes, which may not be directly readable by the human eye, but may be readable, translatable, and usable by a smart device (e.g., a computer or other device).

Output device 140 may be, for example, a mobile device 140-1, a display device 140-2, a speaker 130-3, an in-vehicle device 140-4, a storage device, and/or any device that may output and/or display information. In some embodiments, the input device 130 and the output device 140 may be integrated into the device or into two separate devices. For example, the speech 170 may be recorded by a mobile phone and then translated into words by the processing engine 112A. The generation module 414 may generate an interface for displaying the translated word through the mobile phone itself or another output device 140 (e.g., another mobile phone, a laptop computer, etc.).

It should be noted that the above description of process 500 is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various modifications and alterations will occur to those skilled in the art in light of the present description. However, such changes and modifications do not depart from the scope of the present application.

In some embodiments, one or more additional operations may be added, or one or more operations of process 500 may be omitted. For example, operation 540 may be omitted. In 550, translation module 413 may input the one or more speech features of the target audio signal and the accent vector of speaker 160 to the trained speech recognition neural network model. In some embodiments, the order of the operations of process 500 may be changed. For example, operations 530 and 540 may be performed simultaneously or in any order.

In some embodiments, in 530, the accent of speaker 160 may be expressed in a form other than an accent vector. For example, the accent may be represented by a polynomial or matrix. The polynomial or matrix may include one or more elements that are similar to the elements of the accent vector. Each element may correspond to a region accent and include likelihood values associated with the region accent. In some embodiments, in 530, the acquisition module 411 may acquire at least two accent vectors. The at least two accent vectors may correspond to different regional accents. Alternatively, the at least two accent vectors may be integrated into a single accent vector by the determination module 412 before being input into the trained speech recognition neural network model. In some embodiments, the determination module 412 may also acquire or determine one or more audio characteristics related to the target audio signal. The audio characteristics may be independent of words in the speech 170 spoken by the speaker 160. For example, the audio features may indicate one or more features corresponding to background noise, recording channel properties, speaker's speaking style, speaker's gender, speaker's age, etc., or any combination thereof. At 550, the audio features may be input to the trained speech recognition neural network model along with one or more speech features, accent vectors, and/or local accents.

Fig. 6 is a flowchart illustrating an exemplary process for determining an accent vector for a speaker based on one or more regional accent models, according to some embodiments of the application. Process 600 may be performed by speech recognition system 100. For example, process 600 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. Processing engine 112 (e.g., processing engine 112A) may execute a set of instructions and, thus, may instruct execution of process 600. In some embodiments, process 600 may be an embodiment of operation 530 with reference to fig. 5.

At 610, processing engine 112A (e.g., acquisition module 411) may acquire a historical audio signal comprising one or more historical voices of speaker 160. Each historical speech may be encoded in one or more historical audio signals. The historical audio signal may be obtained from one or more components of the speech recognition system 100 or an external data source. For example, the historical audio signal may be obtained from a storage device (e.g., storage device 150, ROM 230, RAM 240, and/or memory 390) in the speech recognition system 100. In some embodiments, the historical speech may be input by speaker 160 via input device 130. In some embodiments, the history speech may include history requests for on-demand services or information related to history requests. The historical audio signal comprising the historical speech may be similar to the target audio signal comprising the speech 170 described in connection with operation 510 and the description thereof is not repeated.

At 620, for each of the one or more historical voices, the processing engine 112A (e.g., determination module 412) may determine one or more historical voice characteristics of the respective historical audio signal. For historical speech, the historical speech characteristics of the corresponding historical audio signal may include pitch, speech rate, LPC, MFCC, LPCC, etc., or any combination thereof. In some embodiments, the historical speech features corresponding to the historical speech may be represented or recorded in the form of feature vectors. Operation 620 may be performed in a similar manner to operation 520, and a description thereof will not be repeated here.

In 630, the processing engine 112A (e.g., the acquisition module 411) may obtain one or more regional accent models. The regional accent model may be a specific accent model corresponding to the speaker's language or regional accent. For example only, the regional accent model may include languages corresponding to different languages (e.g., english, japanese, or Spanish) and/or languages corresponding to different regional accents (e.g., regional accents may include Mandarin, cantonese, etc., for Chinese; and may include United states English, british English, indian English, etc., for English). The regional accent model corresponding to the regional accent or language may be configured to generate a model output based on the speech characteristics of the speech. The model output may indicate that the speaker of the speech has a corresponding regional accent or likelihood or probability of speaking the language. The model output may be further used to construct an accent vector for the speaker, as will be described in detail in connection with operation 650.

In some embodiments, the acquisition module 411 may obtain the regional accent model from an external data source or one or more components of the speech recognition system 100. For example, the acquisition module 411 may obtain the regional accent model from, for example, a speech database and/or a language library external to the speech recognition system 100. As another example, the acquisition module 411 may obtain the regional accent model from a storage device (e.g., the storage device 150, the ROM 230, the RAM 240, and/or the memory 390) of the speech recognition system 100. In some embodiments, the regional accent model may be trained by the server 110 (e.g., the processing engine 112B) and stored in a storage device. For example, a region accent model corresponding to a particular region accent may be trained by the processing engine 112B using a training set. The training set may comprise, for example, speech features of at least two sample voices belonging to a regional accent.

At 640, for each of the one or more historical voices, the processing engine 112A (e.g., the determination module 412) may input one or more corresponding historical voice features into each of the one or more regional accent models. For historical speech, each regional accent model may generate a model output that indicates likelihood values or probabilities that speaker 160 has a corresponding regional accent. For example, historical speech features derived from three historical voices may be respectively input into the beijing accent model. The output of the Beijing accent model may represent that speaker 160 has 80%, 85%, and 70% likelihood values for the Beijing accent.

In 650, determination module 412 may determine at least two elements of the accent vector of speaker 160 based on the output of the at least one or more regional accent models. Each element of the accent vector may correspond to a regional accent and include likelihood values associated with the regional accent as described in connection with fig. 6. Likelihood values associated with the regional accents may be determined from the output of the regional accent model corresponding to the regional accents.

Taking Beijing accent as an example, based on the output of the Beijing accent model relative to one or more historical voices, corresponding likelihood values may be determined. For example, the determination module 412 may determine that the speaker 160 has an overall likelihood value for the beijing accent, and further determine the likelihood value accordingly. The overall likelihood that speaker 160 has a beijing accent may be, for example, the maximum, average, or median of the output of the beijing accent model. In some embodiments, the likelihood value of the beijing accent may be determined by normalizing the overall likelihood value of the beijing accent with the overall likelihood values of other regional accents. In some embodiments, the determination module 412 may apply a minimum threshold in the likelihood value determination. In this case, when the (normalized) overall likelihood value of the beijing accent is below the minimum threshold, the determination module 412 may determine that the speaker 160 does not have the beijing accent, and the corresponding likelihood value may be represented as 0. When the (normalized) overall likelihood value of the beijing accent is not below the minimum threshold, the determination module 412 may determine that the speaker 160 has a beijing accent, and the corresponding likelihood value may be represented as either a (normalized) overall likelihood value or 1. In some embodiments, the determination module 412 may rank the overall likelihood values of the accents of the different regions. When the beijing accent is ranked in the top N (e.g., 1,3, 5) of the ranking result, the determination module 412 may determine that the speaker 160 has a beijing accent, and the corresponding likelihood value may be represented as a (normalized) overall likelihood value or 1. Otherwise, the likelihood value associated with the beijing accent may be represented as 0.

Fig. 7 is a flowchart illustrating an exemplary process for determining an accent vector for a speaker based on an accent classification model, according to some embodiments of the application. Process 700 may be performed by speech recognition system 100. For example, process 700 may be implemented as a set of instructions (e.g., an application program) stored in storage device 150. Processing engine 112 (e.g., processing engine 112A) may execute a set of instructions and, thus, may instruct execution of process 700. In some embodiments, process 700 may be an embodiment of operation 530 with reference to fig. 5.

At 710, the processing engine 112A (e.g., the acquisition module 411) may acquire a historical audio signal including one or more historical voices of a speaker. At 720, for each of the one or more historical voices, the determination module 412 may determine one or more historical voice characteristics of the respective historical audio signal. Operations 710 and 720 may be performed in a similar manner to operations 610 and 620, respectively, and a description thereof will not be repeated here.

At 730, the processing engine 112A (e.g., the acquisition module 411) may obtain an accent classification model. The accent classification model may be configured to receive speech features of speech and classify accents of speakers speaking the speech into one or more accent classifications. Accent classifications may include, for example, one or more languages (e.g., english, japanese, or spanish) and/or one or more regional accents (e.g., mandarin, cantonese, taiwan accent, american english, uk english). In some embodiments, the classification result may be represented by one or more regional accents that the speaker has. Additionally or alternatively, the classification result may include a probability or likelihood value that the speaker has a particular region accent. For example, the classification result may indicate that the speaker's accent has a mandarin likelihood value of 70% and a cantonese likelihood value of 30%.

In some embodiments, the acquisition module 411 may obtain the accent classification model from an external data source and/or one or more components of the speech recognition system 100 (e.g., storage device, server 110). The acquisition of the accent classification model may be similar to the acquisition of the regional accent model described in connection with operation 630, and the description thereof is not repeated. In some embodiments, the accent classification model may be trained by the server 110 (e.g., the processing engine 112B) and stored in a storage device. For example, the accent classification model may be trained by the processing engine 112B using a set of training samples, each of which may be labeled as belonging to a particular accent classification.

At 740, for each of the one or more historical voices, the processing engine 112A (e.g., the determination module 412) may input one or more corresponding historical voice features into the accent classification model. For historical speech, the accent classification model may output classification results for the accents of the speaker 160.

In 750, processing engine 112A (e.g., determination module 412) may determine at least two elements of the accent vector of speaker 160 based at least on the output of the accent classification model. Each element of the accent vector may correspond to a regional accent and include likelihood values associated with the regional accent as described in connection with fig. 6. Likelihood values associated with regional accents may be determined based on classification results of the accent classification model.

In some embodiments, the classification results corresponding to the historical speech may include one or more regional accents that the speaker 160 has. The determination module 412 may determine that the speaker 160 has an overall likelihood value for the region accent and assign it as the likelihood value associated with the region accent. By way of example only, a classification result based on historical speech indicates that speaker 160 has a Beijing accent and a Shandong accent, and a classification result based on another historical speech indicates that speaker 160 has only a Beijing accent. The determination module 412 may determine that the overall likelihood that the speaker 160 has a Beijing accent and a Shandong accent is 2/3 and 1/3. In some embodiments, the classification results corresponding to the historical speech may include likelihood that speaker 160 has a particular region accent. The determination module 412 may determine that the speaker 160 has overall likelihood values for the region-specific accent and further determine likelihood values accordingly. Details regarding determining likelihood values associated with regional accents based on their overall likelihood values may be found elsewhere in the present application (e.g., operation 650 and its associated descriptions).

FIG. 8 is a flowchart illustrating an exemplary process for generating a trained speech recognition neural network model, according to some embodiments of the application. Process 800 may be performed by speech recognition system 100. For example, process 800 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. Processing engine 112 (e.g., processing engine 112B) may execute a set of instructions and, thus, may instruct processing engine 112 to execute processing engine 112.

At 810, the processing engine 112B (e.g., the acquisition module 421) may acquire a sample audio signal comprising at least two sample voices of at least two sample speakers. Each sample speech of a sample speaker may be encoded or included in one or more sample audio signals. In some embodiments, the sample audio signal may be obtained from one or more components of the speech recognition system 100, such as a storage device (storage device 150, ROM 230, RAM 240, and/or memory 390), input device 130, and the like. For example, the storage device 150 may store a historical audio signal comprising at least two historical voices of a user of the speech recognition system 100. The historical audio signal may be retrieved from the storage device 150 and designated as a sample audio signal by the acquisition module 421. In some embodiments, the sample audio signal may be obtained from an external data source (e.g., a voice library) via the network 120. The sample audio signal including the sample speech may be similar to the target audio signal including the speech 170 described in connection with operation 510, and the description thereof is not repeated.

At 820, for each of the at least two sample voices, the processing engine 112B (e.g., determination module 422) may determine one or more sample voice features of the respective sample audio signal. For sample speech, the sample speech features of the corresponding sample audio signal may include pitch, speech rate, LPC, MFCC, LPCC, etc., or any combination thereof. Operation 820 may be performed in a similar manner to operation 520, and a description thereof is not repeated here.

At 830, for each of the at least two sample voices, the processing engine 112B (e.g., the collection module 421) can determine a sample accent vector for the respective sample speaker. The sample accent vector of the sample speaker may be a vector describing the accent of the sample speaker. The acquisition of the sample accent vector may be similar to the acquisition of the accent vector described in connection with operation 530, and the description thereof is not repeated.

At 840, for each of the at least two sample voices, the processing engine 112B (e.g., the collection module 421) may obtain a sample local accent of the sample voice origin region. Operation 840 may be performed in a similar manner to operation 540 and the description thereof will not be repeated here.

At 850, the processing engine 112B (e.g., the acquisition module 421) may acquire an initial neural network model. Exemplary initial neural network models may include Convolutional Neural Network (CNN) models, artificial Neural Network (ANN) models, recurrent Neural Network (RNN) models, deep belief network models, perceptron neural network models, stacked self-encoding network models, or any other suitable neural network model. In some embodiments, the initial neural network model may include one or more initial parameters. One or more initial parameters may be adjusted during the training process of the initial neural network model. The initial parameters may be default settings of the speech recognition system 100 or may be adjustable in different situations. In some embodiments, the initial neural network model may include at least two processing layers, e.g., an input layer, a hidden layer, an output layer, a conventional layer, a sink layer, an activation layer, etc., or any combination thereof.

In 860, the processing engine 112B (e.g., the training module 423) may determine a trained speech recognition neural network model by inputting one or more sample speech features corresponding to each sample speech, a sample accent vector of a sample speaker, and a sample local accent into the initial neural network model. For brevity, the sample speech features, sample accent vectors of the sample speakers, and sample local accents corresponding to each sample speech input into the initial neural network model may be referred to as input data.

In some embodiments, the input data may be input into an initial neural network model to generate an actual output. The training module 423 may compare the actual output to the expected or correct output to determine the loss function. The desired or correct output may include, for example, desired or corrected likelihood values of the input data corresponding to a particular phoneme. In some embodiments, the expected or corrected likelihood value may be determined based on a correct translation of the sample speech (possibly in word or sound format). The loss function may measure the difference between the actual output and the desired output. During the training process, the training module 423 may update the initial parameters to minimize the loss function. In some embodiments, the minimization of the loss function may be iterative. The iteration of minimizing the loss function may be terminated until the new loss function is less than a predetermined threshold. The predetermined threshold may be manually set or determined based on various factors (e.g., accuracy of the trained speech recognition neural network model, etc.).

It should be noted that the above description of process 800 is provided for illustrative purposes only and is not intended to limit the scope of the present application. Many variations and modifications will be apparent to those of ordinary skill in the art, given the benefit of this disclosure. However, such changes and modifications do not depart from the scope of the present application. In some embodiments, one or more additional operations may be added, or one or more operations of process 800 may be omitted. For example, operation 840 may be omitted. At 860, training module 423 may train the initial neural network model using one or more sample speech features of the sample speaker corresponding to each sample speech and the sample accent vector.

FIG. 9 is a schematic diagram of an exemplary trained speech recognition neural network model 910, according to some embodiments of the application. As shown in fig. 9, the trained speech recognition neural network model 910 may include at least two processing layers, e.g., an input layer 911, a plurality of hidden layers (e.g., 912A and 912B), and an output layer 913.

To identify the speech 170 of the speaker 160, the input layer 911 may receive input data related to the speech 170. For example, one or more speech features 930 originating from the speech 170, an accent vector of the speaker 160, and a local accent of the region of origin of the speech 170 may be provided as input data to the input layer 911. In some embodiments, the accent vector of speaker 160 may be determined based on one or more accent models 921 (e.g., regional accent models or accent classification models). The regional accent model or accent classification model may be a trained neural network model for determining and/or classifying accents of the speaker 160.

The output layer 913 of the trained speech recognition neural network model 910 may generate a model output. Model outputs may include, for example, likelihood values that the combination of speech features 930, accent vector 920, and local accents represent a particular phoneme. In some embodiments, the processing engine 112A (e.g., translation module 413) may further convert the model output into a transcription of the target content form (e.g., sound, word, etc.). The transcript may be sent to a user terminal for presentation.

It should be noted that the example shown in fig. 9 is provided for illustration purposes only and is not intended to limit the scope of the present application. Various changes and modifications may be made by one of ordinary skill in the art in light of the description of the application. However, such changes and modifications do not depart from the scope of the present application. For example, the trained speech recognition neural network model 910 may include other types of processing layers, such as a conventional layer, a collection layer, an activation layer, and the like, or any combination thereof. As another example, the trained speech recognition neural network model 910 may include any number of processing layers. As yet another example, the input data may include speech features 930 and accent vectors 920 without local accents.

While the basic concepts have been described above, it will be apparent to those of ordinary skill in the art after reading this application that the above disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the application may occur to one of ordinary skill in the art. Such modifications, improvements, and modifications are intended to be suggested within the present disclosure, and therefore, such modifications, improvements, and adaptations are intended to be within the spirit and scope of the exemplary embodiments of the present disclosure.

Meanwhile, the present application uses specific words to describe embodiments of the present application. For example, "one embodiment," "an embodiment," and/or "some embodiments" means a particular feature, structure, or characteristic in connection with at least one embodiment of the application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as suitable.

Furthermore, those of ordinary skill in the art will appreciate that aspects of the application are illustrated and described in the context of a number of patentable categories or conditions, including any novel and useful processes, machines, products, or materials, or any novel and useful improvements thereof. Accordingly, aspects of the present application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "unit," module, "or" system. Furthermore, aspects of the present application may take the form of a computer program product embodied in one or more computer-readable media, wherein the computer-readable program code is embodied therein.

The computer readable signal medium may comprise a propagated data signal with computer program code embodied therein, for example, on a baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, etc., or any suitable combination. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer readable signal medium may be propagated through any suitable medium including radio, cable, fiber optic cable, RF, etc., or any combination of the foregoing.

The computer program code necessary for operation of portions of the present application may be written in any one or more programming languages, including a body oriented programming language such as Java, scala, smalltalk, eiffel, JADE, emerald, C ++, c#, vb net, python, etc., a conventional programming language such as C language, visual Basic, fortran2003, perl, COBOL2002, PHP, ABAP, dynamic programming languages such as Python, ruby and Groovy, or other programming languages, etc. The program code may execute entirely on the user's computer, or as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or the use of services such as software as a service (SaaS) in a cloud computing environment.

Furthermore, the order in which the elements and sequences are presented, the use of numerical letters, or other designations are used in the application is not intended to limit the sequence of the processes and methods unless specifically recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of example, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the application. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in order to simplify the description of the present disclosure and thereby aid in understanding one or more embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, less than all of the features of a single embodiment disclosed above.

Claims

1. A system, comprising:

at least one audio signal input device configured to receive speech of a speaker;

at least one storage medium containing a set of instructions for speech recognition;

At least one processor in communication with the at least one storage medium, wherein the at least one processor, when executing the set of instructions, is configured to:

Obtaining a target audio signal comprising a speaker's speech from the audio signal input device;

determining one or more speech features of the target audio signal;

acquiring at least one accent vector of the speaker;

Acquiring local accents of an origin region of the voice, wherein the local accents comprise regional accents corresponding to the voice as local accents;

Inputting the one or more speech features of the target audio signal, the at least one accent vector of the speaker, and the local accent into a trained speech recognition neural network model to translate the speech into a target content form; and

An interface is generated by an output device to present the speech in the target content form.

2. The system of claim 1, wherein the accent vector comprises at least two elements, each element corresponding to the regional accent and comprising likelihood values associated with the regional accent.

3. The system of claim 2, wherein the at least one accent vector of the speaker is determined based on an accent determination process comprising:

obtaining a historical audio signal comprising one or more historical voices of the speaker;

For each of the one or more historical voices, determining one or more historical voice characteristics of the respective historical audio signal;

Acquiring one or more regional accent models;

For each of the one or more historical voices, inputting one or more respective historical voice features into each of the one or more regional accent models; and

At least two elements of the at least one accent vector of the speaker are determined based on at least one output of the one or more regional accent models.

4. The system of claim 2, wherein the accent of the speaker is determined based on an accent determination process comprising:

acquiring an accent classification model;

for each of the one or more historical voices, inputting a respective historical voice feature into the accent classification model; and

At least two elements of the at least one accent vector of the speaker are determined based on at least one output of the accent classification model.

5. The system of claim 1, wherein the trained speech recognition neural network model is generated by at least one computing device based on a training process comprising:

acquiring a sample audio signal comprising at least two sample voices of at least two sample speakers;

For each of the at least two sample voices, determining one or more sample voice features of the respective sample audio signal;

for each of the at least two sample voices, obtaining at least one sample accent vector of a corresponding sample speaker;

Acquiring an initial neural network model;

The trained speech recognition neural network model is determined by inputting the one or more sample speech features corresponding to each of the at least two sample voices and the at least one sample accent vector of the sample speaker to the initial neural network model.

6. The system of claim 5, wherein said determining said trained speech recognition neural network model comprises:

for each of the at least two sample voices, obtaining a sample local accent of the sample audio signal origin area; and

The trained speech recognition neural network model is determined by inputting the one or more sample speech features corresponding to each of the at least two sample voices, the at least one sample accent vector of the sample speaker, and the sample local accent into the initial neural network model.

7. The system of claim 1, wherein the target content form comprises at least one of a phoneme, syllable, letter.

8. The system of claim 1, wherein to input the one or more speech features of the target audio signal and the at least one accent vector of the speaker to a trained speech recognition neural network model to translate the speech into a target content form, the at least one processor is to:

inputting the one or more speech features and the at least one accent vector of the speaker to the trained speech recognition neural network model; and

The speech is translated into the target content form based on at least one output of the trained neural network model.

9. A method implemented at a computing device, the computing device comprising at least one processor, at least one storage medium, the method comprising:

obtaining a target audio signal including a speaker's voice from an audio signal input device;

determining one or more speech features of the target audio signal;

acquiring at least one accent vector of the speaker;

10. The method of claim 9, wherein the accent vector comprises at least two elements, each element corresponding to the regional accent and comprising likelihood values associated with the regional accent.

11. The method of claim 10, wherein the at least one accent vector of the speaker is determined based on an accent determination process comprising:

Acquiring one or more regional accent models;

The at least two elements of the at least one accent vector of the speaker are determined based on at least one output of the one or more regional accent models.

12. The method of claim 10, wherein the accent of the speaker is determined based on an accent determination process comprising:

acquiring an accent classification model;

The at least two elements of the at least one accent vector of the speaker are determined based on at least one output of the accent classification model.

13. The method of claim 9, wherein the trained speech recognition neural network model is generated by at least one computing device based on a training process comprising:

For each of the at least two sample voices, determining at least one sample accent vector for the respective sample speaker;

Acquiring an initial neural network model;

14. The method of claim 13, wherein determining a trained speech recognition neural network model comprises:

The trained speech recognition neural network model is determined by inputting the one or more sample speech features corresponding to each of the at least two sample voices, the at least one sample accent vector of the sample speaker, and the sample local accent to the initial neural network model.

15. The method of claim 9, wherein the target content form comprises at least one of a phoneme, syllable, letter.

16. The method of claim 9, wherein the inputting the one or more speech features of the target audio signal and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form comprises:

17. A non-transitory computer-readable medium comprising executable instructions that, when executed by at least one processor, cause the at least one processor to implement a method comprising:

determining one or more speech features of the target audio signal;

acquiring at least one accent vector of the speaker;

acquiring local accents of an origin region of the voice, wherein the local accents comprise regional accents corresponding to the voice as local accents; inputting the one or more speech features of the target audio signal, the at least one accent vector of the speaker, and the local accent into a trained speech recognition neural network model to translate the speech into a target content form; and