CN111857646A - System for quickly realizing voice interaction function - Google Patents


Info

Publication number
CN111857646A
Authority
CN
China
Prior art keywords
voice
voice data
processing unit
digital
analog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010779872.8A
Other languages
Chinese (zh)
Inventor
刘重凯
李旭滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maosheng Intelligent Technology Co ltd
Original Assignee
Shanghai Maosheng Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maosheng Intelligent Technology Co ltd filed Critical Shanghai Maosheng Intelligent Technology Co ltd
Priority to CN202010779872.8A priority Critical patent/CN111857646A/en
Publication of CN111857646A publication Critical patent/CN111857646A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephone Function (AREA)

Abstract

The application relates to a system for quickly realizing a voice interaction function. The system acquires first analog voice data of a user through an I2S standard microphone array; a voice conversion unit converts the first analog voice data into first digital voice data and converts second digital voice data into second analog voice data; a voice processing unit performs full-link voice processing on the first digital voice data to generate the second digital voice data corresponding to it, the voice processing unit running independently in the operating system of the android device; and an I2S standard player plays the second analog voice data. The system solves the prior-art problems of high development difficulty, long development cycles, lack of portability, and high complexity of voice interaction functions, and achieves the technical effect of rapidly developing a voice interaction function.

Description

System for quickly realizing voice interaction function
Technical Field
The present application relates to the field of voice interaction, and in particular, to a system for quickly implementing a voice interaction function.
Background
With the popularization of artificial intelligence, speech recognition technology has also developed rapidly and is applied to various android devices to improve control convenience and the human-computer interaction experience.
The existing development process of the voice interaction function of the android device generally comprises the following steps:
Hardware selection: corresponding hardware, such as a Central Processing Unit (CPU), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), microphones, and loudspeakers, is selected according to the requirements of the voice interaction function for resources such as computing power, Random Access Memory (RAM), and Read-Only Memory (ROM);
Recognition engine porting: the speech recognition engine is ported to the android device, which requires cross-compiling and debugging the engine's algorithms for the operating system deployed on the device;
Recording and playback development tests: the selected hardware is developed and debugged; for example, testing the recording quality of a microphone array requires gain tests and consistency tests on the microphones, and the driver also needs development and debugging;
Upper-layer voice application development: the speech recognition engine, the recording capability, and the playback capability are integrated and debugged; that is, the microphone array is called to record the user's command, the recording is sent to the speech recognition engine to obtain a command text, the engine parses the command text so that the corresponding action is executed, and the result of the action is then announced;
Business voice capability development: business requirements are gradually realized on top of the application's voice capability; that is, business functions such as GUI (Graphical User Interface) animation rendering and anthropomorphic broadcast responses are developed based on the application's voice capability.
In the related art, developing a voice interaction function is generally difficult: software engineers, hardware engineers, and algorithm engineers must participate in the design and implementation together; the cycle from design and implementation to tuning is long and consumes a great deal of time; the voice control part of the speech recognition engine is not portable across the hardware and product requirements of different android devices; and a software engineer must implement both the invariant recognition and broadcast capability and the specific business requirements, which are difficult to develop in parallel, raising the complexity.
At present, no effective solution has been proposed for these problems of high development difficulty, long development cycles, lack of portability, and high complexity.
Disclosure of Invention
The embodiment of the application provides a system for quickly realizing a voice interaction function, so as to at least solve the problems of high development difficulty, long development cycles, lack of portability, and high complexity of voice interaction functions in the related art.
The invention provides a system for quickly realizing a voice interaction function, which is applied to android equipment and comprises the following components:
the I2S standard microphone array is used for acquiring first analog voice data of a user;
a voice conversion unit for converting the first analog voice data into first digital voice data and converting the second digital voice data into second analog voice data;
the voice processing unit is used for performing full-link voice processing on the first digital voice data to generate second digital voice data corresponding to the first digital voice data, wherein the voice processing unit independently runs on an operating system of the android device, and the full-link voice processing comprises voice recognition, semantic understanding, dialogue management, natural language generation and text-to-voice;
an I2S standard player for playing the second analog voice data.
Further, the system further includes:
an I2S standard interface for receiving the first analog voice data sent by the I2S standard microphone array and sending the first analog voice data to the voice conversion unit, and receiving the second analog voice data sent by the voice conversion unit and sending the second analog voice data to the I2S standard player.
Further, the voice processing unit includes:
and the voice recognition module is used for recognizing the first digital voice data to obtain a user intention and sending the user intention to an application layer of the android device, wherein the voice recognition module has portability and can enable the voice processing unit to run on a plurality of processor architectures.
Further, the voice processing unit further includes:
and the voice synthesis module is used for synthesizing the second digital voice data according to the execution action result under the condition that the execution action result of the execution action corresponding to the user intention is generated by the application layer of the android device.
Further, the voice processing unit further includes:
a configuration module for configuring parameters of the I2S standard microphone array.
Further, the system further includes:
and the JNI standard dynamic link library is used for receiving the first digital voice data sent by the voice conversion unit and sending the first digital voice data to the voice processing unit.
Further, the voice processing unit further includes:
and the communication interface is used for carrying out interprocess communication with the application layer of the android device.
Further, the communication interface includes:
and the calling interface is used for calling the voice processing unit to execute the execution action corresponding to the first digital voice data.
Further, the communication interface further comprises:
and the event notification interface is used for notifying the execution result to the application layer of the android device.
Further, the voice conversion unit may be a Tinyalsa audio driver.
Compared with the related art, the system for rapidly realizing a voice interaction function provided by the embodiment of the application comprises: an I2S standard microphone array for acquiring first analog voice data of a user; a voice conversion unit for converting the first analog voice data into first digital voice data and converting second digital voice data into second analog voice data; a voice processing unit for performing full-link voice processing on the first digital voice data to generate the second digital voice data corresponding to it, wherein the voice processing unit runs independently in the operating system of the android device and the full-link voice processing comprises voice recognition, semantic understanding, dialogue management, natural language generation, and text-to-speech; and an I2S standard player for playing the second analog voice data. The system solves the prior-art problems of high development difficulty, long development cycles, lack of portability, and high complexity of voice interaction functions, and realizes the technical effect of rapidly developing a voice interaction function.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a first block diagram illustrating a system for quickly implementing a voice interaction function according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention;
FIG. 3 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention;
FIG. 4 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention;
fig. 6 is a block diagram six of a structure of a system for quickly implementing a voice interaction function according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
Fig. 1 is a structural block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention. As fig. 1 shows, the system includes an I2S (Inter-IC Sound, an integrated-circuit audio bus) standard microphone array 110, a voice conversion unit 120, a voice processing unit 130, and an I2S standard player 140.
The I2S standard microphone array 110 is used to collect first analog voice data of the user and send the first analog voice data to the voice conversion unit 120. The first analog voice data collected by the I2S standard microphone array 110 is an analog signal and cannot be directly identified by the android device, so the first analog voice data needs to be sent to the voice conversion unit 120 to acquire a signal type that can be identified by the android device.
The number and parameters of the microphones in the I2S standard microphone array 110 can be configured according to product requirements, so that the array receives the user's voice data clearly; unclear voice data received by the array would otherwise impair voice interaction.
The voice conversion unit 120 is configured to convert the first analog voice data into first digital voice data, and send the first digital voice data to the voice processing unit 130. Specifically, since the first analog voice data is an analog signal and cannot be directly recognized by the android device, the voice conversion unit 120 converts the first analog voice data into first digital voice data, where the first digital voice data is a digital signal.
The voice processing unit 130 performs full-link voice processing on the first digital voice data to generate second digital voice data corresponding to the first digital voice data, and then the voice processing unit 130 transmits the second digital voice data to the voice converting unit 120; full link speech processing includes speech recognition, semantic understanding, dialog management, natural language generation, and text-to-speech, among others.
Specifically, the voice processing unit 130 performs voice recognition on the first digital voice data to generate text data, performs semantic understanding on the text data, finds the conversation process corresponding to the text data according to dialogue management, generates response text data according to that conversation process, and converts the response text data into the second digital voice data. If the first digital voice data is a command, the second digital voice data is the result of executing it; if the first digital voice data is a question, the second digital voice data is the answer to the question.
For example, in the case where the first digital voice data input by the user is "play music", the voice processing unit 130 performs full-link voice processing on the first digital voice data at this time, and generates second digital voice data "music has been played for you".
The first analog voice data and the second analog voice data are analog signals; the first digital voice data and the second digital voice data are both digital signals.
The voice conversion unit 120 converts the second digital voice data into second analog voice data and sends it to the I2S standard player 140. Because the I2S standard player 140 cannot directly recognize a digital signal, it cannot directly play the second digital voice data; the voice conversion unit 120 must first convert the second digital voice data into the second analog voice data, that is, convert the digital signal into an analog signal. The I2S standard player 140 then receives and plays the second analog voice data.
The voice conversion unit 120 may be a Tinyalsa audio driver. Tinyalsa is an audio architecture in the operating system of the android device; it can control and manage the audio channels of multiple modes, and the operating system completes operations on the underlying hardware through it. For example, the Tinyalsa audio driver may receive the first analog voice data sent by the I2S standard microphone array 110, convert it into the first digital voice data, and send that to the voice processing unit 130, which performs full-link voice processing to generate the second digital voice data; the Tinyalsa audio driver may also receive the second digital voice data from the voice processing unit 130, convert it into the second analog voice data, and send it to the I2S standard player 140 for playback.
Because the Tinyalsa audio driver can control and manage the audio channels of the various audio modes, lengthy debugging of the voice conversion unit 120 is unnecessary, which shortens the development cycle of the voice interaction function and reduces its development difficulty.
The voice processing unit 130 runs independently in the operating system of the android device, so it is completely decoupled from the application layer of the device; this removes the work of jointly building and debugging the voice processing unit with the upper-layer application and reduces the development difficulty of the voice interaction function.
For example, in the case that the voice processing unit 130 is a voice assistant program, the voice assistant program receives first digital voice data sent by Tinyalsa audio driver, and then performs voice recognition, semantic understanding, dialog management, natural language generation, and text-to-speech on the first digital voice data, and then sends the processing result to an application layer of the android device, and the application layer executes a corresponding execution action according to the processing result; the voice processing unit 130 generates second digital voice data according to the execution action result in the case where the action is executed by the application layer, and transmits the second digital voice data to the Tinyalsa audio driver.
The I2S standard player 140 receives the converted second analog voice data from the Tinyalsa audio driver and plays it. An appropriate I2S standard player 140 can be selected according to product requirements, preventing unclear playback of the second analog voice data.
By assembling a system from the relatively mature, standardized I2S standard microphone array 110, voice conversion unit 120, voice processing unit 130, and I2S standard player 140, the system serves as an independent whole that provides services to the application layer of the android device, saving the time spent on debugging and trial and error. Moreover, because the voice processing unit 130 runs as an independent process on the operating system of the android device, the work of jointly building and debugging the voice processing unit 130 with the application layer is saved, further reducing the development difficulty and the development cycle of the voice assistant function.
Fig. 2 is a structural block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention. Referring to fig. 2, the system further includes:
the I2S standard interface 150 is configured to receive the first analog voice data transmitted by the I2S standard microphone array 110 and transmit the first analog voice data to the voice conversion unit 120, and receive the second analog voice data transmitted by the voice conversion unit 120 and transmit the second analog voice data to the I2S standard player 140.
The I2S standard is a bus standard established for data transmission between digital audio devices; it defines both the hardware interface specification and the format of digital audio data. Using the I2S standard microphone array 110, the I2S standard interface 150, and the I2S standard player 140 therefore shields the differences between different hardware, avoids the large amount of time otherwise spent developing and testing recording and playback, and makes the I2S standard microphone array 110 and the I2S standard player 140 ready to use as soon as they are assembled.
Fig. 3 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention, and referring to fig. 3, the voice processing unit 130 includes:
the voice recognition module 131 is used for recognizing the first digital voice data to obtain the user intention and sending the user intention to the application layer, and the voice recognition module has portability and can enable the voice processing unit to run on a plurality of processor architectures.
The voice recognition module 131 may be a voice recognition engine, and its performance may be tuned for the mainstream central processor architectures so that the voice processing unit 130 can run on them. For example, the voice recognition module 131 can be performance-tuned so that the voice processing unit 130 can run on the Intel x86 architecture, the ARM (originally Acorn RISC Machine) architecture, and the MIPS (Microprocessor without Interlocked Pipelined Stages) architecture.
The performance of the speech recognition module 131 is optimized, and the speech recognition module 131 can support cross-compilation of mainstream speech chips.
Because the voice recognition module 131 can run on various central processing unit architectures and supports cross-compilation for various voice chips, the portability problem of prior-art voice interaction functions is solved.
Fig. 4 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention, and referring to fig. 4, the voice processing unit 130 further includes:
and a speech synthesis module 132, configured to synthesize the second digital speech data according to an execution action result when the execution action result of the execution action corresponding to the user intention is generated by the application layer of the android device.
In the case that the speech recognition module 131 sends the user intention to the application layer of the android device, the application layer executes an execution action corresponding to the user intention and generates an execution action result, and then sends the execution action result to the speech synthesis module 132, and the speech synthesis module 132 generates second digital speech data according to the execution action result.
The application layer of the android device executes the action corresponding to the user intention, and the voice synthesis module 132 synthesizes the second digital voice data from the result of that action, enabling the android device to realize the voice interaction function.
Fig. 5 is a block diagram of a system for quickly implementing a voice interaction function according to an embodiment of the present invention, and referring to fig. 5, the voice processing unit 130 further includes:
a configuration module 133 for configuring parameters of the I2S standard microphone array 110.
The parameters of the I2S standard microphone array 110 are configured by the configuration module 133 so that the I2S standard microphone array 110 can clearly receive the speech information of the user.
Fig. 6 is a block diagram six of a structure of a system for quickly implementing a voice interaction function according to an embodiment of the present invention, please refer to fig. 6, the system further includes:
the JNI standard dynamic link library 160 is configured to receive the first digital voice data sent by the voice conversion unit 120 and send the first digital voice data to the voice processing unit 130.
The JNI standard dynamic link library 160 is a dynamic link library based on the JNI (Java Native Interface) standard. Because a program written against the JNI standard can be ported across different platforms, the voice interaction function also gains a certain degree of portability.
The voice processing unit 130 further includes a communication interface for performing interprocess communication with an application layer of the android device, and the communication interface is divided into an uplink event notification interface and a downlink call interface. The communication interface is mainly used for enabling the voice processing unit 130 to perform interprocess communication with an application layer of the android device, so that the android device can realize a voice interaction function.
The calling interface is used for controlling the voice processing unit 130 to execute an execution action corresponding to the first digital voice data. Specifically, in the case where the speech processing unit 130 is a voice assistant program, the application layer may call the voice assistant program through a call interface to perform operations such as recording, speech recognition, speech synthesis, speech playback, and parameter configuration.
The event notification interface is used for notifying execution results to the application layer of the android device. For example, when the voice processing unit 130 performs voice recognition on the first digital voice data, it may send the recognition result to the application layer through the event notification interface; when the voice processing unit 130 performs voice synthesis, the synthesis result may likewise be sent to the application layer.
In addition, the event notification interface can also notify the application layer of the start and the end of audio playing, hardware and network abnormality and other events.
Because the voice processing unit 130 runs independently on the operating system of the android device and provides a set of standard voice assistant interfaces, application developers need not concern themselves with how the voice capability is implemented; they only need to connect to the standard communication interface of the voice processing unit 130, can focus on developing business functions, and face a lower development difficulty.
The technical features of the embodiments described above may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A system for quickly realizing a voice interaction function, applied to an android device, characterized by comprising:
an I2S standard microphone array for collecting first analog voice data of a user;
a voice conversion unit for converting the first analog voice data into first digital voice data and converting the second digital voice data into second analog voice data;
a voice processing unit for performing full-link voice processing on the first digital voice data to generate second digital voice data corresponding to the first digital voice data, wherein the voice processing unit runs independently on an operating system of the android device, and the full-link voice processing comprises voice recognition, semantic understanding, dialogue management, natural language generation and text-to-speech;
an I2S standard player for playing the second analog voice data.
2. The system of claim 1, further comprising:
an I2S standard interface for receiving the first analog voice data sent by the I2S standard microphone array and sending the first analog voice data to the voice conversion unit, and receiving the second analog voice data sent by the voice conversion unit and sending the second analog voice data to the I2S standard player.
3. The system of claim 1, wherein the speech processing unit comprises:
a voice recognition module for recognizing the first digital voice data to obtain a user intention and sending the user intention to an application layer of the android device, wherein the voice recognition module is portable and enables the voice processing unit to run on a plurality of processor architectures.
4. The system of claim 3, wherein the speech processing unit further comprises:
a voice synthesis module for synthesizing the second digital voice data according to an execution action result, in the case where the application layer of the android device generates the execution action result of the execution action corresponding to the user intention.
5. The system of claim 1, wherein the speech processing unit further comprises:
a configuration module for configuring parameters of the I2S standard microphone array.
6. The system of claim 1, further comprising:
a JNI standard dynamic link library for receiving the first digital voice data sent by the voice conversion unit and sending the first digital voice data to the voice processing unit.
7. The system of claim 1, wherein the speech processing unit further comprises:
a communication interface for performing interprocess communication with the application layer of the android device.
8. The system of claim 7, wherein the communication interface comprises:
a call interface for calling the voice processing unit to execute the execution action corresponding to the first digital voice data.
9. The system of claim 8, wherein the communication interface further comprises:
an event notification interface for notifying the application layer of the android device of an execution result.
10. The system of claim 1, wherein the speech conversion unit is a Tinyalsa audio driver.
CN202010779872.8A 2020-08-05 2020-08-05 System for quickly realizing voice interaction function Pending CN111857646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010779872.8A CN111857646A (en) 2020-08-05 2020-08-05 System for quickly realizing voice interaction function


Publications (1)

Publication Number Publication Date
CN111857646A true CN111857646A (en) 2020-10-30

Family

ID=72971721


Country Status (1)

Country Link
CN (1) CN111857646A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686373A (en) * 2019-02-27 2019-04-26 北京声智科技有限公司 A kind of audio recognition method and system
JP2019194869A (en) * 2015-09-29 2019-11-07 株式会社東芝 Information equipment or information communication terminal and information processing method
CN110600021A (en) * 2019-09-20 2019-12-20 苏州思必驰信息科技有限公司 Outdoor intelligent voice interaction method, device and system
CN110675872A (en) * 2019-09-27 2020-01-10 青岛海信电器股份有限公司 Voice interaction method based on multi-system display equipment and multi-system display equipment
CN111091818A (en) * 2019-12-24 2020-05-01 广东美的白色家电技术创新中心有限公司 Voice recognition circuit, voice interaction equipment and household appliance
CN111277697A (en) * 2020-03-05 2020-06-12 Oppo(重庆)智能科技有限公司 Audio processing method and device, storage medium and electronic equipment
CN111417054A (en) * 2020-03-13 2020-07-14 北京声智科技有限公司 Multi-audio-frequency data channel array generating method and device, electronic equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李志生; 蔡志兵; 孙明; 宋德周; 杨志林: "Intelligent digital and analog intercom technology based on the MTK platform", Electronic Measurement Technology, vol. 41, no. 10, 23 May 2018 (2018-05-23), pages 59-65 *

Similar Documents

Publication Publication Date Title
JP2020529032A (en) Speech recognition translation method and translation device
US8472633B2 (en) Detection of device configuration
CN107018228B (en) Voice control system, voice processing method and terminal equipment
JP2020184298A (en) Speech skill creating method and system
CN102821340B (en) A kind of full acoustic frequency passage dynamic tuning method, device and intelligent terminal thereof
CN112579038B (en) Built-in recording method and device, electronic equipment and storage medium
RU2603127C2 (en) Audio data transmission system, audio data transmission device and electronic signature token
CN104810015A (en) Voice converting device, voice synthesis method and sound box using voice converting device and supporting text storage
CN102427465A (en) Voice service proxy method and device and system for integrating voice application through proxy
CN109712623A (en) Sound control method, device and computer readable storage medium
CN111540370A (en) Audio processing method and device, computer equipment and computer readable storage medium
CN109743618A (en) Method for playing music, terminal and computer readable storage medium
CN109348048A (en) Call message leaving method, terminal and the device with store function
CN111857646A (en) System for quickly realizing voice interaction function
CN101299332B (en) Method for implementing speech synthesis function by GSM mobile phone
US20050129184A1 (en) Automating testing path responses to external systems within a voice response system
CN109743528A (en) A kind of audio collection of video conference and play optimization method, device and medium
AU2011329145A1 (en) Cooperative voice dialog and business logic interpreters for a voice-enabled software application
CN105657632A (en) Audio test method and system and terminal
CN115278456A (en) Sound equipment and audio signal processing method
JP3225682U (en) Voice translation terminal, mobile terminal and translation system
CN112532794B (en) Voice outbound method, system, equipment and storage medium
CN103383844B (en) Phoneme synthesizing method and system
CN115022442B (en) Audio fault time positioning method, electronic equipment and storage medium
CN110855832A (en) Method and device for assisting call and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination