CN112435671B - Intelligent voice control method and system for accurately recognizing Chinese - Google Patents

Intelligent voice control method and system for accurately recognizing Chinese

Info

Publication number
CN112435671B
Authority
CN
China
Prior art keywords
result
segments
text
characters
character
Prior art date
Legal status
Active
Application number
CN202011255685.6A
Other languages
Chinese (zh)
Other versions
CN112435671A (en)
Inventor
林泽森
黄碧亮
Current Assignee
Shenzhen Xiaoshun Intelligent Control Technology Co., Ltd.
Original Assignee
Shenzhen Xiaoshun Intelligent Control Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Xiaoshun Intelligent Control Technology Co., Ltd.
Priority to CN202011255685.6A
Publication of CN112435671A
Application granted
Publication of CN112435671B
Legal status: Active (granted)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application provides an intelligent voice control method for accurate Chinese recognition, applied to a terminal. The method comprises the following steps: the terminal collects voice data of a target object and performs feature extraction on the voice data to obtain input data; the terminal inputs the input data into a first neural network model to perform an operation and obtain a first operation result, and inputs the input data into a second neural network model to perform an operation and obtain a second operation result; the terminal obtains a first text result of the voice data according to the first operation result, obtains a second text result of the voice data according to the second operation result, compares the first text result with the second text result to determine a final text result of the voice data, and generates a control command corresponding to the text result to realize voice control. The technical scheme provided by the application has the advantage of improving the user experience.

Description

Intelligent voice control method and system for accurately recognizing Chinese
Technical Field
The present application relates to the technical field of speech processing, and in particular to an intelligent voice control method and system for accurately recognizing Chinese.
Background
Speech recognition is a well-established field, and as the technology matures it is applied in more and more scenarios. However, the accuracy of existing speech recognition remains low, which degrades the user experience.
Disclosure of Invention
The embodiment of the present application discloses an intelligent voice control method for accurate Chinese recognition, which can improve the accuracy of speech recognition and thereby improve the user experience.
A first aspect of the embodiments of the present application provides an intelligent voice control method for accurate Chinese recognition, applied to a terminal and comprising the following steps:
the terminal collects voice data of a target object and performs feature extraction on the voice data to obtain input data;
the terminal inputs the input data into a first neural network model to perform an operation and obtain a first operation result, and inputs the input data into a second neural network model to perform an operation and obtain a second operation result;
the terminal obtains a first text result of the voice data according to the first operation result, obtains a second text result of the voice data according to the second operation result, compares the first text result with the second text result to determine a final text result of the voice data, and generates a control command corresponding to the text result to realize voice control.
A second aspect of embodiments of the present application provides a terminal comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of the first aspect.
A third aspect of embodiments of the present application discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method of the first aspect.
A fourth aspect of embodiments of the present application discloses a computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps as described in the first aspect of embodiments of the present application. The computer program product may be a software installation package.
Drawings
The drawings used in the embodiments of the present application are described below.
Fig. 1 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an intelligent voice control method for accurate Chinese recognition according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The term "and/or" in this application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates that the objects before and after it are in an "or" relationship.
The term "plurality" in the embodiments of the present application means two or more. Descriptions such as "first" and "second" in the embodiments of the present application are only used to illustrate and distinguish objects; they do not indicate order or particularly limit the number of devices, and do not constitute any limitation on the embodiments of the present application. The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, that implement communication between devices, which is not limited in the embodiments of the present application.
A terminal in the embodiments of the present application may refer to various forms of UE, access terminal, subscriber unit, subscriber station, mobile station, MS (mobile station), remote station, remote terminal, mobile device, user terminal, terminal device (terminal equipment), wireless communication device, user agent, or user equipment. The terminal device may also be a cellular phone, a cordless phone, a SIP (session initiation protocol) phone, a WLL (wireless local loop) station, a PDA (personal digital assistant) with a wireless communication function, a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device, a wearable device, a terminal device in a future 5G network, or a terminal device in a future evolved PLMN (public land mobile network), and the like, which is not limited in this embodiment.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a terminal disclosed in an embodiment of the present application. The terminal 100 includes a storage and processing circuit 110 and a sensor 170 connected to the storage and processing circuit 110, where the sensor 170 may include a camera, a distance sensor, a gravity sensor, and the like. The electronic device may include two transparent display screens, arranged on the back side and the front side of the electronic device; some or all of the components between the two transparent display screens may also be transparent, so that, in terms of visual effect, the electronic device may be a transparent electronic device, and if only some of the components are transparent, a hollowed-out electronic device. Wherein:
the terminal 100 may include control circuitry, which may include storage and processing circuitry 110. The storage and processing circuitry 110 may be a memory, such as a hard drive memory, a non-volatile memory (e.g., flash memory or other electronically programmable read-only memory used to form a solid state drive, etc.), a volatile memory (e.g., static or dynamic random access memory, etc.), etc., and the embodiments of the present application are not limited thereto. Processing circuitry in the storage and processing circuitry 110 may be used to control the operation of the terminal 100. The processing circuitry may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and the like.
The storage and processing circuitry 110 may be used to run software in the terminal 100, such as an Internet browsing application, a Voice Over Internet Protocol (VOIP) telephone call application, an email application, a media playing application, operating system functions, and so forth. Such software may be used to perform control operations such as camera-based image capture, ambient light measurement based on an ambient light sensor, proximity sensor measurement based on a proximity sensor, information display functionality based on status indicators such as status indicator lights of light emitting diodes, touch event detection based on a touch sensor, functionality associated with displaying information on multiple (e.g., layered) display screens, operations associated with performing wireless communication functionality, operations associated with collecting and generating audio signals, control operations associated with collecting and processing button press event data, and other functions in the terminal 100, to name a few, embodiments of the present application are not limited.
The terminal 100 may include an input-output circuit 150. The input-output circuit 150 may be used to enable the terminal 100 to input and output data, i.e., to allow the terminal 100 to receive data from external devices and to output data from the terminal 100 to external devices. The input-output circuit 150 may further include a sensor 170. The sensor 170 may include a vein identification module, and may also include an ambient light sensor, a light- and capacitance-based proximity sensor, a fingerprint identification module, a touch sensor (for example, an optical touch sensor and/or a capacitive touch sensor; the touch sensor may be part of a touch display screen or may be used independently as a touch sensor structure), an acceleration sensor, a camera, and other sensors. The camera may be a front camera or a rear camera, and the fingerprint identification module may be integrated below the display screen to collect fingerprint images; the fingerprint identification module may be, for example, an optical fingerprint module, which is not limited herein. The front camera may be arranged below the front display screen and the rear camera below the rear display screen. Of course, the front camera or the rear camera need not be integrated with the display screen, and in practical applications it may also be of a pop-up structure.
Input-output circuit 150 may also include one or more display screens, and when multiple display screens are provided, such as 2 display screens, one display screen may be provided on the front of the electronic device and another display screen may be provided on the back of the electronic device, such as display screen 130. The display 130 may include one or a combination of liquid crystal display, transparent display, organic light emitting diode display, electronic ink display, plasma display, and display using other display technologies. The display screen 130 may include an array of touch sensors (i.e., the display screen 130 may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by a transparent touch sensor electrode (e.g., an Indium Tin Oxide (ITO) electrode) array, or may be a touch sensor formed using other touch technologies, such as acoustic wave touch, pressure sensitive touch, resistive touch, optical touch, and the like, and the embodiments of the present application are not limited thereto.
The terminal 100 can also include an audio component 140. Audio component 140 may be used to provide audio input and output functionality for terminal 100. The audio components 140 in the terminal 100 may include a speaker, a microphone, a buzzer, a tone generator, and other components for generating and detecting sound.
The communication circuit 120 can be used to provide the terminal 100 with the capability to communicate with external devices. The communication circuit 120 may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuitry in communication circuitry 120 may include radio-frequency transceiver circuitry, power amplifier circuitry, low noise amplifiers, switches, filters, and antennas. For example, the wireless Communication circuitry in the Communication circuitry 120 may include circuitry to support Near Field Communication (NFC) by transmitting and receiving near field coupled electromagnetic signals. For example, the communication circuit 120 may include a near field communication antenna and a near field communication transceiver. The communications circuitry 120 may also include a cellular telephone transceiver and antenna, a wireless local area network transceiver circuitry and antenna, and so forth.
The terminal 100 may further include a battery, a power management circuit, and other input-output units 160. The input-output unit 160 may include buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light emitting diodes and other status indicators, and the like.
A user may input commands through input-output circuitry 150 to control operation of terminal 100 and may use output data of input-output circuitry 150 to enable receipt of status information and other outputs from terminal 100.
Referring to fig. 2, fig. 2 provides an intelligent voice control method for accurate Chinese recognition. The method may be executed by a terminal and, as shown in fig. 2, includes the following steps:
step S201, a terminal collects voice data of a target object, and performs feature extraction on the voice data to obtain input data;
step S202, the terminal inputs the input data into a first neural network model to perform operation to obtain a first operation result, and inputs the input data into a second neural network model to perform operation to obtain a second operation result;
the first neural network model and the second neural network model may be different network models, that is, 2 operation results are obtained by respectively identifying two different models.
Inputting the input data into the first neural network model to perform the operation and obtain the operation result may specifically include:
optimizing the input data to obtain input optimization data; extracting weight data corresponding to the neural network model and optimizing the weight data to obtain weight optimization data; extracting, according to the addresses in the calculation instruction, the input-optimization-data element values and the weight-optimization-data element values at the corresponding addresses; and performing an operation on the input-optimization-data element values and the weight-optimization-data element values to obtain part or all of the element values in the operation result of the neural network model operation.
The optimization processing of the input data to obtain the input optimization data may specifically include:
grouping every 4 consecutive element values of the input data into one group; in each group, the head bit of the first element value (namely, its 1st bit) is set to 1, and the remaining 3 bits indicate, for each of the other element values, whether it is the same as the first element value; if so, the corresponding bit is set to 1 and the stored data of that duplicate element is deleted. All elements of the input data are traversed in this way to obtain the input optimization data.
Similarly, the weight data can also be optimized in this way to obtain the weight optimization data.
This scheme slims down the element values (that is, some identical element values can be deleted). For example, if the 4 element values in a group are 2, 1, 2 and 1, the first 4 bits of the first element value are set to 1010: the 1st bit is 1 to indicate that this element value opens a group, and the 3rd bit is 1 because the third element value equals the first. Since each element value is 32 bits, the probability that the first 4 bits are needed to represent a value of this magnitude is almost zero; by repurposing the first 4 bits in this way, the technical scheme of the present application reduces the storage space of the group of elements, thereby reducing storage overhead and IO overhead (a sketch follows below).
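The grouping scheme can be illustrated with a small sketch, assuming 32-bit unsigned elements and taking "the first 4 bits" to mean the 4 most significant bits (the patent does not fix the bit order, so this layout is an assumption):

```python
# Hedged sketch of the optimization: pack groups of 4 values, drop duplicates of
# each group's first value, and record them in the first value's top 4 bits.
# Bit 31 = group marker; bits 30-28 = "same as first value" flags for elements 2-4.

def optimize(values):
    out = []
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        first = group[0]
        header = 0b1000            # 1st bit set to 1: this element opens a group
        kept = []
        for j, v in enumerate(group[1:], start=1):
            if v == first:
                header |= 0b1000 >> j   # flag the duplicate and delete its storage
            else:
                kept.append(v)
        out.append((header << 28) | first)  # repurpose the top 4 bits as the header
        out.extend(kept)
    return out

# The example from the text: optimize([2, 1, 2, 1]) stores only three values,
# with the first value's top 4 bits set to 1010.
```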
In an optional scheme, the calculation instruction may specifically be: MABY TYPE, M, N, a, GA, X, Y, b, GB; where MABY is the identifier of the operation instruction (specifically, it may be matrix-times-matrix), TYPE is the data type supported by the operation instruction (here, the optimized data type supported by the scheme of the present application), M is the row count of matrix A (namely the input optimization data), N is the column count of matrix A, a is the first address of matrix A, GA is the number of element groups extracted from matrix A in each operation, X is the row count of matrix B (namely the weight optimization data), Y is the column count of matrix B, b is the first address of matrix B, and GB is the number of element groups extracted from matrix B in each operation. An illustrative field layout is sketched below.
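For illustration only, the instruction fields could be modeled as a plain record (field names mirror the text; this representation is an assumption, not a defined instruction encoding):

```python
from dataclasses import dataclass

@dataclass
class MabyInstruction:
    # MABY TYPE, M, N, a, GA, X, Y, b, GB
    op: str      # MABY: operation identifier, e.g. matrix-times-matrix
    dtype: str   # TYPE: supported data type (here, the optimized data type)
    m: int       # M: row count of matrix A (the input optimization data)
    n: int       # N: column count of matrix A
    a_addr: int  # a: first address of matrix A
    ga: int      # GA: number of element groups fetched from A per operation
    x: int       # X: row count of matrix B (the weight optimization data)
    y: int       # Y: column count of matrix B
    b_addr: int  # b: first address of matrix B
    gb: int      # GB: number of element groups fetched from B per operation
```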
Extracting, according to the addresses of the calculation instruction, the input-optimization-data element values and the weight-optimization-data element values at the corresponding addresses, and performing an operation on them to obtain part or all of the element values in the operation result of the neural network model operation, may specifically include:
extracting GA element-value groups from the input optimization data and GB element-value groups from the weight optimization data according to the calculation instruction; performing inverse optimization on each element-value group in the GA groups according to its first 4 bits to obtain the original input element values; performing inverse optimization on each element-value group in the GB groups according to its first 4 bits to obtain the original weight element values; and performing the operation corresponding to MABY on the original input element values and the original weight element values to obtain part or all of the element values in the operation result of the neural network model operation.
The operation corresponding to MABY may be a multiplication operation, or of course an inner-product operation or a convolution operation; the specific operation is not limited in this application, and the identifiers of the calculation instructions corresponding to different operations may of course differ.
The inverse optimization may specifically operate as follows: if any of the last 3 bits among the first 4 bits of an element-value group is 1, the first element value of the group is copied into the position corresponding to that 1, and then the first 4 bits are all set to 0, which completes the inverse optimization and recovers the original element values.
For example, if the 4 bits are 1010, the first element value is copied after the second element of the group, i.e. inserted at the 3rd element position of the group, which recovers the original element values (a matching sketch follows below).
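A matching sketch of the inverse optimization, under the same assumed 32-bit layout as the packing sketch above (complete groups of 4 are assumed):

```python
# Hedged sketch: read each group header from the top 4 bits, copy the first value
# into every flagged position, set the header bits to 0, and emit the original group.

def inverse_optimize(packed):
    out = []
    i = 0
    while i < len(packed):
        header = (packed[i] >> 28) & 0b1111   # the first 4 bits of the group
        first = packed[i] & ((1 << 28) - 1)   # header bits cleared: original value
        i += 1
        group = [first]
        for bit in (0b0100, 0b0010, 0b0001):  # duplicate flags for elements 2, 3, 4
            if header & bit:
                group.append(first)           # copy the first value into this position
            else:
                group.append(packed[i])       # next stored, non-duplicate value
                i += 1
        out.extend(group)
    return out

# Round trip on the example: inverse_optimize(optimize([2, 1, 2, 1])) == [2, 1, 2, 1]
```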
Step S203, the terminal obtains a first text result of the voice data according to the first operation result, obtains a second text result of the voice data according to the second operation result, compares the first text result with the second text result to determine a final text result of the voice data, and generates a control command corresponding to the text result to realize voice control.
According to the above technical scheme, the terminal collects voice data of a target object and performs feature extraction on the voice data to obtain input data; the terminal inputs the input data into the first neural network model to obtain a first operation result and into the second neural network model to obtain a second operation result, obtains a first text result and a second text result of the voice data from these operation results, compares the two text results to determine the final text result of the voice data, and generates the control command corresponding to that text result to realize voice control. The control command is therefore generated from a Chinese recognition result that combines two recognition results, which improves the accuracy of speech recognition; the scheme thus has the advantages of accurate recognition, more accurate voice control, and a better user experience.
Comparing the first text result with the second text result to determine the final text result of the voice data may specifically include:
segmenting the first text result according to punctuation marks to obtain n segments, and segmenting the second text result according to punctuation marks to obtain n segments; determining the segments whose text content is completely the same in the two sets of n segments as segments of the final text result; filtering the segments whose text content differs to obtain filtered segments; and adding the filtered segments to the segments of the final text result to obtain the complete final text result.
In most cases the first and second text results are identical, i.e., the content of most of the n segments is the same; the identical segments can be directly determined as segments of the final text result, so the main work lies in processing the differing segments, namely the corresponding filtering (sketched below).
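A hedged sketch of this comparison step (the punctuation set, the re-joining of segments, and the `filter_segments` hook are illustrative assumptions; the patent only specifies splitting on punctuation and accepting identical segments):

```python
import re

# Split both recognized texts into n segments on punctuation, accept segments that
# match exactly, and route differing segments through the filtering step.

PUNCT = r"[，。！？；,.!?;]"   # assumed punctuation set

def merge_results(text1, text2, filter_segments):
    segs1 = [s for s in re.split(PUNCT, text1) if s]
    segs2 = [s for s in re.split(PUNCT, text2) if s]
    identical = [s1 for s1, s2 in zip(segs1, segs2) if s1 == s2]
    final = []
    for s1, s2 in zip(segs1, segs2):
        if s1 == s2:
            final.append(s1)   # identical text content: directly a final segment
        else:
            final.append(filter_segments(s1, s2, identical))  # differing: filter
    return "，".join(final)    # simplified: original punctuation is not restored
```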
In an optional scheme, filtering the segments with different text contents among the n segments to obtain the filtered segments may specifically include:
obtaining the y1 characters of the x1-th segment of the first text result and the y2 characters of the x1-th segment of the second text result; aligning the y1 characters with the y2 characters head-to-tail with the punctuation marks as the reference; determining the characters that are identical after alignment as part of the content of the filtered x1-th segment; comparing the characters that differ after alignment with the content of the segments that are completely identical, so as to determine a first repetition count and a second repetition count; if the first repetition count is greater than the second repetition count, determining the differing characters among the y1 characters as the remaining content of the filtered x1-th segment; and if the second repetition count is greater than the first repetition count, determining the differing characters among the y2 characters as the remaining content of the filtered x1-th segment.
For example, the y1 characters may be "I am Li" and the y2 characters "I am your sister". The aligned identical characters "I am" are determined to belong to part of the content of the filtered x1-th segment, and "Li" and "your sister" are then each compared with the content of the remaining segments to determine which content is kept. Because the voice data is collected within one time period, its content is correlated: a reading that is inconsistent between the two results generally has a high probability of appearing in other segments, and a reading that does appear in other segments was recognized with high accuracy, so choosing by repetition count improves the accuracy of Chinese recognition (a sketch of this rule follows below).
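The filtering rule itself might be sketched as follows; the head-to-tail alignment is implemented here as a common prefix/suffix match, and ties fall back to the first result, both of which are assumptions where the text is not specific:

```python
# Hedged sketch: keep the aligned identical characters, then resolve the differing
# middle by counting how often each reading's characters recur in the segments the
# two models agreed on (speech collected in one time period is correlated).

def filter_segments(seg1, seg2, identical_segments):
    context = "".join(identical_segments)
    # Head-to-tail alignment: common prefix and common suffix are kept directly.
    p = 0
    while p < min(len(seg1), len(seg2)) and seg1[p] == seg2[p]:
        p += 1
    s = 0
    while (s < min(len(seg1), len(seg2)) - p
           and seg1[len(seg1) - 1 - s] == seg2[len(seg2) - 1 - s]):
        s += 1
    diff1, diff2 = seg1[p:len(seg1) - s], seg2[p:len(seg2) - s]
    # First and second repetition counts against the agreed content.
    count1 = sum(context.count(ch) for ch in diff1)
    count2 = sum(context.count(ch) for ch in diff2)
    middle = diff1 if count1 >= count2 else diff2
    return seg1[:p] + middle + seg1[len(seg1) - s:]
```

For the example above, the common prefix "I am" is kept, and "Li" versus "your sister" is decided by whichever recurs more often in the agreed segments.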
The present application also provides an intelligent voice control system for accurate Chinese recognition, the system comprising:
an acquisition unit, used for collecting voice data of a target object and performing feature extraction on the voice data to obtain input data; and
a recognition operation unit, used for inputting the input data into a first neural network model to perform an operation and obtain a first operation result, and inputting the input data into a second neural network model to perform an operation and obtain a second operation result; and for obtaining a first text result of the voice data according to the first operation result, obtaining a second text result of the voice data according to the second operation result, comparing the first text result with the second text result to determine a final text result of the voice data, and generating a control command corresponding to the text result to realize voice control.
An embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal, the method flow shown in fig. 2 is implemented.
Embodiments of the present application also provide a terminal including a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of the embodiment shown in fig. 2.
The above description has introduced the solution of the embodiments of the present application mainly from the perspective of the method-side implementation process. It can be appreciated that, in order to carry out the functions described above, the electronic device may comprise corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments provided herein can be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned methods of the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other various media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory; the memory may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (3)

1. An intelligent voice control method for accurate Chinese recognition, characterized in that the method is applied to a terminal and comprises the following steps:
the terminal collects voice data of a target object and performs feature extraction on the voice data to obtain input data;
the terminal inputs the input data into a first neural network model to perform an operation and obtain a first operation result, and inputs the input data into a second neural network model to perform an operation and obtain a second operation result;
the terminal obtains a first text result of the voice data according to the first operation result, obtains a second text result of the voice data according to the second operation result, compares the first text result with the second text result to determine a final text result of the voice data, and generates a control command corresponding to the text result to realize voice control; comparing the first text result with the second text result to determine the final text result of the voice data specifically includes:
segmenting the first text result according to punctuation marks to obtain n segments, and segmenting the second text result according to punctuation marks to obtain n segments; determining the segments whose text content is completely the same as segments of the final text result; filtering the segments whose text content differs to obtain filtered segments; and adding the filtered segments to the segments of the final text result to obtain the complete final text result; filtering the segments with different text contents among the n segments to obtain the filtered segments specifically includes:
obtaining the y1 characters of the x1-th segment of the first text result and the y2 characters of the x1-th segment of the second text result; aligning the y1 characters with the y2 characters head-to-tail with the punctuation marks as the reference; determining the characters that are identical after alignment as part of the content of the filtered x1-th segment; comparing the characters that differ after alignment with the content of the segments that are completely identical, so as to determine a first repetition count and a second repetition count; if the first repetition count is greater than the second repetition count, determining the differing characters among the y1 characters as the remaining content of the filtered x1-th segment; and if the second repetition count is greater than the first repetition count, determining the differing characters among the y2 characters as the remaining content of the filtered x1-th segment.
2. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to claim 1.
3. An intelligent voice control system for accurate Chinese recognition, characterized in that the system comprises:
an acquisition unit, used for collecting voice data of a target object and performing feature extraction on the voice data to obtain input data; and
a recognition operation unit, used for inputting the input data into a first neural network model to perform an operation and obtain a first operation result, and inputting the input data into a second neural network model to perform an operation and obtain a second operation result; obtaining a first text result of the voice data according to the first operation result; obtaining a second text result of the voice data according to the second operation result; comparing the first text result with the second text result to determine a final text result of the voice data; and generating a control command corresponding to the text result to realize voice control;
the recognition operation unit is specifically used for segmenting the first text result according to punctuation marks to obtain n segments, segmenting the second text result according to punctuation marks to obtain n segments, determining the segments whose text content is completely the same as segments of the final text result, filtering the segments whose text content differs to obtain filtered segments, and adding the filtered segments to the segments of the final text result to obtain the complete final text result; filtering the segments with different text contents among the n segments to obtain the filtered segments specifically includes:
obtaining the y1 characters of the x1-th segment of the first text result and the y2 characters of the x1-th segment of the second text result; aligning the y1 characters with the y2 characters head-to-tail with the punctuation marks as the reference; determining the characters that are identical after alignment as part of the content of the filtered x1-th segment; comparing the characters that differ after alignment with the content of the segments that are completely identical, so as to determine a first repetition count and a second repetition count; if the first repetition count is greater than the second repetition count, determining the differing characters among the y1 characters as the remaining content of the filtered x1-th segment; and if the second repetition count is greater than the first repetition count, determining the differing characters among the y2 characters as the remaining content of the filtered x1-th segment.
CN202011255685.6A 2020-11-11 2020-11-11 Intelligent voice control method and system for accurately recognizing Chinese Active CN112435671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011255685.6A CN112435671B (en) 2020-11-11 2020-11-11 Intelligent voice control method and system for accurately recognizing Chinese


Publications (2)

Publication Number Publication Date
CN112435671A CN112435671A (en) 2021-03-02
CN112435671B true CN112435671B (en) 2021-06-29

Family

ID=74700930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011255685.6A Active CN112435671B (en) 2020-11-11 2020-11-11 Intelligent voice control method and system for accurately recognizing Chinese

Country Status (1)

Country Link
CN (1) CN112435671B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593566A (en) * 2021-06-08 2021-11-02 深圳双猴科技有限公司 Voice recognition processing method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366882B1 (en) * 1997-03-27 2002-04-02 Speech Machines, Plc Apparatus for converting speech to text
JP2015075706A (en) * 2013-10-10 2015-04-20 日本放送協会 Error correction model learning device and program
CN104823235A (en) * 2013-11-29 2015-08-05 三菱电机株式会社 Speech recognition device
CN107680585A (en) * 2017-08-23 2018-02-09 海信集团有限公司 A kind of Chinese word cutting method, Chinese word segmentation device and terminal
CN207006449U (en) * 2017-07-03 2018-02-13 广东广顺电器科技有限公司 A kind of VMC exhaust apparatus of adjustable wind direction
CN109753636A (en) * 2017-11-01 2019-05-14 阿里巴巴集团控股有限公司 Machine processing and text error correction method and device calculate equipment and storage medium
CN110288995A (en) * 2019-07-19 2019-09-27 出门问问(苏州)信息科技有限公司 Exchange method, device, storage medium and electronic equipment based on speech recognition
CN110956959A (en) * 2019-11-25 2020-04-03 科大讯飞股份有限公司 Speech recognition error correction method, related device and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8688454B2 (en) * 2011-07-06 2014-04-01 Sri International Method and apparatus for adapting a language model in response to error correction
KR101700099B1 (en) * 2016-10-11 2017-01-31 미디어젠(주) Hybrid speech recognition Composite Performance Auto Evaluation system
WO2020153736A1 (en) * 2019-01-23 2020-07-30 Samsung Electronics Co., Ltd. Method and device for speech recognition
CN111696524B (en) * 2020-04-21 2023-02-14 厦门快商通科技股份有限公司 Character-overlapping voice recognition method and system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Reconstruction-error-based learning for continuous emotion recognition in speech; Jing Han, Zixing Zhang, et al.; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE; 2017-06-19; pp. 2367-2371 *
A domain-adaptive smoothing algorithm for Chinese N-gram language models (一种适应域的汉语N-gram语言模型平滑算法); Jiang Minghu et al.; Journal of Tsinghua University (Science and Technology) (清华大学学报(自然科学版)); 1999-09-10; Vol. 39, No. 9; pp. 99-102 *
Research on statistical language models in speech recognition (语音识别中的统计语言模型研究); Hui Yilong et al.; Information Technology (信息技术); 2017-01-25; No. 1; pp. 44-46, 51 *

Also Published As

Publication number Publication date
CN112435671A (en) 2021-03-02

Similar Documents

Publication Publication Date Title
WO2020155763A1 (en) Ocr recognition method and electronic device thereof
CN109241859B (en) Fingerprint identification method and related product
CN107145571B (en) Searching method and device
CN109583356B (en) Fingerprint identification method and related product
CN106096361A (en) A kind of unlocked by fingerprint method and mobile terminal
CN109614865B (en) Fingerprint identification method and related product
CN109543570B (en) Fingerprint identification method and related product
CN101539822A (en) Method and system for identifying handwriting area of touch screen and touch screen equipment
CN106022071A (en) Fingerprint unlocking method and terminal
CN112329926A (en) Quality improvement method and system for intelligent robot
CN109753202B (en) Screen capturing method and mobile terminal
CN108919977A (en) Data inputting method, terminal and computer readable storage medium
CN105868598A (en) Method and terminal for fingerprint unlocking
CN111984884A (en) Non-contact data acquisition method and device for large database
CN112435671B (en) Intelligent voice control method and system for accurately recognizing Chinese
CN110825475A (en) Input method and electronic equipment
CN110796673A (en) Image segmentation method and related product
CN110597957A (en) Text information retrieval method and related device
CN107613109B (en) Input method of mobile terminal, mobile terminal and computer storage medium
CN110244848B (en) Reading control method and related equipment
CN110264184B (en) Payment control method and related product
CN112329927A (en) Recommendation method and system for elevator installers
CN111666227A (en) Page bump protection method and device for memory recovery of operating system
CN108563983B (en) Fingerprint identification method based on touch screen and terminal equipment
WO2020124457A1 (en) Personalized font display method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant