CN111161718A - Voice recognition method, device, equipment, storage medium and air conditioner - Google Patents

Voice recognition method, device, equipment, storage medium and air conditioner

Info

Publication number
CN111161718A
CN111161718A (application CN201811323620.3A)
Authority
CN
China
Prior art keywords
accent
target
information
voice information
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811323620.3A
Other languages
Chinese (zh)
Inventor
刘文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201811323620.3A
Publication of CN111161718A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 2015/088: Word spotting
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Abstract

The application relates to a voice recognition method, device, equipment, storage medium and air conditioner, wherein the method comprises the following steps: acquiring voice information; sending the voice information to a pre-trained target voice recognition model, where the accent recognized by the target voice recognition model matches a target accent, the target accent being the accent corresponding to the geographical position of the equipment; and recognizing the voice information with the target voice recognition model to obtain text information of the voice information. Because the target voice recognition model matches the accent used at that geographical position, it has a high recognition rate for that accent; on this basis, the technical scheme of the application achieves a comparatively ideal recognition rate for dialect accents.

Description

Voice recognition method, device, equipment, storage medium and air conditioner
Technical Field
The application relates to the technical field of human-computer interaction, in particular to a voice recognition method, a voice recognition device, voice recognition equipment, a storage medium and an air conditioner.
Background
With the development of science and technology, the modes of interaction between people and machines have become increasingly diversified; among them, machines that perform human-machine interaction by recognizing human speech are now widely used.
Since a language comprises a large number of dialects, and even within a single dialect each speaker's accent differs, it is difficult for conventional speech recognition technology to achieve an ideal recognition rate on accented speech, particularly in remote areas where dialects are complex.
Disclosure of Invention
To overcome at least some of the problems of the related art, the present application provides a voice recognition method, apparatus, device, storage medium, and air conditioner.
According to a first aspect of the present application, there is provided a speech recognition method comprising:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the outputting of the text information corresponding to the higher of the first matching degree and the second matching degree includes:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the training process of the target accent inference model includes:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or, alternatively,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
According to a second aspect of the present application, there is provided a speech recognition apparatus comprising:
the acquisition module is used for acquiring voice information; the voice information comprises an accent;
the sending module is used for sending the voice information to a pre-trained target voice recognition model; the accent recognized by the target voice recognition model matches a target accent, the target accent being the accent used at the geographical position of the equipment;
and the recognition module is used for recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the identification module comprises:
the recognition unit is used for sending the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model, correspondingly obtaining first text information and second text information of the voice information, respectively; the accent recognized by the target speech recognition model matches a target accent, the target accent being the accent used at the geographical position of the equipment;
the matching unit is used for respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and the output unit is used for outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the output unit includes:
the uploading subunit is used for uploading the voice information to a server when the first matching degree and the second matching degree are both lower than a preset value, so that the server matches an optimal accent inference model according to the voice information;
a downloading subunit, configured to obtain the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes an accent map generation module, where the accent map generation module includes:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a pre-trained standard accent inference model to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering phonemes of the dialect accent voice information with the same keywords through a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the method further includes a training module of the target accent inference model, where the training module of the target accent inference model includes:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the second obtaining unit includes:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or, alternatively,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
Optionally, the first obtaining subunit is specifically configured to obtain a geographic location of a neighboring mobile terminal connected to the device as the target geographic location.
According to a third aspect of the present application, there is provided an apparatus for speech recognition, comprising:
a processor, and a memory coupled to the processor;
the memory is for storing a computer program for performing at least the following speech recognition method:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the outputting of the text information corresponding to the higher of the first matching degree and the second matching degree includes:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the training process of the target accent inference model includes:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or, alternatively,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
The processor is used for calling and executing the computer program in the memory.
According to a fourth aspect of the present application, there is provided an air conditioner comprising a speech recognition device as described in the third aspect of the present application.
According to a fifth aspect of the present application, there is provided a storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to the first aspect of the present application.
The technical scheme provided by the application can have the following beneficial effects. After voice information is obtained, it is sent to a pre-trained target voice recognition model, which recognizes the voice information to obtain its text information; the voice information includes an accent. Because the accent recognized by the target voice recognition model matches the accent used at the geographical position where the equipment is located, when a dialect accent within that geographical range is used for human-machine interaction, the target voice recognition model matches the accent in use and therefore has a high recognition rate for the dialect accent. Accordingly, the technical scheme achieves a comparatively ideal recognition rate for dialect accents.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present application.
Fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a speech recognition method according to an embodiment of the present application.
As shown in fig. 1, the speech recognition method provided in this embodiment includes:
step 11, acquiring voice information; the voice information includes an accent;
step 12, sending the voice information to a target voice recognition model obtained by pre-training; matching the accent identified by the target speech recognition model with the target accent, wherein the target accent is the accent used by the geographical position of the equipment;
and step 13, recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
After the voice information is obtained, it is sent to a pre-trained target voice recognition model, which recognizes the voice information to obtain its keywords. The accent recognized by the target voice recognition model matches the target accent, and the target accent is the accent used at the geographical position of the equipment. Therefore, when a dialect accent within the geographical range of the equipment is used for human-machine interaction, the target voice recognition model matches the accent used at that position, so its recognition rate for the dialect accent is high.
The text information may be computer language text information, or text information or keywords in any language.
The target speech recognition model may include a standard accent inference model and a target accent inference model. The standard accent inference model is a speech recognition model trained for the sample based on speech information of Mandarin, and the target accent inference model is a speech recognition model trained for the sample based on speech information of dialect accent.
In step 13, the process of recognizing the speech information by the target speech recognition model to obtain the keyword may include the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining a first keyword and a second keyword of the voice information correspondingly; matching the accent identified by the target speech recognition model with the target accent, wherein the target accent is the accent used by the geographical position of the equipment;
respectively determining a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and outputting the keyword corresponding to the higher of the first matching degree and the second matching degree.
The following describes the process of recognizing voice information with the target voice recognition model to obtain a keyword, taking as an example the voice information "turn on the air conditioner" spoken in the accent used at the device's geographical location.
After receiving the speech information, the pre-trained standard accent inference model recognizes a first keyword. Since the speech information in this example carries an accent used at the device's geographical location, which differs from Mandarin, the matching degree between the first keyword and the speech information during recognition may be only 60%; this is recorded as the first matching degree.
After receiving the voice information, the pre-trained target accent inference model recognizes a second keyword. Since the target accent inference model matches the accent used at the device's geographical location, the matching degree between the second keyword and the voice information may reach 90%; this is recorded as the second matching degree.
As can be seen from the above, since the second matching degree is higher than the first, the second keyword corresponding to it is output as the final keyword.
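The dual-model selection described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: both model functions are hypothetical stand-ins that return a fixed (text, confidence) pair, where a real system would run two trained recognizers on the audio.

```python
def standard_accent_model(audio):
    # Placeholder: a Mandarin-trained recognizer returning (text, match degree).
    return "da kai kong tiao", 0.60

def target_accent_model(audio):
    # Placeholder: a dialect-trained recognizer matched to the device's region.
    return "打开空调", 0.90

def recognize(audio):
    """Run both models and return the hypothesis with the higher matching degree."""
    first_text, first_score = standard_accent_model(audio)
    second_text, second_score = target_accent_model(audio)
    return first_text if first_score >= second_score else second_text

print(recognize(b"...pcm bytes..."))  # here the dialect model's 90% wins
```

Running both models on the same utterance and keeping the higher-scoring hypothesis is what lets the scheme handle both Mandarin and dialect speakers without first classifying the speaker.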
Due to the popularization of Mandarin, Mandarin is increasingly used in households. Therefore, recognizing the speech information with the standard accent inference model and the target accent inference model simultaneously, and outputting the keyword with the higher matching degree, can effectively speed up recognition.
Further, the step of outputting the keyword corresponding to the higher of the first matching degree and the second matching degree may include:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches the optimal accent inference model according to the voice information;
acquiring an optimal accent inference model from a server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
When the first matching degree and the second matching degree are both lower than a preset value, neither the standard accent inference model nor the target accent inference model is the best-matching accent inference model. In this case, the voice information can be uploaded to a server, the server matches the optimal accent inference model, and the voice information is then sent to that model; the optimal accent inference model recognizes the voice information to obtain its keywords.
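The server fallback just described can be sketched as below. All names are assumptions for illustration: `THRESHOLD` stands in for the patent's unspecified preset value, and `fetch_optimal_model` stubs out the upload/match/download round-trip with the server.

```python
THRESHOLD = 0.75  # assumed stand-in for the preset value

def fetch_optimal_model(audio):
    # Placeholder for the server round-trip: the server inspects the uploaded
    # audio and returns the accent inference model that matches it best.
    return lambda a: ("打开空调", 0.95)

def recognize_with_fallback(audio, local_results):
    """local_results: list of (text, matching_degree) from the local models."""
    best_text, best_score = max(local_results, key=lambda r: r[1])
    if best_score >= THRESHOLD:
        return best_text  # a local model is confident enough
    optimal_model = fetch_optimal_model(audio)  # upload audio, download model
    text, _ = optimal_model(audio)
    return text

# Both local matching degrees below the preset value -> server model is used.
print(recognize_with_fallback(b"...", [("guess a", 0.55), ("guess b", 0.60)]))
```

Only the low-confidence case pays the network cost, so the common case stays fully on-device.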
In modern households, dialect accents from different regions are often used side by side, so a target accent inference model matched only to the accent of the device's geographical position is no longer suitable for dialect accents from other regions. The steps above effectively address this problem.
The above models are trained in advance. The following describes model training, taking the training process of the target accent inference model as an example.
The training process of the target accent inference model may be:
acquiring target accent information and text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
The target accent information and its corresponding text information may be obtained from a pre-generated accent map, or collected during the formation of the accent map. The target accent information may be speech information in the dialect accent of a certain geographic location, and the text information may indicate the keywords or content of that speech information. Establishing a deep learning model is prior art, and its details are not repeated in this application.
In addition, the embodiment may further include a step of generating an accent map, where the step may include:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
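The clustering step above can be illustrated with a toy sketch. The data shapes are assumptions: each sample is a (phoneme feature vector, (latitude, longitude)) pair for the same keyword, and a simple greedy distance threshold stands in for whatever clustering algorithm a real system would use. Samples whose phoneme features are close fall into one cluster, and the geographic positions collected in each cluster delimit one dialect region of the accent map.

```python
import math

def cluster_by_phonemes(samples, threshold=1.0):
    """Greedy single-pass clustering on phoneme-feature distance (illustrative)."""
    clusters = []  # each cluster: {"centroid": feature vec, "locations": [...]}
    for features, location in samples:
        for cluster in clusters:
            if math.dist(features, cluster["centroid"]) <= threshold:
                cluster["locations"].append(location)
                break
        else:
            # No existing cluster is close enough: start a new dialect group.
            clusters.append({"centroid": features, "locations": [location]})
    return clusters

samples = [
    ([0.1, 0.2], (22.3, 113.5)),  # dialect A speakers (e.g. near Zhuhai)
    ([0.2, 0.1], (22.4, 113.6)),
    ([5.0, 5.1], (30.5, 114.3)),  # dialect B speakers
]
regions = cluster_by_phonemes(samples)
print(len(regions))  # 2 clusters, each one bounding a dialect region
```

The dialect boundary then follows from the geographic spread of each cluster's `locations`; each bounded region is associated with its own target accent inference model.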
It should be noted that a user who provides standard accent information and target accent information for the same keyword is typically one who has mastered multiple dialects or languages, and the user provides both recordings voluntarily. This avoids spending substantial manpower on actively collecting accent information, saving labor and material resources.
Furthermore, the target geographic position may be obtained in various ways: the geographic position of the device may be obtained directly, or the user may input a target geographic position. When the user is not a local resident, the geographic position corresponding to the user's dialect accent often differs from that of the device; in that case the user can directly input the dialect's place of origin as the target geographic position, which improves the accuracy of the data in the accent map.
In addition, since some devices do not have the function of positioning, the target geographic location can be determined by the geographic location of the neighboring mobile device to which the device is connected.
In addition, when the steps of the method are performed for the first time, the target speech recognition model preset in step 12 may be obtained by the device from the server according to the geographical location where the device is located, or may be obtained by the user directly sending a request to the server through the device.
The accent map may include a dialect boundary and a target accent inference model corresponding to an area included after the dialect boundary is formed.
It should be noted that, to ensure the accuracy of the voice clustering, users are allowed to report inaccurate dialect pronunciations or ill-fitting dialect adaptations to an administrator by telephone, mail, or other manual channels, and the administrator removes the incorrect dialect speech through manual intervention.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech recognition device according to a second embodiment of the present application.
As shown in fig. 2, the speech recognition apparatus provided in this embodiment includes:
an obtaining module 21, configured to obtain voice information; the voice information includes accent information;
a sending module 22, configured to send the voice information to a target voice recognition model obtained through pre-training; the target voice recognition model is obtained by training according to the accent information corresponding to the geographic position of the equipment;
and the recognition module 23 is configured to recognize the voice information by the target voice recognition model to obtain a keyword of the voice information.
Optionally, the target speech recognition model includes a standard accent inference model and a target accent inference model;
the identification module comprises:
a recognition unit, configured to send the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, a first keyword and a second keyword of the voice information, the accent recognized by the target accent inference model matching a target accent, where the target accent is the accent used at the geographic location of the device;
a matching unit, configured to determine a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and an output unit, configured to output the keyword corresponding to the higher of the first matching degree and the second matching degree.
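By way of a non-limiting illustration, the pipeline formed by the recognition, matching, and output units can be sketched as follows; the two inference callables and the scoring function are placeholders for whatever acoustic models and matching-degree measure an implementation actually uses.

```python
def recognize_with_dual_models(voice, standard_infer, target_infer, score):
    """Run the utterance through both accent models and keep the better result.

    standard_infer / target_infer: callables mapping voice -> keyword text.
    score: callable mapping (keyword, voice) -> a matching degree in [0, 1].
    Returns (keyword, matching_degree) of the better-matching hypothesis.
    """
    first_keyword = standard_infer(voice)
    second_keyword = target_infer(voice)
    first_match = score(first_keyword, voice)
    second_match = score(second_keyword, voice)
    if first_match >= second_match:
        return first_keyword, first_match
    return second_keyword, second_match
```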
Optionally, the apparatus further includes a training module of the target accent inference model, where the training module includes:
the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
Optionally, the first acquisition unit includes:
and the acquisition subunit is used for acquiring the target accent information and the text information corresponding to the target accent information from the pre-generated accent map.
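By way of a non-limiting illustration, the acquisition described above (target accent information paired with its text information, drawn from the accent map) can be sketched as assembling training pairs; the record layout with "region", "accent_audio", and "text" keys is an assumed one.

```python
def build_training_samples(accent_map_records, target_region):
    """Collect (accent_audio, text) training pairs for one dialect region.

    accent_map_records: iterable of dicts with keys "region", "accent_audio",
    and "text", where "text" was recovered by recognizing the paired
    standard-accent utterance. Returns pairs suitable for training the
    target accent inference model of the requested region.
    """
    return [
        (record["accent_audio"], record["text"])
        for record in accent_map_records
        if record["region"] == target_region
    ]
```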
Optionally, the output unit includes:
the uploading subunit is used for uploading the voice information to the server when the first matching degree and the second matching degree are both lower than a preset value, so that the server can match the optimal accent inference model according to the voice information;
the downloading subunit is used for acquiring the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
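By way of a non-limiting illustration, the upload, download, sending, and recognition subunits together form a fallback path taken when neither local model is confident; the sketch below stubs the server round trip as a callable, since the disclosure does not specify a protocol.

```python
def output_keyword(first, second, threshold, fetch_optimal_model, voice):
    """Output a keyword, falling back to a server-matched model if needed.

    first / second: (keyword, matching_degree) pairs from the standard and
    target accent inference models. fetch_optimal_model stands in for the
    upload-to-server plus download-optimal-model round trip: it takes the
    raw voice information and returns an inference callable.
    """
    if first[1] < threshold and second[1] < threshold:
        # Both local models are unconfident: let the server match an
        # optimal accent inference model and recognize with it instead.
        optimal_infer = fetch_optimal_model(voice)
        return optimal_infer(voice)
    # Otherwise output the keyword with the higher matching degree.
    return max((first, second), key=lambda result: result[1])[0]
```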
Optionally, the apparatus further includes an accent map generation module, where the accent map generation module includes:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form the accent map.
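By way of a non-limiting illustration, the grouping step underlying the generating unit can be sketched as follows. Phoneme comparison is reduced here to exact-match grouping, a deliberate simplification of the clustering algorithm the disclosure leaves unspecified; the sample record layout is likewise assumed.

```python
from collections import defaultdict


def cluster_accent_samples(samples):
    """Group dialect utterances of the same keyword by phoneme sequence.

    samples: iterable of dicts with keys "keyword", "phonemes" (a sequence),
    and "location" ((lat, lon)). Utterances whose phoneme sequences match
    exactly are treated as one accent cluster; the set of member locations
    is what a later step would turn into a dialect boundary on the map.
    """
    clusters = defaultdict(list)
    for sample in samples:
        key = (sample["keyword"], tuple(sample["phonemes"]))
        clusters[key].append(sample["location"])
    return dict(clusters)
```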
Optionally, the second obtaining unit includes:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
Optionally, the first obtaining subunit is specifically configured to obtain a geographic location of a neighboring mobile terminal connected to the device as the target geographic location.
Embodiment Three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present application.
As shown in fig. 3, the present application provides a speech recognition apparatus including:
a processor 31, and a memory 32 connected to the processor;
the memory is used for storing a computer program for performing at least the following speech recognition method:
acquiring voice information;
sending the voice information to a target voice recognition model obtained by pre-training; the target voice recognition model is obtained by training according to the accent information corresponding to the geographic position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain the keywords of the voice information.
Optionally, the target speech recognition model includes a standard accent inference model and a target accent inference model;
recognizing, by the target speech recognition model, the voice information to obtain the keyword of the voice information includes:
sending the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, a first keyword and a second keyword of the voice information, the accent recognized by the target accent inference model matching a target accent, where the target accent is the accent used at the geographic location of the device;
determining a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and outputting the keyword corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the training process of the target accent inference model includes:
acquiring target accent information and text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
Optionally, the obtaining of the target accent information and the text information corresponding to the target accent information includes:
and acquiring target accent information and text information corresponding to the target accent information from a pre-generated accent map.
Optionally, outputting the keyword corresponding to the higher of the first matching degree and the second matching degree includes:
when both the first matching degree and the second matching degree are lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
acquiring an optimal accent inference model from a server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
Optionally, the step of generating an accent map includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
the standard accent information is sent to a standard accent inference model obtained through pre-training, and text information of the standard accent information is obtained;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form an accent map.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographic position of the connected adjacent mobile terminal as a target geographic position.
The processor is used to call and execute the computer program in the memory.
In addition, Embodiment Four of the present application provides an air conditioner; the air conditioner in this embodiment includes the speech recognition device of Embodiment Three.
A fifth embodiment of the present application provides a storage medium, where the storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the voice recognition method according to the first embodiment are implemented.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (17)

1. A speech recognition method, comprising:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
2. The method of claim 1, wherein the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
3. The method according to claim 2, wherein outputting the text information corresponding to the higher of the first matching degree and the second matching degree comprises:
when both the first matching degree and the second matching degree are lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
4. The method of claim 2, further comprising:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming a dialect boundary according to the target geographic location corresponding to the dialect accent voice information, so as to form an accent map.
5. The method of claim 4, wherein the training process of the target accent inference model comprises:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
6. The method of claim 4, wherein the obtaining the target geographic location comprises:
acquiring a target geographical position of equipment;
or,
and acquiring the target geographic position input by the user.
7. The method of claim 6, wherein obtaining the target geographic location of the device comprises:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
8. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring voice information; the voice information comprises an accent;
a sending module, configured to send the voice information to a preset target speech recognition model, the accent recognized by the target speech recognition model matching a target accent, where the target accent is the accent used at the geographic location of the device;
and the recognition module is used for recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
9. The apparatus of claim 8, wherein the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the identification module comprises:
a recognition unit, configured to send the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, first text information and second text information of the voice information, the accent recognized by the target speech recognition model matching a target accent, where the target accent is the accent used at the geographic location of the device;
a matching unit, configured to determine a first matching degree between the first text information and the voice information and a second matching degree between the second text information and the voice information;
and an output unit, configured to output the text information corresponding to the higher of the first matching degree and the second matching degree.
10. The apparatus of claim 9, wherein the output unit comprises:
the uploading subunit is used for uploading the voice information to a server when the first matching degree and the second matching degree are both lower than a preset value, so that the server matches an optimal accent inference model according to the voice information;
a downloading subunit, configured to obtain the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
11. The apparatus of claim 9, further comprising an accent map generation module, the accent map generation module comprising:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form an accent map.
12. The apparatus of claim 11, further comprising a training module of a target accent inference model, the training module of the target accent inference model comprising:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
13. The apparatus of claim 11, wherein the second obtaining unit comprises:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
14. The apparatus according to claim 13, wherein the first obtaining subunit is specifically configured to obtain, as the target geographic location, a geographic location of a neighboring mobile terminal connected to the device.
15. An apparatus for speech recognition, comprising:
a processor, and a memory coupled to the processor;
the memory is adapted to store a computer program for performing at least the speech recognition method of any of claims 1-7;
the processor is used for calling and executing the computer program in the memory.
16. An air conditioner characterized by comprising the apparatus for speech recognition according to claim 15.
17. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1-7.
CN201811323620.3A 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner Pending CN111161718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323620.3A CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323620.3A CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Publications (1)

Publication Number Publication Date
CN111161718A true CN111161718A (en) 2020-05-15

Family

ID=70554794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323620.3A Pending CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Country Status (1)

Country Link
CN (1) CN111161718A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933107A (en) * 2020-09-04 2020-11-13 珠海格力电器股份有限公司 Speech recognition method, speech recognition device, storage medium and processor
CN116386603A (en) * 2023-06-01 2023-07-04 蔚来汽车科技(安徽)有限公司 Speech recognition method, device, driving device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236600A1 (en) * 2013-01-29 2014-08-21 Tencent Technology (Shenzhen) Company Limited Method and device for keyword detection
CN104008132A (en) * 2014-05-04 2014-08-27 深圳市北科瑞声科技有限公司 Voice map searching method and system
CN104391673A (en) * 2014-11-20 2015-03-04 百度在线网络技术(北京)有限公司 Voice interaction method and voice interaction device
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN106251865A (en) * 2016-08-04 2016-12-21 华东师范大学 A kind of medical treatment & health record Auto-writing method based on speech recognition
CN107564525A (en) * 2017-10-23 2018-01-09 深圳北鱼信息科技有限公司 Audio recognition method and device
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN108711421A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 A kind of voice recognition acoustic model method for building up and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Ce et al., "Design and Implementation of a Chongqing Dialect Speech Recognition System", Computer Measurement & Control *
HUANG Xiaoping, "Research on Contemporary Machine Deep Learning Methods and Applications", University of Electronic Science and Technology of China Press, 30 November 2017 *

Similar Documents

Publication Publication Date Title
US9430467B2 (en) Mobile speech-to-speech interpretation system
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
US10152965B2 (en) Learning personalized entity pronunciations
CN108288467B (en) Voice recognition method and device and voice recognition engine
CN105895103B (en) Voice recognition method and device
CN108986826A (en) Automatically generate method, electronic device and the readable storage medium storing program for executing of minutes
CN110998720A (en) Voice data processing method and electronic device supporting the same
JP7171532B2 (en) Apparatus and method for recognizing speech, apparatus and method for training speech recognition model
CN106875949B (en) Correction method and device for voice recognition
CN108447471A (en) Audio recognition method and speech recognition equipment
CN103635962A (en) Voice recognition system, recognition dictionary logging system, and audio model identifier series generation device
CN111261162B (en) Speech recognition method, speech recognition apparatus, and storage medium
US20220284882A1 (en) Instantaneous Learning in Text-To-Speech During Dialog
CN110544470B (en) Voice recognition method and device, readable storage medium and electronic equipment
CN111986675A (en) Voice conversation method, device and computer readable storage medium
KR102140391B1 (en) Search method and electronic device using the method
TW201911290A (en) System and method for language based service calls
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN111178081A (en) Semantic recognition method, server, electronic device and computer storage medium
CN111161718A (en) Voice recognition method, device, equipment, storage medium and air conditioner
CN108364655A (en) Method of speech processing, medium, device and computing device
CN108538292A (en) A kind of audio recognition method, device, equipment and readable storage medium storing program for executing
CN110809796B (en) Speech recognition system and method with decoupled wake phrases
CN113724693B (en) Voice judging method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200515)