CN111161718A - Voice recognition method, device, equipment, storage medium and air conditioner - Google Patents

Voice recognition method, device, equipment, storage medium and air conditioner

Info

Publication number
CN111161718A
CN111161718A (application CN201811323620.3A)
Authority
CN
China
Prior art keywords
accent
target
information
voice information
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811323620.3A
Other languages
Chinese (zh)
Inventor
刘文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201811323620.3A
Publication of CN111161718A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 2015/088: Word spotting
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Abstract

The application relates to a voice recognition method, device, equipment, storage medium and air conditioner, wherein the method comprises the following steps: acquiring voice information; sending the voice information to a pre-trained target voice recognition model, where the accent recognized by the target voice recognition model matches a target accent, the target accent being the accent corresponding to the geographical position of the equipment; and recognizing the voice information with the target voice recognition model to obtain text information of the voice information. Because the target voice recognition model matches the accent used at that geographical position, it has a high recognition rate for that accent; on this basis, the technical scheme of the application achieves a comparatively ideal recognition rate for dialect accents.

Description

Voice recognition method, device, equipment, storage medium and air conditioner
Technical Field
The application relates to the technical field of human-computer interaction, in particular to a voice recognition method, a voice recognition device, voice recognition equipment, a storage medium and an air conditioner.
Background
With the development of science and technology, the modes of interaction between people and machines have become increasingly diversified; among them, machines that perform human-machine interaction by recognizing human speech are now widely used.
Since a language comprises a large number of dialects, and even within a single dialect each speaker's accent differs, it is difficult for conventional speech recognition technology to achieve an ideal recognition rate on accented speech, particularly in remote areas where dialects are complex.
Disclosure of Invention
To overcome at least some of the problems of the related art, the present application provides a voice recognition method, apparatus, device, storage medium, and air conditioner.
According to a first aspect of the present application, there is provided a speech recognition method comprising:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the outputting of the text information corresponding to the higher of the first matching degree and the second matching degree includes:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the training process of the target accent inference model includes:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or, alternatively,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
According to a second aspect of the present application, there is provided a speech recognition apparatus comprising:
the acquisition module is used for acquiring voice information; the voice information comprises an accent;
the sending module is used for sending the voice information to a pre-trained target voice recognition model; the accent recognized by the target voice recognition model matches a target accent, the target accent being the accent used at the geographical position of the equipment;
and the recognition module is used for recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the identification module comprises:
the recognition unit is used for sending the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model, correspondingly obtaining first text information and second text information of the voice information, respectively; the accent recognized by the target speech recognition model matches a target accent, the target accent being the accent used at the geographical position of the equipment;
the matching unit is used for respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and the output unit is used for outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the output unit includes:
the uploading subunit is used for uploading the voice information to a server when the first matching degree and the second matching degree are both lower than a preset value, so that the server matches an optimal accent inference model according to the voice information;
a downloading subunit, configured to obtain the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes an accent map generation module, where the accent map generation module includes:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a pre-trained standard accent inference model to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering phonemes of the dialect accent voice information with the same keywords through a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the method further includes a training module of the target accent inference model, where the training module of the target accent inference model includes:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the second obtaining unit includes:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or, alternatively,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
Optionally, the first obtaining subunit is specifically configured to obtain a geographic location of a neighboring mobile terminal connected to the device as the target geographic location.
According to a third aspect of the present application, there is provided an apparatus for speech recognition, comprising:
a processor, and a memory coupled to the processor;
the memory is for storing a computer program for performing at least the following speech recognition method:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
Optionally, the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the outputting of the text information corresponding to the higher of the first matching degree and the second matching degree includes:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
Optionally, the method further includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
Optionally, the training process of the target accent inference model includes:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or, alternatively,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
The processor is used for calling and executing the computer program in the memory.
According to a fourth aspect of the present application, there is provided an air conditioner comprising a speech recognition device as described in the third aspect of the present application.
According to a fifth aspect of the present application, there is provided a storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to the first aspect of the present application.
The technical scheme provided by the application can have the following beneficial effects. After voice information is obtained, it is sent to a pre-trained target voice recognition model, which recognizes the voice information to obtain its text information; the voice information includes an accent. Because the accent recognized by the target voice recognition model matches the accent used at the geographical position where the equipment is located, when a dialect accent within that geographical range is used for human-machine interaction, the target voice recognition model matches the accent in use and therefore has a high recognition rate for the dialect accent. Accordingly, the technical scheme achieves a comparatively ideal recognition rate for dialect accents.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present application.
Fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a speech recognition method according to an embodiment of the present application.
As shown in fig. 1, the speech recognition method provided in this embodiment includes:
step 11, acquiring voice information; the voice information includes an accent;
step 12, sending the voice information to a target voice recognition model obtained by pre-training; matching the accent identified by the target speech recognition model with the target accent, wherein the target accent is the accent used by the geographical position of the equipment;
and step 13, recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
After the voice information is obtained, it is sent to a pre-trained target voice recognition model, which recognizes the voice information to obtain its keywords. The accent recognized by the target voice recognition model matches the target accent, and the target accent is the accent used at the geographical position of the equipment. Therefore, when a dialect accent within the geographical range of the equipment is used for human-machine interaction, the target voice recognition model matches the accent used at that position, so its recognition rate for the dialect accent is high.
The text information may be computer language text information, or text information or keywords in any language.
The target speech recognition model may include a standard accent inference model and a target accent inference model. The standard accent inference model is a speech recognition model trained for the sample based on speech information of Mandarin, and the target accent inference model is a speech recognition model trained for the sample based on speech information of dialect accent.
In step 13, the process of recognizing the speech information by the target speech recognition model to obtain the keyword may include the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining a first keyword and a second keyword of the voice information correspondingly; matching the accent identified by the target speech recognition model with the target accent, wherein the target accent is the accent used by the geographical position of the equipment;
respectively determining a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and outputting the keyword corresponding to the higher of the first matching degree and the second matching degree.
The following describes the process of recognizing voice information with the target voice recognition model to obtain a keyword, taking as an example the voice information "turn on the air conditioner" spoken in the accent used at the device's geographical location.
After receiving the speech information, the pre-trained standard accent inference model recognizes a first keyword. Since the speech information in this example carries an accent used at the device's geographical location, which differs from Mandarin, the matching degree between the first keyword and the speech information during recognition may be only 60%; this is recorded as the first matching degree.
After receiving the voice information, the pre-trained target accent inference model recognizes a second keyword. Since the target accent inference model matches the accent used at the device's geographical location, the matching degree between the second keyword and the voice information may reach 90%; this is recorded as the second matching degree.
As can be seen from the above, since the second matching degree is higher than the first, the second keyword corresponding to it is output as the final keyword.
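The dual-model selection described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: both model functions are hypothetical stand-ins that return a fixed (text, confidence) pair, where a real system would run two trained recognizers on the audio.

```python
def standard_accent_model(audio):
    # Placeholder: a Mandarin-trained recognizer returning (text, match degree).
    return "da kai kong tiao", 0.60

def target_accent_model(audio):
    # Placeholder: a dialect-trained recognizer matched to the device's region.
    return "打开空调", 0.90

def recognize(audio):
    """Run both models and return the hypothesis with the higher matching degree."""
    first_text, first_score = standard_accent_model(audio)
    second_text, second_score = target_accent_model(audio)
    return first_text if first_score >= second_score else second_text

print(recognize(b"...pcm bytes..."))  # here the dialect model's 90% wins
```

Running both models on the same utterance and keeping the higher-scoring hypothesis is what lets the scheme handle both Mandarin and dialect speakers without first classifying the speaker.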
Due to the popularization of Mandarin, Mandarin is increasingly used in households. Therefore, recognizing the speech information with the standard accent inference model and the target accent inference model simultaneously, and outputting the keyword with the higher matching degree, can effectively speed up recognition.
Further, the step of outputting the keyword corresponding to the higher of the first matching degree and the second matching degree may include:
when the first matching degree and the second matching degree are both lower than a preset value, uploading the voice information to a server so that the server matches the optimal accent inference model according to the voice information;
acquiring an optimal accent inference model from a server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
When the first matching degree and the second matching degree are both lower than a preset value, neither the standard accent inference model nor the target accent inference model is the best-matching accent inference model. In this case, the voice information can be uploaded to a server, the server matches the optimal accent inference model, and the voice information is then sent to that model; the optimal accent inference model recognizes the voice information to obtain its keywords.
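The server fallback just described can be sketched as below. All names are assumptions for illustration: `THRESHOLD` stands in for the patent's unspecified preset value, and `fetch_optimal_model` stubs out the upload/match/download round-trip with the server.

```python
THRESHOLD = 0.75  # assumed stand-in for the preset value

def fetch_optimal_model(audio):
    # Placeholder for the server round-trip: the server inspects the uploaded
    # audio and returns the accent inference model that matches it best.
    return lambda a: ("打开空调", 0.95)

def recognize_with_fallback(audio, local_results):
    """local_results: list of (text, matching_degree) from the local models."""
    best_text, best_score = max(local_results, key=lambda r: r[1])
    if best_score >= THRESHOLD:
        return best_text  # a local model is confident enough
    optimal_model = fetch_optimal_model(audio)  # upload audio, download model
    text, _ = optimal_model(audio)
    return text

# Both local matching degrees below the preset value -> server model is used.
print(recognize_with_fallback(b"...", [("guess a", 0.55), ("guess b", 0.60)]))
```

Only the low-confidence case pays the network cost, so the common case stays fully on-device.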
In modern households, dialect accents from different regions are often used side by side, so a target accent inference model matched only to the accent of the device's geographical position is no longer suitable for dialect accents from other regions. The steps above effectively address this problem.
The above models are trained in advance. The following describes model training, taking the training process of the target accent inference model as an example.
The training process of the target accent inference model may be:
acquiring target accent information and text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
The target accent information and its corresponding text information may be obtained from a pre-generated accent map, or collected during the formation of the accent map. The target accent information may be speech information in the dialect accent of a certain geographic location, and the text information may indicate the keywords or content of that speech information. Establishing a deep learning model is prior art, and its details are not repeated in this application.
In addition, the embodiment may further include a step of generating an accent map, where the step may include:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering phonemes of the dialect accent voice information with the same keywords by a clustering algorithm, and forming a dialect boundary according to the target geographic position corresponding to the dialect accent voice information so as to form an accent map.
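The clustering step above can be illustrated with a toy sketch. The data shapes are assumptions: each sample is a (phoneme feature vector, (latitude, longitude)) pair for the same keyword, and a simple greedy distance threshold stands in for whatever clustering algorithm a real system would use. Samples whose phoneme features are close fall into one cluster, and the geographic positions collected in each cluster delimit one dialect region of the accent map.

```python
import math

def cluster_by_phonemes(samples, threshold=1.0):
    """Greedy single-pass clustering on phoneme-feature distance (illustrative)."""
    clusters = []  # each cluster: {"centroid": feature vec, "locations": [...]}
    for features, location in samples:
        for cluster in clusters:
            if math.dist(features, cluster["centroid"]) <= threshold:
                cluster["locations"].append(location)
                break
        else:
            # No existing cluster is close enough: start a new dialect group.
            clusters.append({"centroid": features, "locations": [location]})
    return clusters

samples = [
    ([0.1, 0.2], (22.3, 113.5)),  # dialect A speakers (e.g. near Zhuhai)
    ([0.2, 0.1], (22.4, 113.6)),
    ([5.0, 5.1], (30.5, 114.3)),  # dialect B speakers
]
regions = cluster_by_phonemes(samples)
print(len(regions))  # 2 clusters, each one bounding a dialect region
```

The dialect boundary then follows from the geographic spread of each cluster's `locations`; each bounded region is associated with its own target accent inference model.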
It should be noted that a user who provides standard accent information and target accent information for the same keyword is typically one who has mastered multiple dialects or languages, and the user provides both recordings voluntarily. This avoids spending substantial manpower on actively collecting accent information, saving labor and material resources.
Furthermore, the target geographic position may be obtained in various ways: the geographic position of the device may be obtained directly, or the user may input a target geographic position. When the user is not a local resident, the geographic position corresponding to the user's dialect accent often differs from that of the device; in that case the user can directly input the dialect's place of origin as the target geographic position, which improves the accuracy of the data in the accent map.
In addition, since some devices do not have the function of positioning, the target geographic location can be determined by the geographic location of the neighboring mobile device to which the device is connected.
In addition, when the steps of the method are performed for the first time, the target speech recognition model preset in step 12 may be obtained by the device from the server according to the geographical location where the device is located, or may be obtained by the user directly sending a request to the server through the device.
The accent map may include a dialect boundary and a target accent inference model corresponding to an area included after the dialect boundary is formed.
It should be noted that, to ensure the accuracy of the voice clustering, users are allowed to report inaccurate dialect pronunciations or ill-fitting dialect adaptations to an administrator by telephone, mail, or other manual channels, and the administrator removes the incorrect dialect speech through manual intervention.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech recognition device according to a second embodiment of the present application.
As shown in fig. 2, the speech recognition apparatus provided in this embodiment includes:
an obtaining module 21, configured to obtain voice information; the voice information includes accent information;
a sending module 22, configured to send the voice information to a target voice recognition model obtained through pre-training; the target voice recognition model is obtained by training according to the accent information corresponding to the geographic position of the equipment;
and the recognition module 23 is configured to recognize the voice information by the target voice recognition model to obtain a keyword of the voice information.
Optionally, the target speech recognition model includes a standard accent inference model and a target accent inference model;
the identification module comprises:
a recognition unit, configured to send the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, a first keyword and a second keyword of the voice information, the accent recognized by the target accent inference model matching a target accent, where the target accent is the accent used at the geographic location of the device;
a matching unit, configured to determine a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and an output unit, configured to output the keyword corresponding to the higher of the first matching degree and the second matching degree.
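By way of a non-limiting illustration, the pipeline formed by the recognition, matching, and output units can be sketched as follows; the two inference callables and the scoring function are placeholders for whatever acoustic models and matching-degree measure an implementation actually uses.

```python
def recognize_with_dual_models(voice, standard_infer, target_infer, score):
    """Run the utterance through both accent models and keep the better result.

    standard_infer / target_infer: callables mapping voice -> keyword text.
    score: callable mapping (keyword, voice) -> a matching degree in [0, 1].
    Returns (keyword, matching_degree) of the better-matching hypothesis.
    """
    first_keyword = standard_infer(voice)
    second_keyword = target_infer(voice)
    first_match = score(first_keyword, voice)
    second_match = score(second_keyword, voice)
    if first_match >= second_match:
        return first_keyword, first_match
    return second_keyword, second_match
```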
Optionally, the apparatus further includes a training module of the target accent inference model, where the training module includes:
the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
Optionally, the first acquisition unit includes:
and the acquisition subunit is used for acquiring the target accent information and the text information corresponding to the target accent information from the pre-generated accent map.
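By way of a non-limiting illustration, the acquisition described above (target accent information paired with its text information, drawn from the accent map) can be sketched as assembling training pairs; the record layout with "region", "accent_audio", and "text" keys is an assumed one.

```python
def build_training_samples(accent_map_records, target_region):
    """Collect (accent_audio, text) training pairs for one dialect region.

    accent_map_records: iterable of dicts with keys "region", "accent_audio",
    and "text", where "text" was recovered by recognizing the paired
    standard-accent utterance. Returns pairs suitable for training the
    target accent inference model of the requested region.
    """
    return [
        (record["accent_audio"], record["text"])
        for record in accent_map_records
        if record["region"] == target_region
    ]
```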
Optionally, the output unit includes:
the uploading subunit is used for uploading the voice information to the server when the first matching degree and the second matching degree are both lower than a preset value, so that the server can match the optimal accent inference model according to the voice information;
the downloading subunit is used for acquiring the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
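By way of a non-limiting illustration, the upload, download, sending, and recognition subunits together form a fallback path taken when neither local model is confident; the sketch below stubs the server round trip as a callable, since the disclosure does not specify a protocol.

```python
def output_keyword(first, second, threshold, fetch_optimal_model, voice):
    """Output a keyword, falling back to a server-matched model if needed.

    first / second: (keyword, matching_degree) pairs from the standard and
    target accent inference models. fetch_optimal_model stands in for the
    upload-to-server plus download-optimal-model round trip: it takes the
    raw voice information and returns an inference callable.
    """
    if first[1] < threshold and second[1] < threshold:
        # Both local models are unconfident: let the server match an
        # optimal accent inference model and recognize with it instead.
        optimal_infer = fetch_optimal_model(voice)
        return optimal_infer(voice)
    # Otherwise output the keyword with the higher matching degree.
    return max((first, second), key=lambda result: result[1])[0]
```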
Optionally, the apparatus further includes an accent map generation module, where the accent map generation module includes:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form the accent map.
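By way of a non-limiting illustration, the grouping step underlying the generating unit can be sketched as follows. Phoneme comparison is reduced here to exact-match grouping, a deliberate simplification of the clustering algorithm the disclosure leaves unspecified; the sample record layout is likewise assumed.

```python
from collections import defaultdict


def cluster_accent_samples(samples):
    """Group dialect utterances of the same keyword by phoneme sequence.

    samples: iterable of dicts with keys "keyword", "phonemes" (a sequence),
    and "location" ((lat, lon)). Utterances whose phoneme sequences match
    exactly are treated as one accent cluster; the set of member locations
    is what a later step would turn into a dialect boundary on the map.
    """
    clusters = defaultdict(list)
    for sample in samples:
        key = (sample["keyword"], tuple(sample["phonemes"]))
        clusters[key].append(sample["location"])
    return dict(clusters)
```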
Optionally, the second obtaining unit includes:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
Optionally, the first obtaining subunit is specifically configured to obtain a geographic location of a neighboring mobile terminal connected to the device as the target geographic location.
Embodiment Three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present application.
As shown in fig. 3, the present application provides a speech recognition apparatus including:
a processor 31, and a memory 32 connected to the processor;
the memory is used for storing a computer program for performing at least the following speech recognition method:
acquiring voice information;
sending the voice information to a target voice recognition model obtained by pre-training; the target voice recognition model is obtained by training according to the accent information corresponding to the geographic position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain the keywords of the voice information.
Optionally, the target speech recognition model includes a standard accent inference model and a target accent inference model;
recognizing, by the target speech recognition model, the voice information to obtain the keyword of the voice information includes:
sending the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, a first keyword and a second keyword of the voice information, the accent recognized by the target accent inference model matching a target accent, where the target accent is the accent used at the geographic location of the device;
determining a first matching degree between the first keyword and the voice information and a second matching degree between the second keyword and the voice information;
and outputting the keyword corresponding to the higher of the first matching degree and the second matching degree.
Optionally, the training process of the target accent inference model includes:
acquiring target accent information and text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain a target accent inference model.
Optionally, the obtaining of the target accent information and the text information corresponding to the target accent information includes:
and acquiring target accent information and text information corresponding to the target accent information from a pre-generated accent map.
Optionally, outputting the keyword corresponding to the higher of the first matching degree and the second matching degree includes:
when both the first matching degree and the second matching degree are lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
acquiring an optimal accent inference model from a server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain the keywords of the voice information.
Optionally, the step of generating an accent map includes:
receiving standard accent information and target accent information of the same keyword sent by a user;
the standard accent information is sent to a standard accent inference model obtained through pre-training, and text information of the standard accent information is obtained;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form an accent map.
Optionally, the obtaining the target geographic location includes:
acquiring a target geographical position of equipment;
or,
and acquiring the target geographic position input by the user.
Optionally, the obtaining of the target geographic location of the device includes:
and acquiring the geographic position of the connected adjacent mobile terminal as a target geographic position.
The processor is used to call and execute the computer program in the memory.
In addition, Embodiment Four of the present application provides an air conditioner; the air conditioner in this embodiment includes the speech recognition device of Embodiment Three.
A fifth embodiment of the present application provides a storage medium, where the storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the voice recognition method according to the first embodiment are implemented.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (17)

1. A speech recognition method, comprising:
acquiring voice information; the voice information comprises an accent;
sending the voice information to a preset target voice recognition model; the accent identified by the target speech recognition model is matched with a target accent, and the target accent is an accent used by the geographical position of the equipment;
and recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
2. The method of claim 1, wherein the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the recognizing the voice information by the target voice recognition model to obtain the text information of the voice information comprises the following steps:
sending the voice information to a standard accent inference model obtained by pre-training and a target accent inference model obtained by pre-training, and respectively obtaining first text information and second text information of the voice information correspondingly; the accent identified by the target accent inference model matches a target accent, the target accent being an accent used by the geographic location of the device;
respectively determining a first matching degree of the first text information and the voice information and a second matching degree of the second text information and the voice information;
and outputting the text information corresponding to the higher of the first matching degree and the second matching degree.
3. The method according to claim 2, wherein outputting the text information corresponding to the higher of the first matching degree and the second matching degree comprises:
when both the first matching degree and the second matching degree are lower than a preset value, uploading the voice information to a server so that the server matches an optimal accent inference model according to the voice information;
obtaining the optimal accent inference model from the server;
sending the voice information to the optimal accent inference model;
and recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
4. The method of claim 2, further comprising:
receiving standard accent information and target accent information of the same keyword sent by a user;
sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
setting the text information of the standard accent information as the text information corresponding to the target accent information;
acquiring a target geographic position;
and comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming a dialect boundary according to the target geographic location corresponding to the dialect accent voice information, so as to form an accent map.
5. The method of claim 4, wherein the training process of the target accent inference model comprises:
acquiring the target accent information and the text information corresponding to the target accent information;
and training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
6. The method of claim 4, wherein the obtaining the target geographic location comprises:
acquiring a target geographical position of equipment;
or,
and acquiring the target geographic position input by the user.
7. The method of claim 6, wherein obtaining the target geographic location of the device comprises:
and acquiring the geographical position of the adjacent mobile terminal connected with the equipment as the target geographical position.
8. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring voice information; the voice information comprises an accent;
a sending module, configured to send the voice information to a preset target speech recognition model, the accent recognized by the target speech recognition model matching a target accent, where the target accent is the accent used at the geographic location of the device;
and the recognition module is used for recognizing the voice information by the target voice recognition model to obtain text information of the voice information.
9. The apparatus of claim 8, wherein the target speech recognition model comprises a standard accent inference model and a target accent inference model;
the identification module comprises:
a recognition unit, configured to send the voice information to a pre-trained standard accent inference model and a pre-trained target accent inference model to obtain, respectively, first text information and second text information of the voice information, the accent recognized by the target speech recognition model matching a target accent, where the target accent is the accent used at the geographic location of the device;
a matching unit, configured to determine a first matching degree between the first text information and the voice information and a second matching degree between the second text information and the voice information;
and an output unit, configured to output the text information corresponding to the higher of the first matching degree and the second matching degree.
10. The apparatus of claim 9, wherein the output unit comprises:
the uploading subunit is used for uploading the voice information to a server when the first matching degree and the second matching degree are both lower than a preset value, so that the server matches an optimal accent inference model according to the voice information;
a downloading subunit, configured to obtain the optimal accent inference model from the server;
a sending subunit, configured to send the voice information to the optimal accent inference model;
and the recognition subunit is used for recognizing the voice information by the optimal accent inference model to obtain text information of the voice information.
11. The apparatus of claim 9, further comprising an accent map generation module, the accent map generation module comprising:
the receiving unit is used for receiving standard accent information and target accent information of the same keyword sent by a user;
the recognition unit is used for sending the standard accent information to a standard accent inference model obtained by pre-training to obtain text information of the standard accent information;
the setting unit is used for setting the text information of the standard accent information as the text information corresponding to the target accent information;
the second acquisition unit is used for acquiring the target geographic position;
and the generating unit is used for comparing and clustering, via a clustering algorithm, phonemes of the dialect accent voice information having the same keyword, and forming dialect boundaries according to the target geographic locations corresponding to the dialect accent voice information, so as to form an accent map.
12. The apparatus of claim 11, further comprising a training module of a target accent inference model, the training module of the target accent inference model comprising:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring target accent information and text information corresponding to the target accent information;
and the training unit is used for training a pre-established deep learning model by taking the target accent information and the text information corresponding to the target accent information as training samples to obtain the target accent inference model.
13. The apparatus of claim 11, wherein the second obtaining unit comprises:
the first acquisition subunit is used for acquiring the target geographic position of the equipment;
or,
and the second acquisition subunit is used for acquiring the target geographic position input by the user.
14. The apparatus according to claim 13, wherein the first obtaining subunit is specifically configured to obtain, as the target geographic location, a geographic location of a neighboring mobile terminal connected to the device.
15. An apparatus for speech recognition, comprising:
a processor, and a memory coupled to the processor;
the memory is adapted to store a computer program for performing at least the speech recognition method of any of claims 1-7;
the processor is used for calling and executing the computer program in the memory.
16. An air conditioner characterized by comprising the apparatus for speech recognition according to claim 15.
17. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1-7.
CN201811323620.3A 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner Pending CN111161718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323620.3A CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323620.3A CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Publications (1)

Publication Number Publication Date
CN111161718A true CN111161718A (en) 2020-05-15

Family

ID=70554794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323620.3A Pending CN111161718A (en) 2018-11-07 2018-11-07 Voice recognition method, device, equipment, storage medium and air conditioner

Country Status (1)

Country Link
CN (1) CN111161718A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933107A (en) * 2020-09-04 2020-11-13 珠海格力电器股份有限公司 Speech recognition method, speech recognition device, storage medium and processor
CN116386603A (en) * 2023-06-01 2023-07-04 蔚来汽车科技(安徽)有限公司 Speech recognition method, device, driving device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236600A1 (en) * 2013-01-29 2014-08-21 Tencent Technology (Shenzhen) Company Limited Method and device for keyword detection
CN104008132A (en) * 2014-05-04 2014-08-27 深圳市北科瑞声科技有限公司 Voice map searching method and system
CN104391673A (en) * 2014-11-20 2015-03-04 百度在线网络技术(北京)有限公司 Voice interaction method and voice interaction device
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN106251865A (en) * 2016-08-04 2016-12-21 华东师范大学 A kind of medical treatment & health record Auto-writing method based on speech recognition
CN107564525A (en) * 2017-10-23 2018-01-09 深圳北鱼信息科技有限公司 Audio recognition method and device
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN108711421A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 A kind of voice recognition acoustic model method for building up and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Ce et al., "Design and Implementation of a Chongqing Dialect Speech Recognition System", Computer Measurement & Control *
HUANG Xiaoping, "Research on Contemporary Machine Deep Learning Methods and Applications", University of Electronic Science and Technology of China Press, 30 November 2017 *

Similar Documents

Publication Publication Date Title
US9430467B2 (en) Mobile speech-to-speech interpretation system
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
US10152965B2 (en) Learning personalized entity pronunciations
CN108288467B (en) Voice recognition method and device and voice recognition engine
CN105895103B (en) Voice recognition method and device
CN108986826A (en) Automatically generate method, electronic device and the readable storage medium storing program for executing of minutes
CN110998720A (en) Voice data processing method and electronic device supporting the same
JP7171532B2 (en) Apparatus and method for recognizing speech, apparatus and method for training speech recognition model
CN106875949B (en) Correction method and device for voice recognition
CN108447471A (en) Audio recognition method and speech recognition equipment
CN103635962A (en) Voice recognition system, recognition dictionary logging system, and audio model identifier series generation device
CN111261162B (en) Speech recognition method, speech recognition apparatus, and storage medium
US20220284882A1 (en) Instantaneous Learning in Text-To-Speech During Dialog
CN110544470B (en) Voice recognition method and device, readable storage medium and electronic equipment
CN111986675A (en) Voice conversation method, device and computer readable storage medium
KR102140391B1 (en) Search method and electronic device using the method
TW201911290A (en) System and method for language based service calls
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN111178081A (en) Semantic recognition method, server, electronic device and computer storage medium
CN111161718A (en) Voice recognition method, device, equipment, storage medium and air conditioner
CN108364655A (en) Method of speech processing, medium, device and computing device
CN108538292A (en) A kind of audio recognition method, device, equipment and readable storage medium storing program for executing
CN110809796B (en) Speech recognition system and method with decoupled wake phrases
CN113724693B (en) Voice judging method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200515)