CN112992127A - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN112992127A
Authority
CN
China
Prior art keywords: domain; field; recognized; recognition model; voice
Prior art date
Legal status
Granted
Application number
CN201911275670.3A
Other languages
Chinese (zh)
Other versions
CN112992127B (en)
Inventor
董勤波
陈展
周洪伟
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201911275670.3A priority Critical patent/CN112992127B/en
Publication of CN112992127A publication Critical patent/CN112992127A/en
Application granted granted Critical
Publication of CN112992127B publication Critical patent/CN112992127B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice recognition method, and belongs to the field of voice recognition. The method comprises the following steps: receiving a voice recognition request sent by a terminal, wherein the request carries voice data to be recognized and a corresponding first domain identifier; determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and a pre-stored correspondence between domain identifiers and domain voice recognition models; determining result text data corresponding to the voice data to be recognized using the determined domain voice recognition model; and sending the result text data to the terminal. The method and the device can improve the accuracy of voice recognition.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for speech recognition.
Background
Speech recognition technology is widely used in people's daily life, for example, to convert audio data into text data through speech recognition.
In the related art, a terminal may send a voice recognition request to a server; the server invokes a general-domain voice recognition model to recognize the voice data carried in the request and returns the recognition result to the terminal. The general-domain speech recognition model is typically a machine learning model and therefore needs to be trained before it can be used. For example, for a general-domain speech recognition model that converts audio data into text data, a technician trains it before use with a large amount of commonly used speech data and the corresponding text data as samples. The trained general-domain speech recognition model can accurately recognize commonly used speech data, such as "hello" or "who are you".
In the course of implementing the present application, the inventors found that the related art has at least the following problem:
because the samples used to train the general-domain speech recognition model are commonly used speech data and text data, the model's recognition accuracy is low for highly specialized speech data, such as medical vocabulary or communication terminology.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, which can solve the problem of low voice recognition accuracy. The technical solution is as follows:
in a first aspect, a method of speech recognition is provided, the method comprising:
receiving a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first field identifier;
determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and a corresponding relation between a domain identifier and a domain voice recognition model which are stored in advance;
determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and sending the result text data to the terminal.
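The four steps of the first aspect can be sketched as a minimal server-side handler. This is an illustrative sketch, not the patent's implementation: the request shape, the stand-in models, and all names are assumptions.

```python
# Minimal sketch of the first-aspect flow: look up a domain speech
# recognition model by the domain identifier carried in the request,
# then recognize the voice data with it. The "models" here are
# stand-in callables (audio bytes -> text), purely for illustration.
DOMAIN_MODELS = {
    "0001": lambda audio: f"<medical transcript of {len(audio)} bytes>",
    "0002": lambda audio: f"<diet transcript of {len(audio)} bytes>",
}

def handle_recognition_request(request: dict) -> str:
    """Steps 101-104: receive request, pick model, recognize, return text."""
    audio = request["voice_data"]      # voice data to be recognized
    domain_id = request["domain_id"]   # first domain identifier
    model = DOMAIN_MODELS[domain_id]   # step 102: correspondence lookup
    result_text = model(audio)         # step 103: recognition
    return result_text                 # step 104: sent back to the terminal

result = handle_recognition_request({"voice_data": b"\x00" * 8, "domain_id": "0001"})
```

In a real deployment the dictionary values would be loaded model objects rather than lambdas, but the lookup-then-recognize shape is the same.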
Optionally, the determining, based on the first domain identifier and a correspondence between a domain identifier and a domain speech recognition model stored in advance, a domain speech recognition model for recognizing the speech data to be recognized includes:
and if a first field speech recognition model corresponding to the first field identifier exists in the corresponding relation between the prestored field identifier and the field speech recognition model, determining the first field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the method further includes:
determining a first reception time of the voice recognition request;
after the determining the first domain speech recognition model as the domain speech recognition model for recognizing the speech data to be recognized, the method further includes:
if the corresponding relation among the equipment identifier of the terminal, the second field identifier and the second receiving time is stored, updating the second field identifier to the first field identifier, and updating the second receiving time to the first receiving time;
and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the method further includes:
determining a first reception time of the voice recognition request;
the determining a domain speech recognition model for recognizing the speech data to be recognized based on the first domain identifier and a corresponding relationship between a domain identifier and a domain speech recognition model stored in advance includes:
if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relationship between the domain identifier and the domain voice recognition model which are stored in advance, determining whether second receiving time corresponding to the equipment identifier of the terminal in a preset time length before the first receiving time is stored;
and if the second receiving time exists, determining a second field identification corresponding to the second receiving time, determining a second field speech recognition model corresponding to the second field identification based on the corresponding relation between the prestored field identification and the field speech recognition model, determining the second field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized, and updating the stored second receiving time corresponding to the equipment identification of the terminal as the first receiving time.
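The fallback described above, reusing the domain of the terminal's most recent request when it arrived within a preset duration, can be sketched as follows; the cache structure, the duration value, and all names are illustrative assumptions.

```python
# Sketch of the recent-domain fallback. All names and the duration value
# are illustrative assumptions, not taken from the patent.
PRESET_DURATION = 300.0  # preset time length, in seconds

# stored correspondence: device identifier -> (second domain id, second receive time)
recent_requests: dict[str, tuple[str, float]] = {}

def pick_model_by_recent_domain(device_id: str, first_receive_time: float,
                                domain_models: dict):
    """Return the domain model of this terminal's last request if that
    request arrived within the preset duration; otherwise return None."""
    record = recent_requests.get(device_id)
    if record is None:
        return None  # no second receiving time is stored
    second_domain_id, second_receive_time = record
    if first_receive_time - second_receive_time > PRESET_DURATION:
        return None  # the stored request is older than the preset duration
    # update the stored second receiving time to the first receiving time
    recent_requests[device_id] = (second_domain_id, first_receive_time)
    return domain_models.get(second_domain_id)
```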
Optionally, the method further includes:
if the second receiving time does not exist, inputting the voice data to be recognized into a general field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain recognition model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the candidate text data belonging to the domain corresponding to the third domain identifier;
and if the confidence coefficient is greater than a preset threshold value, determining a third field speech recognition model corresponding to the third field identification based on the corresponding relation between the prestored field identification and the field speech recognition model, and determining the third field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
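The confidence-gated selection above can be sketched as follows. The general-domain model and the domain recognition model are stand-in callables, and the threshold value is an assumed example rather than anything specified by the patent.

```python
def pick_model_by_classification(audio: bytes, general_model, domain_classifier,
                                 domain_models: dict, threshold: float = 0.8):
    """Recognize with the general-domain model first, classify the candidate
    text into a domain, and switch to that domain's model only when the
    classifier's confidence exceeds the preset threshold."""
    candidate_text = general_model(audio)                 # candidate text data
    third_domain_id, confidence = domain_classifier(candidate_text)
    if confidence > threshold and third_domain_id in domain_models:
        return domain_models[third_domain_id]             # third-domain model
    return general_model  # low confidence: keep the general-domain model
```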
Optionally, the method further includes:
and if the confidence coefficient is smaller than a preset threshold value, determining the universal field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the method further includes:
using the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples;
correspondingly storing the training samples and the third field identification;
and when the training samples corresponding to the third domain identifier reach a preset group number, training the third domain speech recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier.
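The sample-accumulation and retraining trigger described above can be sketched as follows; the group count and the stand-in retraining record are illustrative assumptions.

```python
from collections import defaultdict

PRESET_GROUP_COUNT = 3  # illustrative retraining threshold

training_samples = defaultdict(list)  # domain id -> [(voice data, result text), ...]
retrained = []                        # record of triggered (re)training runs

def store_sample(domain_id: str, audio: bytes, result_text: str) -> None:
    """Store (voice data, result text) pairs per domain; once a domain has
    accumulated the preset number of groups, trigger (re)training."""
    training_samples[domain_id].append((audio, result_text))
    if len(training_samples[domain_id]) >= PRESET_GROUP_COUNT:
        # stand-in for training the domain model on the stored samples
        retrained.append((domain_id, len(training_samples[domain_id])))
        training_samples[domain_id].clear()  # samples consumed by training
```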
In a second aspect, a method of speech recognition is provided, the method comprising:
acquiring voice data to be recognized;
determining a target field identifier to which the voice data to be recognized belongs;
sending a voice recognition request to a server, wherein the voice recognition request carries the voice data to be recognized and the target field identification, the voice recognition request is used for indicating the server to determine a field voice recognition model for recognizing the voice data to be recognized based on the target field identification and a corresponding relation between a pre-stored field identification and a field voice recognition model, and result text data corresponding to the voice data to be recognized is determined based on the field voice recognition model for recognizing the voice data to be recognized;
and receiving the result text data sent by the server.
Optionally, the determining the target domain identifier corresponding to the voice data to be recognized includes:
receiving a field selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
Optionally, the determining the target domain identifier corresponding to the voice data to be recognized includes:
and determining a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
In a third aspect, an apparatus for speech recognition is provided, the apparatus comprising:
a receiving module, configured to receive a voice recognition request sent by a terminal, where the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
the determining module is used for determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identification and the corresponding relation between the pre-stored domain identification and the domain voice recognition model;
the recognition module is used for determining result text data corresponding to the voice data to be recognized based on the field voice recognition model for recognizing the voice data to be recognized;
and the sending module is used for sending the result text data to the terminal.
Optionally, the determining module is configured to:
and if a first field speech recognition model corresponding to the first field identifier exists in the corresponding relation between the prestored field identifier and the field speech recognition model, determining the first field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module is further configured to:
determining a first reception time of the voice recognition request;
the device further comprises:
a storage module, configured to update a second domain identifier to the first domain identifier and update a second receiving time to the first receiving time if a correspondence between a device identifier of the terminal, the second domain identifier, and the second receiving time is stored; and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module is further configured to:
determining a first reception time of the voice recognition request;
the determining module is configured to:
if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relationship between the domain identifier and the domain voice recognition model which are stored in advance, determining whether second receiving time corresponding to the equipment identifier of the terminal in a preset time length before the first receiving time is stored; and if the second receiving time exists, determining a second field identification corresponding to the second receiving time, determining a second field speech recognition model corresponding to the second field identification based on the corresponding relation between the prestored field identification and the field speech recognition model, determining the second field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized, and updating the stored second receiving time corresponding to the equipment identification of the terminal as the first receiving time.
Optionally, the determining module is further configured to:
if the second receiving time does not exist, inputting the voice data to be recognized into a general field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain recognition model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the candidate text data belonging to the domain corresponding to the third domain identifier;
and if the confidence coefficient is greater than a preset threshold value, determining a third field speech recognition model corresponding to the third field identification based on the corresponding relation between the prestored field identification and the field speech recognition model, and determining the third field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the determining module is further configured to:
and if the confidence coefficient is smaller than a preset threshold value, determining the universal field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the apparatus further comprises:
the training module is used for taking the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples; correspondingly storing the training samples and the third field identification;
and when the training samples corresponding to the third domain identifier reach a preset group number, train the third domain speech recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier.
In a fourth aspect, an apparatus for speech recognition is provided, the apparatus comprising:
the acquisition module is used for acquiring voice data to be recognized;
the determining module is used for determining a target field identifier to which the voice data to be recognized belongs;
a sending module, configured to send a voice recognition request to a server, where the voice recognition request carries the to-be-recognized voice data and the target domain identifier, and the voice recognition request is used to instruct the server to determine a domain voice recognition model for recognizing the to-be-recognized voice data based on the target domain identifier and a correspondence between a domain identifier and a domain voice recognition model that are stored in advance, and determine result text data corresponding to the to-be-recognized voice data based on the domain voice recognition model for recognizing the to-be-recognized voice data;
and the receiving module is used for receiving the result text data sent by the server.
Optionally, the determining module is configured to:
receiving a field selection instruction input by a user;
and determine the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
Optionally, the determining module is configured to:
and determine a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
In a fifth aspect, a server is provided, which comprises a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for speech recognition according to the first aspect.
In a sixth aspect, a terminal is provided, the terminal comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement the method of speech recognition according to the second aspect.
In a seventh aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for speech recognition according to the first aspect or the second aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the corresponding relation between the field identification and the field voice recognition model is stored on the server side, and the field voice recognition model for recognizing the voice to be recognized can be determined according to the field identification corresponding to the voice to be recognized in the voice recognition request. Therefore, the voice recognition can be more targeted, and the voice belonging to different fields is recognized by adopting the corresponding field voice recognition models, so that the recognition result can be more accurate.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for speech recognition according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for speech recognition according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a terminal provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides a voice recognition method, which can be implemented by a server and a terminal together. The terminal can be a mobile phone, a notebook computer, a tablet computer, or another device. In an exemplary implementation environment, the terminal has a voice acquisition function: a user inputs voice data to be recognized to the terminal by speaking, the terminal sends the voice data to be recognized to the server, and the server recognizes it through a voice recognition model to obtain a recognition result and returns the result to the terminal.
Fig. 1 is a flowchart of a method for speech recognition according to an embodiment of the present application, where the method may be implemented by a server. Referring to fig. 1, the steps of this embodiment include:
step 101, receiving a voice recognition request sent by a terminal.
The voice recognition request carries to-be-recognized voice data and a corresponding first domain identifier.
And 102, determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identification and the corresponding relation between the pre-stored domain identification and the domain voice recognition model.
And 103, determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized.
And step 104, sending the result text data to the terminal.
Fig. 2 is a flowchart of a method for speech recognition according to an embodiment of the present application, where the method may be implemented by a terminal. Referring to fig. 2, the steps of this embodiment include:
step 201, acquiring voice data to be recognized;
step 202, determining a target field identifier to which the voice data to be recognized belongs;
step 203, sending the voice recognition request to a server.
The voice recognition request carries to-be-recognized voice data and the target field identification, the voice recognition request is used for indicating the server to determine a field voice recognition model for recognizing the to-be-recognized voice data based on the target field identification and the corresponding relation between the pre-stored field identification and the field voice recognition model, and the result text data corresponding to the to-be-recognized voice data is determined based on the field voice recognition model for recognizing the to-be-recognized voice data.
And step 204, receiving the result text data sent by the server.
Fig. 3 is a flowchart of a method for speech recognition according to an embodiment of the present application, where the method may be implemented by a server and a terminal together. Referring to fig. 3, the steps of this embodiment include:
step 301, the terminal acquires voice data to be recognized.
In implementation, the terminal may obtain the voice data to be recognized in various ways. For example, the terminal may have a voice collecting function, implemented by a voice collecting device, and the user inputs voice data by speaking. As another example, the user may collect voice data through an external voice collecting device and transmit it to the terminal. As yet another example, the user may download voice data to the terminal via the Internet. The following describes the case where the user inputs voice data to the terminal by speaking.
The terminal may provide the user with an operable interface, which may include a voice acquisition option. The user can press and hold the voice acquisition option; the terminal prompts the user, by voice or text, to input speech; the user speaks the speech to be recognized, that is, the voice data to be recognized; and the terminal stops acquiring when the user releases the option. For example, if the user presses and holds the voice acquisition option and says "hello", the terminal acquires "hello" as the voice data to be recognized.
Step 302, the terminal determines a first domain identifier to which the voice data to be recognized belongs.
The domain identifier is used to represent the domain to which the voice data belongs, such as the medical domain, the diet domain, or the communication technology domain. The domain identifier may be represented by a numeric code, for example 0001 for the medical domain and 0002 for the diet domain.
In implementation, the user may select the domain to which the voice data to be recognized belongs, and the terminal determines the corresponding domain identifier according to that selection.
In one possible implementation, the user may specify the domain of the voice data to be recognized. Accordingly, the processing in step 302 may be as follows: the terminal receives a domain selection instruction input by the user, and determines the domain identifier corresponding to the domain selection instruction as the first domain identifier to which the voice data to be recognized belongs.
In implementation, the operable interface provided by the terminal may include a domain selection option. After the user selects the domain selection option, sub-options such as the medical domain, the diet domain, the communication technology domain, and general may pop up. If the user selects one of the sub-options, that is, inputs a domain selection instruction to the terminal, the voice data to be recognized belongs to the domain represented by that sub-option. The terminal may further store a correspondence between domains and domain identifiers, for example, as shown in Table 1.
TABLE 1
Domain                      Domain identifier
Medical                     0001
Diet                        0002
Communication technology    0003
General                     0004
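Table 1 can be held on the terminal as a simple lookup table. The dictionary below, including the fallback to a preset identifier for an unrecognized selection, is an illustrative sketch; the names are assumptions.

```python
# Table 1 as a lookup table; the mapping lets the terminal turn a user's
# domain selection instruction into a domain identifier.
DOMAIN_IDS = {
    "medical": "0001",
    "diet": "0002",
    "communication technology": "0003",
    "general": "0004",
}

def domain_id_for_selection(selection: str, default_id: str = "0000") -> str:
    """Map a domain selection to its identifier; fall back to the preset
    identifier when the user made no (known) selection."""
    return DOMAIN_IDS.get(selection, default_id)
```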
After the terminal receives a domain selection instruction input by the user, the corresponding domain identifier can be determined according to the correspondence table; this domain identifier is the first domain identifier to which the voice data to be recognized belongs.
In another possible implementation, the user does not select the domain to which the voice data to be recognized belongs, and the processing in step 302 may be as follows: a preset domain identifier is determined as the first domain identifier to which the voice data to be recognized belongs.
In implementation, if the user does not select a domain, the terminal may assign a preset domain identifier to the voice data to be recognized. For example, the preset domain identifier may be 0000.
Step 303, the terminal sends a voice recognition request to the server.
The voice recognition request carries the voice data to be recognized and the first domain identification.
In implementation, the terminal sends a voice recognition request to the server. The voice recognition request may be a packet encapsulated according to a preset communication protocol, and the packet includes a field carrying the domain identifier; the terminal writes the determined first domain identifier into this field. The server receives the packet sent by the terminal, decapsulates it according to the preset communication protocol, and obtains the first domain identifier and the voice data.
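The patent does not specify the preset communication protocol, so the format below is purely an illustrative assumption: a length-prefixed JSON header carrying the identifier fields, followed by the raw voice data.

```python
import json

def encapsulate_request(device_id: str, domain_id: str, audio: bytes) -> bytes:
    """Terminal side: hypothetical 'preset communication protocol' with a
    header carrying the device and domain identifier fields."""
    header = json.dumps({"device_id": device_id, "domain_id": domain_id}).encode()
    return len(header).to_bytes(4, "big") + header + audio

def decapsulate_request(packet: bytes):
    """Server side: recover the identifier fields and the voice data."""
    header_len = int.from_bytes(packet[:4], "big")
    header = json.loads(packet[4:4 + header_len])
    audio = packet[4 + header_len:]
    return header["device_id"], header["domain_id"], audio
```

A round trip through both functions returns the original fields unchanged.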
And step 304, the server determines a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and the corresponding relationship between the pre-stored domain identifier and the domain voice recognition model.
The domain speech recognition model is a model for recognizing speech data of a specific domain. It can be a machine learning model, obtained by training on speech data of the specific domain and the corresponding text data.
In an implementation, the server may store a correspondence between the domain identification and the domain speech recognition model, for example, as shown in table 2.
TABLE 2
[Table 2, shown as an image in the original, lists the correspondence between domain identifiers and domain speech recognition models.]
According to the first domain identifier, the corresponding domain speech recognition model can be found in the correspondence table between domain identifiers and domain speech recognition models and used to recognize the speech data to be recognized. It may also happen, however, that the domain identifier carried in the speech recognition request does not exist in the correspondence table. The case where a speech recognition model corresponding to the first domain identifier exists in the correspondence and the case where it does not are described separately below.
Case I: the correspondence between domain identifiers and speech recognition models contains the first domain identifier and a corresponding first speech recognition model.
This case arises when the user has input a domain selection instruction, that is, has selected the domain to which the voice data to be recognized belongs, so the first domain identifier carried in the voice recognition request is determined by the user's selection. For every domain identifier selectable by the user, the server stores the correspondence between that domain identifier and a domain speech recognition model.
The corresponding processing may be as follows: if the first domain speech recognition model corresponding to the first domain identifier exists in the corresponding relationship between the pre-stored domain identifier and the domain speech recognition model, the first domain speech recognition model can be directly determined as the domain speech recognition model for recognizing the speech data to be recognized.
In a possible implementation manner, the voice recognition request may also carry the device identifier of the terminal that sends it, and the server may record the receiving time of the voice recognition request. The server may also record related information of the voice recognition request, which mainly includes: the device identifier of the terminal that sent the voice recognition request, the first domain identifier carried in the voice recognition request, and the receiving time of the voice recognition request. A technician can establish a buffer pool in the storage space of the server for storing the related information of received voice recognition requests. Table 3 below shows an exemplary form of storing the related information of voice recognition requests.
TABLE 3
(Table 3, reproduced as an image in the original publication, stores one row per request: the device identifier, the domain identifier, and the receiving time.)
The server may record the related information of the voice recognition request after determining the domain speech recognition model for recognizing the voice data to be recognized. The recording method may be as follows: the server records the first receiving time of the voice recognition request; if a correspondence among the device identifier of the terminal, a second domain identifier, and a second receiving time is already stored, the second domain identifier is updated to the first domain identifier and the second receiving time is updated to the first receiving time; if no such correspondence is stored, the device identifier, the first domain identifier, and the first receiving time are stored correspondingly.
The device identifier may be the IP (Internet Protocol) address of the terminal.
In implementation, among the stored correspondences of device identifier, domain identifier, and receiving time, if a correspondence whose device identifier and domain identifier both match the related information of the currently received voice recognition request is found, the receiving time in that correspondence is updated to the first receiving time of the current request. If none is found, the related information of the current voice recognition request is written directly into the correspondences among device identifier, domain identifier, and receiving time.
For example, with the related information of voice recognition requests recorded in the server as shown in Table 3 above, suppose the device identifier carried in the current voice recognition request is device identifier 1, the domain identifier is 0001, and the receiving time is receiving time 6. By looking up Table 3, it can be found that a correspondence among device identifier 1, domain identifier 0001, and receiving time 1 has already been recorded, so the receiving time in that correspondence is updated from receiving time 1 to receiving time 6.
For another example, again with Table 3 above, suppose the device identifier carried in the current voice recognition request is device identifier 1, the domain identifier is 0006, and the receiving time is receiving time 8. Looking up Table 3 finds no correspondence for device identifier 1 and domain identifier 0006, so device identifier 1, domain identifier 0006, and receiving time 8 may be stored in Table 3 as a new entry.
Case II: the correspondence between domain identifiers and speech recognition models does not contain the first domain identifier or a corresponding first speech recognition model.
This case arises when the user has not input a domain selection instruction to the terminal, that is, has not selected the domain to which the voice data to be recognized belongs; the first domain identifier carried in the voice recognition request is then the preset domain identifier allocated to the terminal, and the server stores no domain speech recognition model corresponding to this preset identifier. A domain speech recognition model therefore needs to be determined by combining the related information of previously stored voice recognition requests.
The corresponding processing may be as follows: determine the first receiving time of the voice recognition request; if no domain speech recognition model corresponding to the first domain identifier exists in the pre-stored correspondence between domain identifiers and domain speech recognition models, determine whether a second receiving time corresponding to the device identifier of the terminal is stored within a preset time length before the first receiving time. If such a second receiving time exists, determine the second domain identifier corresponding to it, determine the second-domain speech recognition model corresponding to the second domain identifier based on the pre-stored correspondence between domain identifiers and domain speech recognition models, determine the second-domain speech recognition model as the domain speech recognition model for recognizing the voice data to be recognized, and update the stored second receiving time corresponding to the device identifier of the terminal to the first receiving time.
In implementation, if no domain speech recognition model corresponding to the first domain identifier is found, whether a voice recognition request sent by the terminal has been received before can be determined by querying whether the stored related information of voice recognition requests contains the device identifier of the terminal. If the device identifier of the terminal is not found, no voice recognition request sent by this terminal has been received before. In that case, the voice data to be recognized may first be input into the general-domain speech recognition model to obtain candidate text data corresponding to the voice data to be recognized. The candidate text data may then be input into the domain recognition model, which outputs a third domain identifier and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier. If the confidence is greater than a preset threshold, the candidate text data may be considered to belong to the third domain, and the third-domain speech recognition model corresponding to the third domain identifier may be used as the domain speech recognition model for recognizing the voice data to be recognized.
If the device identifier of the terminal is found, it is judged whether the second receiving time corresponding to the device identifier in the stored correspondences among device identifier, domain identifier, and receiving time falls within the preset time length before the first receiving time. If so, the second-domain speech recognition model corresponding to the second domain identifier is looked up according to the second domain identifier corresponding to the device identifier and the second receiving time, using the correspondence between domain identifiers and domain speech recognition models, and is determined as the domain speech recognition model for recognizing the voice data to be recognized. In this case, the server may also record the related information of the current voice recognition request. Since the voice data to be recognized is recognized with the second-domain speech recognition model corresponding to the second domain identifier, only the receiving time in the queried correspondence among the device identifier, the second domain identifier, and the receiving time needs to be updated to the first receiving time.
Also in Case II, if no second receiving time corresponding to the device identifier of the terminal is found, the following processing may be performed: input the voice data to be recognized into the general-domain speech recognition model to obtain candidate text data; input the candidate text data into the domain recognition model to obtain a third domain identifier corresponding to the candidate text data and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier; if the confidence is greater than the preset threshold, determine the third-domain speech recognition model corresponding to the third domain identifier based on the pre-stored correspondence between domain identifiers and domain speech recognition models, and determine it as the domain speech recognition model for recognizing the voice data to be recognized.
The domain recognition model is a model for identifying the domain to which text data belongs; it may be a machine learning model trained on a large amount of text data and the domain identifiers corresponding to the text data as samples.
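The overall model-selection flow of Cases I and II can be sketched as follows. The window length, the 0.8 confidence threshold, the request layout, and the `classify` callback (standing in for the general-domain speech recognition model plus the domain recognition model) are all illustrative assumptions:

```python
PRESET_WINDOW = 300  # seconds; the "preset time length" is not fixed in the text

def select_model(request, known_models, history, classify, threshold=0.8):
    """Pick the model used to recognize request['voice'].

    request:      dict with 'domain_id', 'device_id', 'time', 'voice'
    known_models: domain identifier -> model (pre-stored correspondence)
    history:      (device id, domain id) -> last receiving time (Table 3)
    classify:     returns (candidate_text, third_domain_id, confidence)
    """
    did, dev, now = request["domain_id"], request["device_id"], request["time"]
    # Case I: identifier present in the pre-stored correspondence.
    if did in known_models:
        return known_models[did]
    # Case II: look for a recent request from the same device.
    recent = [(t, d) for (device, d), t in history.items()
              if device == dev and now - PRESET_WINDOW <= t <= now]
    if recent:
        t2, d2 = max(recent)
        history[(dev, d2)] = now  # only the receiving time is refreshed
        return known_models[d2]
    # No usable history: general model, then classify the candidate text.
    text, third_id, conf = classify(request["voice"])
    if conf > threshold and third_id in known_models:
        return known_models[third_id]
    return "general-domain model"
```

A low-confidence classification falls through to the general-domain model, matching the handling described in step 305 below.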
Step 305, the server determines the result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized.
The output text data differs according to the function of the domain speech recognition model used for the voice data to be recognized, and may be Chinese characters, English characters, Japanese characters, and the like.
In implementation, for the case where the confidence obtained after inputting the candidate text data into the domain recognition model in step 304 is smaller than the preset threshold, the general-domain speech recognition model may be used as the domain speech recognition model for recognizing the voice data to be recognized; accordingly, the candidate text data may be directly determined as the result text data.
In the case where the third-domain speech recognition model is determined as the domain speech recognition model for recognizing the voice data to be recognized, the voice data to be recognized may be input into the third-domain speech recognition model in this step to obtain the result text data.
In a possible implementation manner, for the case where the confidence obtained after inputting the candidate text data into the domain recognition model is greater than the preset threshold, the result text data obtained in step 305 and the voice data to be recognized corresponding to it may be used as a group of training samples. The training sample may be stored correspondingly with the domain identifier of the domain speech recognition model that produced the result text data. When the number of training samples corresponding to any domain identifier reaches a preset number, the training samples corresponding to that domain identifier can be obtained, and the domain speech recognition model corresponding to that domain identifier can be trained and updated. Afterwards, when the domain speech recognition model corresponding to that domain identifier is used again for speech recognition, the trained and updated model can be used.
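The sample-accumulation scheme above can be sketched as follows; the group count of 3 and the retrain callback are illustrative assumptions, and the actual retraining of the model is out of scope here:

```python
from collections import defaultdict

PRESET_GROUP_COUNT = 3  # illustrative; the patent leaves the number open

class SampleCollector:
    """Accumulate (voice, result text) pairs per domain identifier and
    trigger retraining once the preset number of groups is reached."""

    def __init__(self, retrain):
        self.samples = defaultdict(list)
        self.retrain = retrain  # callback: retrain(domain_id, samples)

    def add(self, domain_id, voice, text):
        self.samples[domain_id].append((voice, text))
        if len(self.samples[domain_id]) >= PRESET_GROUP_COUNT:
            # Hand the accumulated samples to the trainer and reset.
            self.retrain(domain_id, self.samples[domain_id])
            self.samples[domain_id] = []
```

The callback fires exactly when the preset count is reached, after which accumulation starts over for that domain.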
Step 306, the server sends the result text data to the terminal.
In implementation, the server returns the recognized result text data to the terminal, and also returns the domain identifier corresponding to the domain speech recognition model used in recognizing the voice data to be recognized; the terminal can display to the user, according to the domain identifier, the domain to which the speech recognition belongs.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Based on the same technical concept, an embodiment of the present application further provides a speech recognition apparatus, where the apparatus is applied to a server, and the apparatus includes: a receiving module 410, a determining module 420, an identifying module 430, and a transmitting module 440.
A receiving module 410, configured to receive a voice recognition request sent by a terminal, where the voice recognition request carries to-be-recognized voice data and a corresponding first domain identifier;
a determining module 420, configured to determine, based on the first domain identifier and a correspondence between a domain identifier and a domain speech recognition model that are stored in advance, a domain speech recognition model for recognizing the speech data to be recognized;
the recognition module 430 is configured to determine, based on the domain speech recognition model for recognizing the speech data to be recognized, result text data corresponding to the speech data to be recognized;
a sending module 440, configured to send the result text data to the terminal.
Optionally, the determining module 420 is configured to:
and if a first field speech recognition model corresponding to the first field identifier exists in the corresponding relation between the prestored field identifier and the field speech recognition model, determining the first field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized.
Optionally, the voice recognition request carries the device identifier of the terminal, and the receiving module 410 is further configured to:
determining a first reception time of the voice recognition request;
the device further comprises:
a storage module, configured to update a second domain identifier to the first domain identifier and update a second receiving time to the first receiving time if a correspondence between a device identifier of the terminal, the second domain identifier, and the second receiving time is stored; and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries the device identifier of the terminal, and the receiving module 410 is further configured to:
determining a first reception time of the voice recognition request;
the determining module 420 is configured to:
if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relationship between the domain identifier and the domain voice recognition model which are stored in advance, determining whether second receiving time corresponding to the equipment identifier of the terminal in a preset time length before the first receiving time is stored; and if the second receiving time exists, determining a second field identification corresponding to the second receiving time, determining a second field speech recognition model corresponding to the second field identification based on the corresponding relation between the prestored field identification and the field speech recognition model, determining the second field speech recognition model as the field speech recognition model for recognizing the speech data to be recognized, and updating the stored second receiving time corresponding to the equipment identification of the terminal as the first receiving time.
Optionally, the determining module 420 is further configured to:
if the second receiving time does not exist, inputting the voice data to be recognized into the general-domain speech recognition model to obtain candidate text data;
inputting the candidate text data into the domain recognition model to obtain a third domain identifier corresponding to the candidate text data and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier;
and if the confidence is greater than a preset threshold, determining the third-domain speech recognition model corresponding to the third domain identifier based on the pre-stored correspondence between domain identifiers and domain speech recognition models, and determining the third-domain speech recognition model as the domain speech recognition model for recognizing the voice data to be recognized.
Optionally, the determining module 420 is further configured to:
and if the confidence is smaller than the preset threshold, determining the general-domain speech recognition model as the domain speech recognition model for recognizing the voice data to be recognized.
Optionally, the apparatus further comprises:
the training module is used for taking the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples, and correspondingly storing the training samples and the third domain identifier;
and when the training samples corresponding to the third domain identifier reach a preset number of groups, training the third-domain speech recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier.
Based on the same technical concept, an embodiment of the present application further provides a device for speech recognition, where the device is applied to a terminal, and the device includes: an obtaining module 510, a determining module 520, a sending module 530 and a receiving module 540.
An obtaining module 510, configured to obtain voice data to be recognized;
a determining module 520, configured to determine a target domain identifier to which the voice data to be recognized belongs;
a sending module 530, configured to send the voice recognition request to a server, where the voice recognition request carries the to-be-recognized voice data and the target domain identifier, and the voice recognition request is used to instruct the server to determine a domain voice recognition model for recognizing the to-be-recognized voice data based on the target domain identifier and a correspondence between a domain identifier and a domain voice recognition model that are stored in advance, and determine result text data corresponding to the to-be-recognized voice data based on the domain voice recognition model for recognizing the to-be-recognized voice data;
a receiving module 540, configured to receive the result text data sent by the server.
Optionally, the determining module 520 is configured to:
receiving a field selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
Optionally, the determining module 520 is configured to:
and determining a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
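The two branches of the determining module 520 can be sketched in a few lines; the preset identifier value is an assumption, since the patent only states that one is allocated to the terminal:

```python
PRESET_DOMAIN_ID = "0000"  # illustrative preset identifier allocated to the terminal

def determine_domain_id(user_selection=None):
    """Terminal side: use the domain identifier from the user's domain
    selection instruction when one was input, otherwise fall back to the
    preset domain identifier."""
    return user_selection if user_selection is not None else PRESET_DOMAIN_ID
```

The fallback identifier is what triggers Case II on the server side, since it has no entry in the pre-stored correspondence.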
It should be noted that: in the speech recognition apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration in speech recognition, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules to complete all or part of the above described functions. In addition, the speech recognition apparatus provided in the above embodiments and the speech recognition method embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the method of speech recognition provided by the method embodiments herein.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a touch screen display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, providing the front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used for positioning the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 609 is used to provide power to the various components in terminal 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the touch screen display 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side frame of the terminal 600 and/or on a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the touch display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the touch display screen 605 according to the ambient light intensity collected by the optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 605 is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
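The ambient-light behavior described above can be sketched as a simple rule. This is a hypothetical illustration only; the lux thresholds, brightness range, and step size are assumptions, not values specified in the application.

```python
# Hypothetical sketch of the ambient-light rule above: raise the display
# brightness when ambient light is strong, lower it when ambient light is
# weak, and leave it unchanged in between. All numeric values are assumed.

def adjust_brightness(current, ambient_lux,
                      low_lux=50.0, high_lux=500.0, step=0.1):
    if ambient_lux >= high_lux:
        return min(1.0, current + step)  # bright surroundings: turn up
    if ambient_lux <= low_lux:
        return max(0.0, current - step)  # dim surroundings: turn down
    return current                       # moderate light: leave unchanged

print(adjust_brightness(0.5, 800.0))  # brighter
print(adjust_brightness(0.5, 10.0))   # dimmer
```

A real implementation would typically smooth the sensor readings and apply hysteresis so the backlight does not oscillate near a threshold.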
The proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 616 detects that this distance gradually decreases, the processor 601 controls the touch display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that this distance gradually increases, the processor 601 controls the touch display screen 605 to switch from the screen-off state back to the screen-on state.
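The proximity behavior can likewise be sketched as a small state rule. This is a hypothetical illustration; the distance threshold and the state names are assumptions rather than details from the application.

```python
# Hypothetical sketch of the proximity rule above: switch the screen off as
# the user approaches the front panel, and back on as the user moves away.
# The threshold value is an assumption.

def screen_state(prev_distance, distance, threshold=5.0):
    if distance < prev_distance and distance < threshold:
        return "off"   # user approaching: switch to the screen-off state
    if distance > prev_distance and distance >= threshold:
        return "on"    # user moving away: switch back to the screen-on state
    return "unchanged"

print(screen_state(10.0, 3.0))   # off
print(screen_state(3.0, 10.0))   # on
```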
Those skilled in the art will appreciate that the configuration shown in FIG. 6 does not limit the terminal 600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
FIG. 7 is a schematic structural diagram of a server 700 according to an embodiment of the present application. The server 700 may vary considerably with configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction that is loaded and executed by the processor 701 to implement the voice recognition method provided by each of the method embodiments above. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the voice recognition method of the embodiments described above. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only an exemplary embodiment of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (11)

1. A voice recognition method, applied to a server, the method comprising:
receiving a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
determining, based on the first domain identifier and a pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized;
determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized; and
sending the result text data to the terminal.
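The server-side flow of claim 1 can be sketched as a lookup-then-recognize pipeline. The class, names, and toy "models" below are hypothetical; the claim specifies no implementation, only the correspondence table and the three steps.

```python
# Hypothetical sketch of the claim-1 server flow: look up a domain-specific
# voice recognition model by the domain identifier carried in the request,
# run it on the voice data to be recognized, and return the result text.

class Server:
    def __init__(self, domain_models):
        # Pre-stored correspondence between domain identifiers and
        # domain voice recognition models (claim 1).
        self.domain_models = domain_models

    def handle_request(self, request):
        # The request carries the voice data to be recognized and a
        # first domain identifier.
        model = self.domain_models[request["domain_id"]]
        result_text = model(request["speech_data"])
        return result_text  # sent back to the terminal

# Toy "models" standing in for real recognizers.
server = Server({"medical": lambda audio: f"[medical] {audio}",
                 "legal": lambda audio: f"[legal] {audio}"})
print(server.handle_request({"domain_id": "medical", "speech_data": "utt1"}))
```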
2. The method according to claim 1, wherein the determining, based on the first domain identifier and the pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized comprises:
if a first domain voice recognition model corresponding to the first domain identifier exists in the pre-stored correspondence between domain identifiers and domain voice recognition models, determining the first domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
3. The method according to claim 2, wherein the voice recognition request carries a device identifier of the terminal, and the method further comprises:
determining a first receiving time of the voice recognition request;
wherein after the determining the first domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized, the method further comprises:
if a correspondence among the device identifier of the terminal, a second domain identifier, and a second receiving time is stored, updating the second domain identifier to the first domain identifier and updating the second receiving time to the first receiving time; and
if no correspondence among the device identifier of the terminal, a second domain identifier, and a second receiving time is stored, correspondingly storing the device identifier, the first domain identifier, and the first receiving time.
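The bookkeeping in claim 3 amounts to maintaining one (domain identifier, receiving time) entry per device, whether by update or by first insertion. A minimal sketch, with all names assumed:

```python
# Hypothetical sketch of the claim-3 bookkeeping: after a model is chosen,
# record device_id -> (domain_id, receive_time). If an entry for the device
# already exists, its second domain id and second receiving time are replaced
# by the first domain id and first receiving time; otherwise a new entry is
# stored. Both branches reduce to the same dictionary assignment.

records = {}  # device_id -> (domain_id, receive_time)

def record_request(device_id, domain_id, receive_time):
    records[device_id] = (domain_id, receive_time)

record_request("dev-1", "medical", 100.0)
record_request("dev-1", "legal", 200.0)   # updates the earlier entry
print(records["dev-1"])  # ('legal', 200.0)
```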
4. The method according to claim 1, wherein the voice recognition request carries a device identifier of the terminal, and the method further comprises:
determining a first receiving time of the voice recognition request;
wherein the determining, based on the first domain identifier and the pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized comprises:
if no domain voice recognition model corresponding to the first domain identifier exists in the pre-stored correspondence between domain identifiers and domain voice recognition models, determining whether a second receiving time corresponding to the device identifier of the terminal within a preset time length before the first receiving time is stored; and
if the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on the pre-stored correspondence between domain identifiers and domain voice recognition models, determining the second domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized, and updating the stored second receiving time corresponding to the device identifier of the terminal to the first receiving time.
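The fallback in claim 4 reuses the domain last recorded for the same device, provided the record falls within a preset time length before the current request. A hypothetical sketch; the window value and names are assumptions:

```python
# Hypothetical sketch of the claim-4 fallback: when no model matches the
# requested domain identifier, reuse the domain recorded for this device
# within a preset time length before the first receiving time, and refresh
# the stored receiving time.

PRESET_WINDOW = 300.0  # seconds; value is an assumption

def fallback_domain(records, device_id, first_recv_time):
    entry = records.get(device_id)
    if entry is None:
        return None  # no second receiving time stored for this device
    second_domain_id, second_recv_time = entry
    if first_recv_time - second_recv_time <= PRESET_WINDOW:
        # Reuse the prior domain and update the stored receiving time.
        records[device_id] = (second_domain_id, first_recv_time)
        return second_domain_id
    return None  # record too old: treated as if no second time exists

records = {"dev-1": ("medical", 100.0)}
print(fallback_domain(records, "dev-1", 250.0))  # 'medical' (within window)
print(fallback_domain(records, "dev-1", 999.0))  # None (window expired)
```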
5. The method according to claim 4, further comprising:
if the second receiving time does not exist, inputting the voice data to be recognized into a general-domain voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain recognition model to obtain a third domain identifier corresponding to the candidate text data and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier; and
if the confidence is greater than a preset threshold, determining a third domain voice recognition model corresponding to the third domain identifier based on the pre-stored correspondence between domain identifiers and domain voice recognition models, and determining the third domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
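Claim 5's two-stage selection can be sketched as: general-domain recognition, then a domain classifier gating the switch on a confidence threshold. The threshold value, the fall-back to the general model when the threshold is not cleared, and all names are assumptions beyond the claim text:

```python
# Hypothetical sketch of claim 5: with no usable history, run a general-domain
# recognizer, classify the candidate text into a domain, and switch to that
# domain's model only when the classifier's confidence exceeds a preset
# threshold. The below-threshold branch (staying with the general model) is
# an assumption; the claim does not specify it.

CONFIDENCE_THRESHOLD = 0.8  # value is an assumption

def pick_domain_model(speech, general_model, domain_classifier, domain_models):
    candidate_text = general_model(speech)
    domain_id, confidence = domain_classifier(candidate_text)
    if confidence > CONFIDENCE_THRESHOLD and domain_id in domain_models:
        return domain_models[domain_id]  # third domain voice recognition model
    return general_model

general = lambda audio: f"text({audio})"
classifier = lambda text: ("medical", 0.9)  # toy domain recognition model
models = {"medical": lambda audio: f"[medical] {audio}"}
chosen = pick_domain_model("utt1", general, classifier, models)
print(chosen("utt1"))  # recognized with the medical-domain model
```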
6. The method according to claim 5, further comprising:
using the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples;
correspondingly storing the training samples and the third domain identifier; and
when the number of groups of training samples corresponding to the third domain identifier reaches a preset number, training the third domain voice recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier.
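The accumulation-and-retrain step of claim 6 can be sketched as batching samples per domain identifier. The batch size, the clearing of the batch after training, and all names are assumptions:

```python
# Hypothetical sketch of claim 6: pair each result text with its voice data
# as one group of training samples, store it under the third domain
# identifier, and trigger training of that domain's model once a preset
# number of groups has accumulated.

PRESET_GROUP_COUNT = 3  # value is an assumption

samples_by_domain = {}

def add_sample(domain_id, speech, result_text, train_fn):
    samples = samples_by_domain.setdefault(domain_id, [])
    samples.append((speech, result_text))
    if len(samples) >= PRESET_GROUP_COUNT:
        train_fn(domain_id, samples)       # retrain the domain model
        samples_by_domain[domain_id] = []  # assumed: start a fresh batch

trained = []
for i in range(3):
    add_sample("medical", f"utt{i}", f"text{i}",
               lambda d, s: trained.append((d, len(s))))
print(trained)  # training fired once, with 3 sample groups
```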
7. A voice recognition method, applied to a terminal, the method comprising:
acquiring voice data to be recognized;
determining a target domain identifier to which the voice data to be recognized belongs;
sending a voice recognition request to a server, wherein the voice recognition request carries the voice data to be recognized and the target domain identifier, and the voice recognition request is used for instructing the server to determine, based on the target domain identifier and a pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized, and to determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized; and
receiving the result text data sent by the server.
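The terminal side of claim 7 is a simple request/response exchange. In this hypothetical sketch the transport is a plain function call standing in for whatever protocol the real system uses; the claim specifies none:

```python
# Hypothetical sketch of the terminal flow (claim 7): bundle the voice data
# to be recognized with the target domain identifier, send the request to
# the server, and receive the result text data.

def terminal_recognize(speech, target_domain_id, send_to_server):
    request = {"speech_data": speech, "domain_id": target_domain_id}
    return send_to_server(request)  # result text data from the server

# A stub "server" echoing the chosen domain, for illustration only.
stub_server = lambda req: f"[{req['domain_id']}] {req['speech_data']}"
print(terminal_recognize("utt1", "legal", stub_server))  # [legal] utt1
```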
8. The method according to claim 7, wherein the determining the target domain identifier to which the voice data to be recognized belongs comprises:
receiving a domain selection instruction input by a user; and
determining the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
9. The method according to claim 7, wherein the determining the target domain identifier to which the voice data to be recognized belongs comprises:
determining a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
10. A voice recognition apparatus, applied to a server, the apparatus comprising:
a receiving module, configured to receive a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
a determining module, configured to determine, based on the first domain identifier and a pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized;
a recognition module, configured to determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized; and
a sending module, configured to send the result text data to the terminal.
11. A voice recognition apparatus, applied to a terminal, the apparatus comprising:
an acquisition module, configured to acquire voice data to be recognized;
a determining module, configured to determine a target domain identifier to which the voice data to be recognized belongs;
a sending module, configured to send a voice recognition request to a server, wherein the voice recognition request carries the voice data to be recognized and the target domain identifier, and the voice recognition request is used for instructing the server to determine, based on the target domain identifier and a pre-stored correspondence between domain identifiers and domain voice recognition models, a domain voice recognition model for recognizing the voice data to be recognized, and to determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized; and
a receiving module, configured to receive the result text data sent by the server.
CN201911275670.3A 2019-12-12 2019-12-12 Voice recognition method and device Active CN112992127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911275670.3A CN112992127B (en) 2019-12-12 2019-12-12 Voice recognition method and device


Publications (2)

Publication Number Publication Date
CN112992127A true CN112992127A (en) 2021-06-18
CN112992127B CN112992127B (en) 2024-05-07

Family

ID=76331667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911275670.3A Active CN112992127B (en) 2019-12-12 2019-12-12 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN112992127B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842145A (en) * 2023-04-20 2023-10-03 海信集团控股股份有限公司 Domain identification method and device based on city question-answering system
WO2023205197A1 (en) * 2022-04-19 2023-10-26 Google Llc Sub-models for neural contextual biasing in speech recognition

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009294269A (en) * 2008-06-03 2009-12-17 Nec Corp Speech recognition system
CN105489221A (en) * 2015-12-02 2016-04-13 北京云知声信息技术有限公司 Voice recognition method and device
CN105679314A (en) * 2015-12-28 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN106328147A (en) * 2016-08-31 2017-01-11 中国科学技术大学 Speech recognition method and device
US20170316781A1 (en) * 2015-02-13 2017-11-02 Tencent Technology (Shenzhen) Company Limited Remote electronic service requesting and processing method, server, and terminal
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment
WO2019128552A1 (en) * 2017-12-29 2019-07-04 Oppo广东移动通信有限公司 Information pushing method, apparatus, terminal, and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Wei et al.: "Text-to-speech alignment of unannotated noisy long Chinese speech transcripts", Periodical of Ocean University of China (Natural Science Edition), vol. 45, no. 10, pages 121-126 *


Also Published As

Publication number Publication date
CN112992127B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN110764730B (en) Method and device for playing audio data
CN110278464B (en) Method and device for displaying list
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN110677713B (en) Video image processing method and device and storage medium
CN109783176B (en) Page switching method and device
CN112992127B (en) Voice recognition method and device
CN111192072A (en) User grouping method and device and storage medium
CN111753606A (en) Intelligent model upgrading method and device
CN111008083B (en) Page communication method and device, electronic equipment and storage medium
CN111611414A (en) Vehicle retrieval method, device and storage medium
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN111694521B (en) Method, device and system for storing file
CN108733831B (en) Method and device for processing word stock
CN113051015A (en) Page rendering method and device, electronic equipment and storage medium
CN110992954A (en) Method, device, equipment and storage medium for voice recognition
CN112132472A (en) Resource management method and device, electronic equipment and computer readable storage medium
CN111246240A (en) Method and apparatus for storing media data
CN111757146A (en) Video splicing method, system and storage medium
CN109286769B (en) Audio recognition method, device and storage medium
CN109194966B (en) Method and device for acquiring payload of SEI (solid electrolyte interface) message and storage medium
CN113259771B (en) Video playing method, device, system, electronic equipment and storage medium
CN111163262B (en) Method, device and system for controlling mobile terminal
CN110045999B (en) Method, device, terminal and storage medium for drawing assembly
CN110807486B (en) Method and device for generating category label, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant