CN112992127B - Voice recognition method and device - Google Patents
- Publication number
- CN112992127B CN112992127B CN201911275670.3A CN201911275670A CN112992127B CN 112992127 B CN112992127 B CN 112992127B CN 201911275670 A CN201911275670 A CN 201911275670A CN 112992127 B CN112992127 B CN 112992127B
- Authority
- CN
- China
- Prior art keywords
- domain
- voice recognition
- identifier
- recognition model
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
      - G10L15/00—Speech recognition
        - G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
          - G10L15/063—Training
        - G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
          - G10L2015/225—Feedback of the input speech
Abstract
The application discloses a voice recognition method, belonging to the field of voice recognition. The method comprises the following steps: receiving a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first domain identifier; determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and a correspondence between pre-stored domain identifiers and domain voice recognition models; determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized; and sending the result text data to the terminal. The application can improve the accuracy of voice recognition.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and apparatus for speech recognition.
Background
Speech recognition technology is widely used in people's daily lives, for example, converting audio data into text data by speech recognition.
In the related art, a terminal generally sends a voice recognition request to a server; the server invokes a general-domain voice recognition model to recognize the voice data carried in the voice recognition request and returns the recognition result to the terminal. The general-domain voice recognition model is typically a machine learning model, which needs to be trained before it can be used. For example, before using a general-domain voice recognition model that converts audio data into text data, a technician may train it on a large amount of common voice data and the corresponding text data as samples. The trained general-domain voice recognition model can then accurately recognize common voice data, such as everyday greetings like "hello".
In carrying out the present application, the inventors have found that the related art has at least the following problems:
In the related art, because the training samples are all common voice data and text data, the general-domain voice recognition model has low accuracy when recognizing highly specialized voice data, such as medical terms and communication terms.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, which can solve the problem of low voice recognition accuracy. The technical scheme is as follows:
in a first aspect, a method of speech recognition is provided, the method comprising:
Receiving a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
Determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and a corresponding relation between a pre-stored domain identifier and the domain voice recognition model;
determining result text data corresponding to the voice data to be recognized based on the field voice recognition model for recognizing the voice data to be recognized;
and sending the result text data to the terminal.
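As an illustration only (not part of the claims), the four steps of the first aspect can be sketched as a minimal server-side routine in Python; the model registry, the recognizer stubs, and all names below are invented stand-ins for trained domain voice recognition models:

```python
# Minimal sketch of the first-aspect flow: route a request to a domain-specific
# recognizer via a pre-stored domain-identifier -> model correspondence.
# make_recognizer() is a hypothetical stand-in for a trained model.

def make_recognizer(domain_name):
    def recognize(voice_data):
        # A real model would decode audio; this stub just labels the input.
        return f"[{domain_name}] transcript of {voice_data!r}"
    return recognize

# Correspondence between domain identifiers and domain voice recognition models.
DOMAIN_MODELS = {
    "0001": make_recognizer("medical"),
    "0002": make_recognizer("diet"),
}

def handle_request(request):
    """request carries the voice data to be recognized and a first domain identifier."""
    model = DOMAIN_MODELS[request["domain_id"]]      # step: determine the model
    result_text = model(request["voice_data"])       # step: determine result text
    return result_text                               # step: send back to terminal

print(handle_request({"domain_id": "0001", "voice_data": "audio-bytes"}))
```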
Optionally, the determining, based on the first domain identifier and the correspondence between the pre-stored domain identifier and the domain speech recognition model, the domain speech recognition model for recognizing the speech data to be recognized includes:
If a first domain voice recognition model corresponding to the first domain identifier exists in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the first domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the voice recognition request carries a device identifier of the terminal, and the method further includes:
Determining a first time of receipt of the speech recognition request;
after the first domain voice recognition model is determined as the domain voice recognition model for recognizing the voice data to be recognized, the method further comprises:
If the corresponding relation among the equipment identifier, the second domain identifier and the second receiving time of the terminal is stored, updating the second domain identifier to the first domain identifier, and updating the second receiving time to the first receiving time;
and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries a device identifier of the terminal, and the method further includes:
Determining a first time of receipt of the speech recognition request;
The determining, based on the first domain identifier and the correspondence between the pre-stored domain identifier and the domain voice recognition model, the domain voice recognition model for recognizing the voice data to be recognized includes:
if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored;
If the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on a corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and updating the stored second receiving time corresponding to the equipment identifier of the terminal to the first receiving time.
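The device-history fallback in this optional step can be sketched as follows. The five-minute window, the in-memory history table, and the function names are assumptions made purely for illustration; the claims say only "a preset time period":

```python
# Sketch: when the first domain identifier has no stored model, fall back to the
# domain most recently used by the same device, provided the earlier request
# falls within a preset time period before the first receiving time.

WINDOW_SECONDS = 300  # assumed preset time period
DEVICE_HISTORY = {}   # device identifier -> (second domain id, second receiving time)

def fallback_domain(device_id, first_receive_time):
    entry = DEVICE_HISTORY.get(device_id)
    if entry is None:
        return None
    second_domain_id, second_receive_time = entry
    if first_receive_time - second_receive_time > WINDOW_SECONDS:
        return None
    # As claimed: update the stored second receiving time to the first one.
    DEVICE_HISTORY[device_id] = (second_domain_id, first_receive_time)
    return second_domain_id

DEVICE_HISTORY["dev-1"] = ("0002", 1000.0)
print(fallback_domain("dev-1", 1100.0))  # within the window -> 0002
print(fallback_domain("dev-1", 9999.0))  # outside the window -> None
```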
Optionally, the method further comprises:
if the second receiving time does not exist, inputting the voice data to be recognized into a universal field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain identification model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the domain corresponding to the third domain identifier of the candidate text data;
if the confidence coefficient is larger than a preset threshold value, determining a third domain voice recognition model corresponding to the third domain identifier based on a corresponding relation between a pre-stored domain identifier and a domain voice recognition model, and determining the third domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the method further comprises:
and if the confidence coefficient is smaller than a preset threshold value, determining the universal domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
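The confidence-gated choice between a domain model and the general-domain model can be sketched as below. The threshold value is an assumption, and since the claims only specify the greater-than and less-than cases, treating exact equality as a fallback to the general model is also an assumption:

```python
# Sketch: candidate text from the general-domain model is classified by the
# domain identification model; above the preset threshold the matching third
# domain model is chosen, otherwise the general-domain model is kept.

THRESHOLD = 0.8  # assumed preset threshold

def choose_model(confidence, third_domain_id):
    if confidence > THRESHOLD:
        return third_domain_id  # use the third domain voice recognition model
    return "general"            # use the general-domain voice recognition model

print(choose_model(0.95, "0003"))  # -> 0003
print(choose_model(0.40, "0003"))  # -> general
```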
Optionally, the method further comprises:
The result text data and the voice data to be recognized corresponding to the result text data are used as a group of training samples;
correspondingly storing the training sample and the third domain identifier;
and training the third domain voice model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier when the training samples corresponding to the third domain identifier reach the preset group number.
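The sample-accumulation and retraining trigger of this optional step can be sketched as follows; the group count of 3 and the returned "retrain" marker are invented for illustration, standing in for an actual training run:

```python
from collections import defaultdict

# Sketch: each (voice data, result text) pair is stored as a training sample
# group under its third domain identifier; once the preset number of groups
# accumulates, the corresponding third domain voice model is retrained.

PRESET_GROUPS = 3  # assumed preset group number
SAMPLES = defaultdict(list)

def add_sample(domain_id, voice_data, result_text):
    SAMPLES[domain_id].append((voice_data, result_text))
    if len(SAMPLES[domain_id]) >= PRESET_GROUPS:
        batch = SAMPLES.pop(domain_id)  # consume the accumulated samples
        return f"retrain {domain_id} on {len(batch)} samples"
    return None

add_sample("0001", "a.wav", "text a")
add_sample("0001", "b.wav", "text b")
print(add_sample("0001", "c.wav", "text c"))  # -> retrain 0001 on 3 samples
```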
In a second aspect, there is provided a method of speech recognition, the method comprising:
acquiring voice data to be recognized;
Determining a target domain identifier to which the voice data to be recognized belongs;
A voice recognition request is sent to a server, wherein the voice recognition request carries the voice data to be recognized and the target domain identifier, the voice recognition request is used for indicating the server to determine a domain voice recognition model for recognizing the voice data to be recognized based on the target domain identifier and a corresponding relation between a pre-stored domain identifier and a domain voice recognition model, and the result text data corresponding to the voice data to be recognized is determined based on the domain voice recognition model for recognizing the voice data to be recognized;
And receiving the result text data sent by the server.
Optionally, the determining the target domain identifier to which the voice data to be recognized belongs includes:
Receiving a domain selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
Optionally, the determining the target domain identifier to which the voice data to be recognized belongs includes:
determining a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
In a third aspect, there is provided an apparatus for speech recognition, the apparatus comprising:
The receiving module is used for receiving a voice recognition request sent by the terminal, wherein the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
The determining module is used for determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and the corresponding relation between the pre-stored domain identifier and the domain voice recognition model;
the recognition module is used for determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and the sending module is used for sending the result text data to the terminal.
Optionally, the determining module is configured to:
If a first domain voice recognition model corresponding to the first domain identifier exists in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the first domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module is further configured to:
Determining a first time of receipt of the speech recognition request;
the apparatus further comprises:
The storage module is used for updating the second domain identifier into the first domain identifier and updating the second receiving time into the first receiving time if the corresponding relation among the equipment identifier, the second domain identifier and the second receiving time of the terminal is stored; and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module is further configured to:
Determining a first time of receipt of the speech recognition request;
The determining module is used for:
If the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored; if the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on a corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and updating the stored second receiving time corresponding to the equipment identifier of the terminal to the first receiving time.
Optionally, the determining module is further configured to:
if the second receiving time does not exist, inputting the voice data to be recognized into a universal field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain identification model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the domain corresponding to the third domain identifier of the candidate text data;
if the confidence coefficient is larger than a preset threshold value, determining a third domain voice recognition model corresponding to the third domain identifier based on a corresponding relation between a pre-stored domain identifier and a domain voice recognition model, and determining the third domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the determining module is further configured to:
and if the confidence coefficient is smaller than a preset threshold value, determining the universal domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the apparatus further includes:
the training module is used for taking the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples; correspondingly storing the training sample and the third domain identifier;
and training the third domain voice model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier when the training samples corresponding to the third domain identifier reach the preset group number.
In a fourth aspect, there is provided an apparatus for speech recognition, the apparatus comprising:
the acquisition module is used for acquiring voice data to be identified;
The determining module is used for determining a target domain identifier to which the voice data to be recognized belongs;
The voice recognition module is used for sending a voice recognition request to a server, wherein the voice recognition request carries the voice data to be recognized and the target domain identifier, the voice recognition request is used for indicating the server to determine a domain voice recognition model for recognizing the voice data to be recognized based on the target domain identifier and a corresponding relation between a prestored domain identifier and a domain voice recognition model, and determining result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
And the receiving module is used for receiving the result text data sent by the server.
Optionally, the determining module is configured to:
Receiving a domain selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as the target domain identifier to which the voice data to be recognized belongs.
Optionally, the determining module is configured to:
determining a preset domain identifier as the target domain identifier to which the voice data to be recognized belongs.
In a fifth aspect, a server is provided, the server comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of speech recognition as described in the first aspect above.
In a sixth aspect, there is provided a terminal comprising a processor and a memory having stored therein at least one instruction loaded and executed by the processor to implement the method of speech recognition as described in the second aspect above.
In a seventh aspect, there is provided a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of speech recognition as described in the first and second aspects above.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
The corresponding relation between the domain identifier and the domain voice recognition model is stored at the server side, and the domain voice recognition model for recognizing the voice to be recognized can be determined according to the domain identifier corresponding to the voice to be recognized in the voice recognition request. Thus, the voice recognition can be more targeted, the voice belonging to different fields is recognized by adopting the corresponding field voice recognition model, and the recognition result can be more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a voice recognition device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a voice recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The embodiment of the application provides a voice recognition method, which can be implemented jointly by a server and a terminal. The terminal can be a mobile phone, a notebook computer, a tablet computer, or another device. In an exemplary implementation environment of the embodiment of the application, the terminal may have a voice acquisition function, a user may input voice data to be recognized to the terminal by speaking, the terminal sends the voice data to be recognized to the server, and the server recognizes the voice data to be recognized through a voice recognition model, obtains a recognition result, and returns the recognition result to the terminal.
Fig. 1 is a flowchart of a method for speech recognition, which may be implemented by a server, according to an embodiment of the present application. Referring to fig. 1, the steps of this embodiment include:
Step 101, receiving a voice recognition request sent by a terminal.
The voice recognition request carries voice data to be recognized and a corresponding first domain identifier.
Step 102, determining a domain voice recognition model for recognizing voice data to be recognized based on the first domain identifier and a corresponding relation between the pre-stored domain identifier and the domain voice recognition model.
Step 103, determining the result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized.
And 104, sending the result text data to the terminal.
Fig. 2 is a flowchart of a method for voice recognition, which may be implemented by a terminal according to an embodiment of the present application. Referring to fig. 2, the steps of this embodiment include:
step 201, obtaining voice data to be recognized;
Step 202, determining a target domain identifier to which the voice data to be recognized belongs;
Step 203, sending the voice recognition request to a server.
The voice recognition request is used for instructing the server to determine a domain voice recognition model for recognizing the voice data to be recognized based on the target domain identifier and the correspondence between pre-stored domain identifiers and domain voice recognition models, and to determine result text data corresponding to the voice data to be recognized based on the determined domain voice recognition model.
Step 204, receiving result text data sent by the server.
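Steps 201 to 204 on the terminal side can be sketched as below; `send_to_server` is a hypothetical stub standing in for real network transport, and the default identifier 0000 follows the preset-identifier case described later in the embodiment:

```python
# Terminal-side sketch: acquire voice data, attach the domain identifier
# (user-selected or preset), send the request, receive the result text.

def send_to_server(request):
    # Stub: a real terminal would transmit the request and await the server's reply.
    return f"result text for {request['voice_data']!r} in domain {request['domain_id']}"

def terminal_recognize(voice_data, selected_domain_id=None, default_domain_id="0000"):
    # Use the identifier from the user's domain selection, else the preset one.
    domain_id = selected_domain_id or default_domain_id
    request = {"voice_data": voice_data, "domain_id": domain_id}
    return send_to_server(request)

print(terminal_recognize("hello.wav", selected_domain_id="0001"))
```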
Fig. 3 is a flowchart of a method for voice recognition according to an embodiment of the present application, where the method may be implemented by a server and a terminal together. Referring to fig. 3, the steps of this embodiment include:
Step 301, the terminal acquires voice data to be recognized.
In implementation, the terminal may acquire the voice data to be recognized in various ways. For example, the terminal may have a built-in voice acquisition function, implemented by a voice acquisition device, and the user inputs voice data to the terminal by speaking. For another example, the user may collect voice data through an external voice acquisition device and transmit it to the terminal. For another example, the user may download voice data to the terminal via the Internet. The following takes the case where the user inputs voice data to the terminal by speaking as an example.
The terminal may provide the user with an operable interface that includes a voice acquisition option. The user can long-press the voice acquisition option, the terminal can prompt the user by voice or text to input voice, and the user speaks the voice to be recognized, namely the voice data to be recognized; when the user releases the option, the terminal stops acquiring the voice data to be recognized. For example, if the user long-presses the voice acquisition option and says "hello", the terminal acquires "hello" as the voice data to be recognized.
Step 302, the terminal determines a first domain identifier to which the voice data to be recognized belongs.
The domain identifier is used to represent the domain to which the voice data belongs, such as the medical domain, the diet domain, or the communication technology domain. The domain identifier may be represented by a numeric code; for example, 0001 represents the medical domain and 0002 represents the diet domain.
In implementation, the user may also select a domain to which the voice data to be recognized belongs, and the terminal may determine a corresponding domain identifier according to the domain to which the voice data to be recognized belongs.
In one possible implementation, the user may specify the domain of the voice data to be recognized, and the processing in step 302 may be as follows: the terminal receives a domain selection instruction input by the user, and determines the domain identifier corresponding to the domain selection instruction as the first domain identifier to which the voice data to be recognized belongs.
In implementation, the operable interface that the terminal provides for the user may include a domain selection option; after the user selects it, sub-options such as the medical domain, the diet domain, the communication technology domain, and general may pop up. The user selects one of the sub-options, which amounts to inputting a domain selection instruction to the terminal, and the voice data to be recognized belongs to the domain represented by the corresponding sub-option. The terminal may also store a correspondence between domains and domain identifiers, for example as shown in Table 1.
TABLE 1
Domain | Domain identifier
Medical | 0001
Diet | 0002
Communication technology | 0003
General | 0004
… | …
After receiving the domain selection instruction input by the user, the terminal can determine the corresponding domain identifier according to the correspondence table; this domain identifier is the first domain identifier to which the voice data to be recognized belongs.
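The Table 1 lookup just described can be sketched in a few lines; the dictionary keys below are English stand-ins for the domain names in the user's sub-options:

```python
# Sketch of the Table 1 correspondence: map the domain named by the user's
# domain selection instruction to its stored domain identifier.

DOMAIN_TABLE = {
    "medical": "0001",
    "diet": "0002",
    "communication technology": "0003",
    "general": "0004",
}

def domain_id_for_selection(selected_domain):
    # The looked-up value is the first domain identifier carried in the request.
    return DOMAIN_TABLE[selected_domain]

print(domain_id_for_selection("diet"))  # -> 0002
```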
In another possible implementation, the user does not select the domain to which the voice data to be recognized belongs, and the processing in step 302 may be as follows: determining a preset domain identifier as the first domain identifier to which the voice data to be recognized belongs.
In implementation, if the user does not select the domain to which the voice data to be recognized belongs, the terminal may assign a preset domain identifier to the voice data to be recognized. For example, the preset domain identifier may be 0000.
Step 303, the terminal sends a voice recognition request to the server.
The voice recognition request carries voice data to be recognized and a first field identifier.
In implementation, the terminal sends a voice recognition request to the server. The voice recognition request may be a message encapsulated according to a preset communication protocol, and the message includes a field carrying the domain identifier; the terminal writes the determined first domain identifier into this field. The server receives the message sent by the terminal and decapsulates it according to the preset communication protocol to obtain the first domain identifier and the voice data.
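The description does not name the preset communication protocol, so the sketch below uses JSON purely as a stand-in for the encapsulation and decapsulation of the request message:

```python
import json

# Sketch: encapsulate the voice recognition request with a field carrying the
# domain identifier, and decapsulate it on the server side. JSON is an assumed
# stand-in for the unspecified preset communication protocol.

def encapsulate(voice_data, domain_id):
    return json.dumps({"domain_id": domain_id, "voice_data": voice_data})

def decapsulate(message):
    body = json.loads(message)
    return body["domain_id"], body["voice_data"]

msg = encapsulate("audio-bytes", "0001")
print(decapsulate(msg))  # -> ('0001', 'audio-bytes')
```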
Step 304, the server determines a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and the correspondence between the pre-stored domain identifier and the domain voice recognition model.
The domain voice recognition model is a model for recognizing voice data in a specific domain, and can be a machine learning model and is obtained through training of the voice data in the specific domain and corresponding text data.
In implementations, the server may store a correspondence of the domain identification and the domain speech recognition model, e.g., as shown in table 2.
TABLE 2
According to the first domain identifier, the corresponding domain voice recognition model for recognizing the voice data to be recognized can be looked up in the correspondence table between domain identifiers and domain voice recognition models. Of course, it may also happen that the domain identifier carried in the voice recognition request does not exist in the correspondence table. The cases where the first domain identifier and its corresponding domain voice recognition model are present in the correspondence and where they are not are described separately below.
In the first case, the corresponding relation between the domain identifier and the voice recognition model is provided with a first domain identifier and a corresponding first voice recognition model.
The reason for this case is that the user has input a domain selection instruction, that is, has selected the domain to which the voice data to be recognized belongs, and the first domain identifier carried in the voice recognition request is then determined according to the user's selection. For every user-selectable domain identifier, a correspondence between that domain identifier and a domain voice recognition model is stored in the server.
The corresponding process may be as follows: if there is a first domain voice recognition model corresponding to the first domain identifier in the correspondence between the pre-stored domain identifier and the domain voice recognition model, the first domain voice recognition model may be directly determined as a domain voice recognition model for recognizing the voice data to be recognized.
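The first case amounts to a direct table lookup, which can be sketched as a dictionary query. The identifiers and model names below are hypothetical placeholders, not values taken from the embodiment:

```python
# Hypothetical correspondence table (cf. Table 2): domain identifier -> model handle.
DOMAIN_MODELS = {
    "0001": "music_domain_model",
    "0002": "diet_domain_model",
    "0003": "navigation_domain_model",
}

def find_domain_model(first_domain_id):
    """First case: if the identifier is present in the correspondence, its
    model is determined directly as the model for recognizing the voice data;
    None signals the second case (identifier absent from the table)."""
    return DOMAIN_MODELS.get(first_domain_id)
```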
In one possible implementation, the voice recognition request may also carry a device identifier of the terminal that sends the voice recognition request, and the server may record the receiving time of the voice recognition request. The server may record the relevant information of the voice recognition request, which mainly includes: the device identifier of the terminal sending the voice recognition request, the first domain identifier carried in the voice recognition request, and the receiving time of the voice recognition request. A technician may establish a buffer pool in the storage space of the server for storing the relevant information of received voice recognition requests. Table 3 below shows an exemplary form of storing the information associated with a voice recognition request.
TABLE 3
After determining the domain voice recognition model for recognizing the voice data to be recognized, the server may record the relevant information of the voice recognition request, and the recording method may be as follows: the server records the first receiving time of the voice recognition request, and if the corresponding relation among the equipment identifier, the second domain identifier and the second receiving time of the terminal is stored, the second domain identifier is updated to be the first domain identifier, and the second receiving time is updated to be the first receiving time; and if the corresponding relation among the equipment identifier, the second domain identifier and the second receiving time of the terminal is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
The device identifier may be, for example, the IP (Internet Protocol) address of the terminal.
In an implementation, in the correspondence among device identifiers, domain identifiers and receiving times, if a correspondence matching the device identifier and the domain identifier in the relevant information of the received voice recognition request is found, the receiving time in the queried correspondence is updated to the first receiving time of the received voice recognition request. If no such correspondence is found, the relevant information of the received voice recognition request is written directly into the correspondence among device identifiers, domain identifiers and receiving times.
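The update rule just described can be sketched against a Table 3-style row list. The row layout is an assumption for illustration; the embodiment leaves the storage format open.

```python
def record_request(table, device_id, domain_id, receive_time):
    """table holds [device_id, domain_id, receive_time] rows, as in Table 3.
    A row matching both the device identifier and the domain identifier only
    has its receiving time refreshed; otherwise the request's relevant
    information is appended as a new row."""
    for row in table:
        if row[0] == device_id and row[1] == domain_id:
            row[2] = receive_time  # e.g. receiving time 1 -> receiving time 6
            return
    table.append([device_id, domain_id, receive_time])
```

The two worked examples that follow (updating receiving time 1 to receiving time 6, and appending a new row for domain identifier 0006) are exactly the two branches of this function.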
For example, as shown in Table 3 above, the relevant information of the voice recognition request recorded in the server is device identifier 1, domain identifier 0001, and receiving time 6. By looking up Table 3, it can be found that a correspondence of device identifier 1, domain identifier 0001 and receiving time 1 has already been recorded, so the receiving time in that correspondence is updated from receiving time 1 to receiving time 6.
For another example, as shown in Table 3 above, the relevant information of the voice recognition request recorded in the server is device identifier 1, domain identifier 0006, and receiving time 8. By looking up Table 3, no correspondence among device identifier 1, domain identifier 0006 and a receiving time is found, so device identifier 1, domain identifier 0006 and receiving time 8 are stored correspondingly in Table 3.
And in the second case, the corresponding relation between the domain identifier and the voice recognition model does not contain the first domain identifier and the corresponding first voice recognition model.
The reason for this is that the user does not input a domain selection instruction to the terminal, that is, does not select the domain to which the voice data to be recognized belongs, and then the first domain identifier carried in the voice recognition request is a preset domain identifier allocated to the terminal, and the preset domain identifier does not store a corresponding domain voice recognition model in the server. Then the domain voice recognition model corresponding to the preset domain identifier needs to be determined in combination with the information related to the voice recognition request stored before.
The corresponding process may be as follows: determining a first receiving time of a voice recognition request, and if a domain voice recognition model corresponding to the first domain identifier does not exist in a pre-stored corresponding relation between the domain identifier and the domain voice recognition model, determining whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored; if the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on a corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and updating the second receiving time corresponding to the stored equipment identifier of the terminal to the first receiving time.
In an implementation, if the domain voice recognition model corresponding to the first domain identifier is not found, whether a voice recognition request sent by the terminal has been received before can be determined by querying whether the device identifier of the terminal exists in the stored information related to voice recognition requests. If the device identifier of the terminal is not found, it is considered that no voice recognition request sent by the terminal has been received before. Then, the voice data to be recognized can first be input into the general-domain voice recognition model to obtain candidate text data corresponding to the voice data to be recognized. Next, the candidate text data may be input into a domain recognition model, which may output a third domain identifier and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier. If the confidence is greater than a preset threshold, the candidate text data may be considered to belong to the third domain. Then, a third domain voice recognition model corresponding to the third domain identifier may be used as the domain voice recognition model for recognizing the voice data to be recognized.
If the device identifier of the terminal is found, it is judged whether, in the stored correspondence among device identifiers, domain identifiers and receiving times, the second receiving time corresponding to the device identifier is within a preset duration before the first receiving time. If so, according to the second domain identifier corresponding to the device identifier and the second receiving time, and the correspondence between domain identifiers and domain voice recognition models, the second domain voice recognition model corresponding to the second domain identifier is looked up and determined as the domain voice recognition model for recognizing the voice data to be recognized. In this case, the server may record the relevant information of the current voice recognition request. Because the second domain voice recognition model corresponding to the second domain identifier is used for voice recognition of the voice data to be recognized, it is only necessary to update the receiving time in the queried correspondence among the device identifier, the second domain identifier and the receiving time to the first receiving time.
In the second case, if the second receiving time corresponding to the device identifier of the terminal is not queried, the following process may be performed: and inputting the voice data to be recognized into a universal field voice recognition model to obtain candidate text data. And inputting the candidate text data into a domain identification model to obtain a third domain identifier corresponding to the candidate text data and a confidence that the candidate text data belongs to the domain corresponding to the third domain identifier. If the confidence coefficient is larger than the preset threshold value, a third domain voice recognition model corresponding to the third domain identifier is determined based on the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, and the third domain voice recognition model is determined to be the domain voice recognition model for recognizing the voice data to be recognized.
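The full selection chain of the second case (reuse a recently used domain if one exists within the preset duration; otherwise run the general-domain model and classify the candidate text) can be sketched as follows. The `window` and `threshold` values, the `"general"` identifier, and the two stub callbacks are all assumptions standing in for the preset duration, the preset threshold, and the actual models:

```python
def resolve_model(domain_models, history, device_id, first_domain_id, now,
                  audio, general_recognize, classify_domain,
                  window=300, threshold=0.8):
    """Return the domain identifier whose model should recognize the audio.
    history maps device_id -> [last_domain_id, last_receive_time]."""
    if first_domain_id in domain_models:           # first case: identifier known
        return first_domain_id
    last = history.get(device_id)
    if last is not None and 0 <= now - last[1] <= window:
        last[1] = now                              # refresh the receiving time
        return last[0]                             # reuse the second domain id
    candidate_text = general_recognize(audio)      # general-domain first pass
    third_id, confidence = classify_domain(candidate_text)
    if confidence > threshold and third_id in domain_models:
        return third_id                            # third domain identifier
    return "general"                               # below threshold: general model
```

Returning `"general"` in the last branch corresponds to the situation handled in step 305, where the general-domain model's candidate text is taken directly as the result text data.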
The domain recognition model is a model for recognizing a domain to which text data belongs, and the model can be a machine learning model obtained by training a large number of text data and domain identifiers corresponding to the text data as samples.
Step 305, the server determines the result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized.
Depending on the function of the domain voice recognition model used to recognize the voice data to be recognized, the output text data differs and may be Chinese characters, English characters, Japanese characters, and the like.
In implementation, for the case that the confidence coefficient obtained after the candidate text data is input into the domain recognition model in the step 304 is smaller than the preset threshold, the universal domain speech recognition model may be used as the domain speech recognition model for recognizing the speech data to be recognized this time, and then, correspondingly, the candidate text data may be directly determined as the result text data.
In the case where the third domain speech recognition model is determined as the domain speech recognition model for recognizing the speech data to be recognized, the speech data to be recognized may be input into the third domain speech recognition model in this step, resulting in text data.
In one possible implementation manner, for the situation that the confidence coefficient obtained after the candidate text data is input into the domain recognition model is greater than the preset threshold, the result text data obtained in step 305 and the voice data to be recognized corresponding to the result text data may be used as a set of training samples. And the training sample and the domain identifier corresponding to the domain voice recognition model for obtaining the result text data in the training sample can be correspondingly stored. When the number of training samples corresponding to any domain identifier reaches a preset number, the training samples corresponding to the domain identifier can be obtained, and training and updating are carried out on the domain voice recognition model corresponding to the domain identifier. Then, when the domain voice recognition model corresponding to the domain identifier is used again for voice recognition, the domain voice recognition model after training and updating can be used.
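The sample-accumulation-and-retrain loop above can be sketched as follows. The preset number of groups, the callback signature, and the in-memory storage are illustrative assumptions; the embodiment only requires that retraining be triggered once enough samples per domain identifier have accumulated.

```python
from collections import defaultdict

PRESET_GROUPS = 3   # hypothetical "preset number" of sample groups

class SampleCollector:
    """Accumulates (audio, result text) training-sample pairs per domain
    identifier; once the preset number of groups is reached, hands them to
    a retraining callback and clears the accumulated samples."""
    def __init__(self, retrain):
        self.samples = defaultdict(list)
        self.retrain = retrain          # callback: (domain_id, samples) -> None

    def add(self, domain_id, audio, result_text):
        self.samples[domain_id].append((audio, result_text))
        if len(self.samples[domain_id]) >= PRESET_GROUPS:
            self.retrain(domain_id, list(self.samples[domain_id]))
            self.samples[domain_id].clear()
```

After the callback runs, subsequent recognition in that domain would use the retrained, updated domain voice recognition model, as the paragraph above describes.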
And 306, the server sends the result text data to the terminal.
In an implementation, the server returns the recognized result text data to the terminal. Meanwhile, when the voice data to be recognized is recognized, the domain identifier corresponding to the domain voice recognition model that was used may also be returned to the terminal, and the terminal may display this domain identifier to show the user the domain to which the current voice recognition belongs.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Based on the same technical concept, the embodiment of the application also provides a device for voice recognition, which is applied to a server and comprises: a receiving module 410, a determining module 420, an identifying module 430 and a transmitting module 440.
A receiving module 410, configured to receive a voice recognition request sent by a terminal, where the voice recognition request carries voice data to be recognized and a corresponding first domain identifier;
A determining module 420, configured to determine a domain speech recognition model for recognizing the speech data to be recognized based on the first domain identifier and a correspondence between a pre-stored domain identifier and a domain speech recognition model;
The recognition module 430 is configured to determine, based on the domain voice recognition model for recognizing the voice data to be recognized, result text data corresponding to the voice data to be recognized;
And a sending module 440, configured to send the result text data to the terminal.
Optionally, the determining module 420 is configured to:
If a first domain voice recognition model corresponding to the first domain identifier exists in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the first domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module 410 is further configured to:
Determining a first time of receipt of the speech recognition request;
the apparatus further comprises:
The storage module is used for updating the second domain identifier into the first domain identifier and updating the second receiving time into the first receiving time if the corresponding relation among the equipment identifier, the second domain identifier and the second receiving time of the terminal is stored; and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
Optionally, the voice recognition request carries an equipment identifier of the terminal, and the receiving module 410 is further configured to:
Determining a first time of receipt of the speech recognition request;
The determining module 420 is configured to:
If the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored; if the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on a corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and updating the stored second receiving time corresponding to the equipment identifier of the terminal to the first receiving time.
Optionally, the determining module 420 is further configured to:
if the second receiving time does not exist, inputting the voice data to be recognized into a universal field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain identification model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the domain corresponding to the third domain identifier of the candidate text data;
if the confidence coefficient is larger than a preset threshold value, determining a third domain voice recognition model corresponding to the third domain identifier based on a corresponding relation between a pre-stored domain identifier and a domain voice recognition model, and determining the third domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the determining module 420 is further configured to:
and if the confidence coefficient is smaller than a preset threshold value, determining the universal domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
Optionally, the apparatus further includes:
the training module is used for taking the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples; correspondingly storing the training sample and the third domain identifier;
and training the third domain voice recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier when the number of training samples corresponding to the third domain identifier reaches a preset number of groups.
Based on the same technical concept, the embodiment of the application also provides a device for voice recognition, which is applied to a terminal and comprises: the system comprises an acquisition module 510, a determination module 520, a transmission module 530 and a reception module 540.
An obtaining module 510, configured to obtain voice data to be recognized;
A determining module 520, configured to determine a target domain identifier to which the voice data to be identified belongs;
A sending module 530, configured to send the voice recognition request to a server, where the voice recognition request carries the voice data to be recognized and the target domain identifier, where the voice recognition request is used to instruct the server to determine a domain voice recognition model for recognizing the voice data to be recognized based on the target domain identifier and a correspondence between a pre-stored domain identifier and a domain voice recognition model, and determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and a receiving module 540, configured to receive the result text data sent by the server.
Optionally, the determining module 520 is configured to:
Receiving a domain selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as a first domain identifier to which the voice data to be recognized belongs.
Optionally, the determining module 520 is configured to:
And determining a preset domain identifier as the first domain identifier to which the voice data to be recognized belongs.
It should be noted that: in the voice recognition device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules to perform all or part of the functions described above. In addition, the device for voice recognition provided in the above embodiment and the method embodiment for voice recognition belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
Processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 601 may also include a main processor and a coprocessor; the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may integrate a GPU (Graphics Processing Unit) for rendering and drawing content that needs to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the method of speech recognition provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603, and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 603 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 604, a touch display 605, a camera 606, audio circuitry 607, a positioning component 608, and a power supply 609.
Peripheral interface 603 may be used to connect at least one Input/Output (I/O) related peripheral to processor 601 and memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 601, memory 602, and peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The radio frequency circuit 604 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, the display 605 also has the ability to collect touch signals at or above the surface of the display 605. The touch signal may be input as a control signal to the processor 601 for processing. At this point, the display 605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 605 may be one, providing a front panel of the terminal 600; in other embodiments, the displays 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display, disposed on a curved surface or a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 607 may also include a headphone jack.
The location component 608 is used to locate the current geographic location of the terminal 600 to enable navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
A power supply 609 is used to power the various components in the terminal 600. The power source 609 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 further includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyroscope sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 601 may control the touch display screen 605 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 611. The acceleration sensor 611 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 may collect a 3D motion of the user on the terminal 600 in cooperation with the acceleration sensor 611. The processor 601 may implement the following functions based on the data collected by the gyro sensor 612: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed at a side frame of the terminal 600 and/or at a lower layer of the touch screen 605. When the pressure sensor 613 is disposed at a side frame of the terminal 600, a grip signal of the terminal 600 by a user may be detected, and a left-right hand recognition or a shortcut operation may be performed by the processor 601 according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the touch display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used to collect a fingerprint of a user, and the processor 601 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When a physical key or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical key or vendor Logo.
The optical sensor 615 is used to collect ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the touch display screen 605 based on the ambient light intensity collected by the optical sensor 615: when the ambient light intensity is high, the display brightness of the touch display screen 605 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 615.
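The brightness adjustment above can be sketched as a simple mapping from ambient light to a display level (a minimal illustration; the thresholds, step size, and function names are invented, not taken from the patent):

```python
# Hypothetical sketch of the ambient-light brightness logic: high ambient light
# turns the display up, low ambient light turns it down (all values invented).

def adjust_brightness(ambient_lux: float, current: int) -> int:
    """Return a new brightness level (0-255) for the touch display screen."""
    if ambient_lux > 1000:            # bright surroundings: turn brightness up
        return min(current + 32, 255)
    if ambient_lux < 50:              # dim surroundings: turn brightness down
        return max(current - 32, 0)
    return current                    # moderate light: leave brightness as-is
```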
The proximity sensor 616, also referred to as a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front face of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually decreases, the processor 601 controls the touch display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the screen-off state to the screen-on state.
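The proximity-sensor behavior reduces to comparing two successive distance readings. A minimal sketch (function and state names invented for illustration):

```python
# Hypothetical sketch of the proximity-sensor logic: the screen switches off as
# the user approaches the front face, and back on as the user moves away.

def screen_state(prev_distance_cm: float, curr_distance_cm: float, curr_state: str) -> str:
    """Return the new screen state given two successive distance readings."""
    if curr_distance_cm < prev_distance_cm:   # user approaching the front face
        return "off"
    if curr_distance_cm > prev_distance_cm:   # user moving away
        return "on"
    return curr_state                         # distance unchanged: keep state
```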
Those skilled in the art will appreciate that the structure shown in fig. 6 does not limit the terminal 600, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application. The server 700 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one instruction is stored in the memories 702, and the at least one instruction is loaded and executed by the processors 701 to implement the method of speech recognition provided in the foregoing method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the method of speech recognition in the embodiments described above. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the application and is not intended to limit the application; any modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.
Claims (10)
1. A method of speech recognition, the method being applied to a server, the method comprising:
receiving a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized, a corresponding first domain identifier and a device identifier of the terminal;
determining a first receiving time of the voice recognition request;
if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored;
if the second receiving time exists, determining a second domain identifier corresponding to the second receiving time, determining a second domain voice recognition model corresponding to the second domain identifier based on a corresponding relation between the pre-stored domain identifier and a domain voice recognition model, determining the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and updating the stored second receiving time corresponding to the equipment identifier of the terminal to the first receiving time;
determining result text data corresponding to the voice data to be recognized based on the field voice recognition model for recognizing the voice data to be recognized;
and sending the result text data to the terminal.
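The server-side selection recited in claim 1 can be sketched as follows (an illustrative Python sketch, not part of the claims; the model registry, function names, and window length are invented):

```python
# Hypothetical sketch of claim 1: prefer the model registered for the requested
# domain; otherwise fall back to the domain this device used most recently
# within a preset time period before the current receiving time.

PRESET_WINDOW_S = 300  # assumed length of the "preset time period"

domain_models = {"music": "music_asr_model"}  # domain identifier -> model
last_request = {}  # device identifier -> (domain identifier, receiving time)

def select_model(device_id, domain_id, receive_time):
    """Pick the domain model for a request, or None if no domain is known."""
    model = domain_models.get(domain_id)
    if model is not None:
        # a model exists for the requested domain (claims 2-3 path)
        last_request[device_id] = (domain_id, receive_time)
        return model
    prev = last_request.get(device_id)
    if prev is not None:
        prev_domain, prev_time = prev
        if receive_time - prev_time <= PRESET_WINDOW_S and prev_domain in domain_models:
            # reuse this device's most recent domain and refresh its timestamp
            last_request[device_id] = (prev_domain, receive_time)
            return domain_models[prev_domain]
    return None  # no recent domain: claim 4's general-model path applies
```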
2. The method according to claim 1, wherein the determining a domain voice recognition model for recognizing the voice data to be recognized based on the first domain identifier and a correspondence between a pre-stored domain identifier and a domain voice recognition model includes:
if a first domain voice recognition model corresponding to the first domain identifier exists in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determining the first domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized.
3. The method according to claim 2, wherein the method further comprises:
if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is stored, updating the second domain identifier to the first domain identifier, and updating the second receiving time to the first receiving time;
and if the corresponding relation among the equipment identifier of the terminal, the second domain identifier and the second receiving time is not stored, correspondingly storing the equipment identifier, the first domain identifier and the first receiving time.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
if the second receiving time does not exist, inputting the voice data to be recognized into a universal field voice recognition model to obtain candidate text data;
inputting the candidate text data into a domain identification model to obtain a third domain identifier corresponding to the candidate text data and a confidence coefficient of the domain corresponding to the third domain identifier of the candidate text data;
if the confidence coefficient is larger than a preset threshold value, determining a third domain voice recognition model corresponding to the third domain identifier based on a corresponding relation between a pre-stored domain identifier and a domain voice recognition model, and determining the third domain voice recognition model as the domain voice recognition model for recognizing the voice data to be recognized.
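The fallback path of claim 4 can be sketched as follows (an illustrative sketch; the model objects, classifier, and threshold value are invented):

```python
# Hypothetical sketch of claim 4: when no recent domain is known, transcribe
# with a general-domain model, classify the candidate text into a domain, and
# switch to that domain's model only if the classifier is confident enough.

CONFIDENCE_THRESHOLD = 0.8  # assumed value of the "preset threshold"

def pick_domain_model(audio, general_model, domain_classifier, domain_models):
    candidate_text = general_model(audio)             # candidate text data
    domain_id, confidence = domain_classifier(candidate_text)
    if confidence > CONFIDENCE_THRESHOLD and domain_id in domain_models:
        return domain_models[domain_id]               # third domain model
    return general_model                              # keep the general model
```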
5. The method according to claim 4, wherein the method further comprises:
taking the result text data and the voice data to be recognized corresponding to the result text data as a group of training samples;
correspondingly storing the training sample and the third domain identifier;
and when the training samples corresponding to the third domain identifier reach a preset group number, training the third domain voice recognition model corresponding to the third domain identifier according to the stored training samples corresponding to the third domain identifier.
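The sample-accumulation scheme of claim 5 can be sketched as follows (an illustrative sketch; the storage structure, group count, and callback are invented):

```python
from collections import defaultdict

# Hypothetical sketch of claim 5: store (audio, result text) pairs per domain
# and trigger retraining once a preset number of groups has been collected.

PRESET_GROUP_COUNT = 3  # assumed value of the "preset group number"

samples = defaultdict(list)  # domain identifier -> list of training samples

def add_sample(domain_id, audio, result_text, train_fn):
    """Store one training sample; retrain the domain model when enough exist."""
    samples[domain_id].append((audio, result_text))
    if len(samples[domain_id]) >= PRESET_GROUP_COUNT:
        train_fn(domain_id, samples[domain_id])  # retrain on stored samples
        samples[domain_id].clear()               # start a fresh batch
```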
6. A method of speech recognition, the method being applied to a terminal, the method comprising:
acquiring voice data to be recognized;
determining a first domain identifier to which the voice data to be recognized belongs;
sending a voice recognition request to a server, wherein the voice recognition request carries the voice data to be recognized, the first domain identifier and the device identifier of the terminal, and the voice recognition request is used for instructing the server to: determine a first receiving time of the voice recognition request; if a domain voice recognition model corresponding to the first domain identifier does not exist in a corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determine whether a second receiving time corresponding to the device identifier of the terminal in a preset time period before the first receiving time is stored; if the second receiving time exists, determine a second domain identifier corresponding to the second receiving time, determine a second domain voice recognition model corresponding to the second domain identifier based on the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determine the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and update the stored second receiving time corresponding to the device identifier of the terminal to the first receiving time; and determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and receiving the result text data sent by the server.
7. The method of claim 6, wherein the determining a first domain identifier to which the voice data to be recognized belongs comprises:
receiving a domain selection instruction input by a user;
and determining the domain identifier corresponding to the domain selection instruction as a first domain identifier to which the voice data to be recognized belongs.
8. The method of claim 6, wherein the determining a first domain identifier to which the voice data to be recognized belongs comprises:
determining a preset domain identifier as the first domain identifier to which the voice data to be recognized belongs.
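Claims 7 and 8 describe two ways the terminal obtains the first domain identifier; together they reduce to a user selection with a preset default (an illustrative sketch; the default value and function name are invented):

```python
# Hypothetical sketch of claims 7-8: use the domain from an explicit user
# selection when one exists, else fall back to a preset domain identifier.

DEFAULT_DOMAIN_ID = "general"  # assumed preset domain identifier

def first_domain_identifier(user_selection=None):
    """Return the first domain identifier to attach to the voice request."""
    return user_selection if user_selection is not None else DEFAULT_DOMAIN_ID
```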
9. An apparatus for speech recognition, the apparatus being applied to a server, the apparatus comprising:
a receiving module, configured to receive a voice recognition request sent by a terminal, wherein the voice recognition request carries voice data to be recognized, a corresponding first domain identifier and a device identifier of the terminal;
a determining module, configured to: determine a first receiving time of the voice recognition request; if the domain voice recognition model corresponding to the first domain identifier does not exist in the corresponding relation between the pre-stored domain identifier and the domain voice recognition model, determine whether a second receiving time corresponding to the equipment identifier of the terminal in a preset time period before the first receiving time is stored; and if the second receiving time exists, determine a second domain identifier corresponding to the second receiving time, determine a second domain voice recognition model corresponding to the second domain identifier based on the corresponding relation between the pre-stored domain identifier and a domain voice recognition model, determine the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and update the stored second receiving time corresponding to the equipment identifier of the terminal to the first receiving time;
a recognition module, configured to determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and a sending module, configured to send the result text data to the terminal.
10. An apparatus for speech recognition, the apparatus being applied to a terminal, the apparatus comprising:
an acquisition module, configured to acquire voice data to be recognized;
a determining module, configured to determine a first domain identifier to which the voice data to be recognized belongs;
a sending module, configured to send a voice recognition request to a server, where the voice recognition request carries the voice data to be recognized, the first domain identifier and the device identifier of the terminal, and the voice recognition request is used to instruct the server to: determine a first receiving time of the voice recognition request; if a domain voice recognition model corresponding to the first domain identifier does not exist in a correspondence between pre-stored domain identifiers and domain voice recognition models, determine whether a second receiving time corresponding to the device identifier of the terminal in a preset time period before the first receiving time is stored; if the second receiving time exists, determine a second domain identifier corresponding to the second receiving time, determine a second domain voice recognition model corresponding to the second domain identifier based on the correspondence between pre-stored domain identifiers and domain voice recognition models, determine the second domain voice recognition model as a domain voice recognition model for recognizing the voice data to be recognized, and update the stored second receiving time corresponding to the device identifier of the terminal to the first receiving time; and determine result text data corresponding to the voice data to be recognized based on the domain voice recognition model for recognizing the voice data to be recognized;
and a receiving module, configured to receive the result text data sent by the server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911275670.3A CN112992127B (en) | 2019-12-12 | 2019-12-12 | Voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112992127A CN112992127A (en) | 2021-06-18 |
CN112992127B true CN112992127B (en) | 2024-05-07 |
Family
ID=76331667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911275670.3A Active CN112992127B (en) | 2019-12-12 | 2019-12-12 | Voice recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112992127B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114299928A (en) * | 2021-12-21 | 2022-04-08 | 北京声智科技有限公司 | Speech recognition method, speech recognition device, server and storage medium |
US20230335122A1 (en) * | 2022-04-19 | 2023-10-19 | Google Llc | Sub-models For Neural Contextual Biasing |
CN116842145B (en) * | 2023-04-20 | 2024-02-27 | 海信集团控股股份有限公司 | Domain identification method and device based on city question-answering system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009294269A (en) * | 2008-06-03 | 2009-12-17 | Nec Corp | Speech recognition system |
CN105489221A (en) * | 2015-12-02 | 2016-04-13 | 北京云知声信息技术有限公司 | Voice recognition method and device |
CN105679314A (en) * | 2015-12-28 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN106328147A (en) * | 2016-08-31 | 2017-01-11 | 中国科学技术大学 | Speech recognition method and device |
CN108091328A (en) * | 2017-11-20 | 2018-05-29 | 北京百度网讯科技有限公司 | Speech recognition error correction method, device and readable medium based on artificial intelligence |
CN108711422A (en) * | 2018-05-14 | 2018-10-26 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer readable storage medium and computer equipment |
CN109272995A (en) * | 2018-09-26 | 2019-01-25 | 出门问问信息科技有限公司 | Audio recognition method, device and electronic equipment |
WO2019128552A1 (en) * | 2017-12-29 | 2019-07-04 | Oppo广东移动通信有限公司 | Information pushing method, apparatus, terminal, and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104683456B (en) * | 2015-02-13 | 2017-06-23 | 腾讯科技(深圳)有限公司 | Method for processing business, server and terminal |
- 2019-12-12: CN application CN201911275670.3A filed; granted as CN112992127B, status Active
Non-Patent Citations (1)
Title |
---|
Research on text-to-speech alignment of unlabeled, noisy, long-form Chinese speech transcripts; Zhang Wei et al.; Periodical of Ocean University of China (Natural Science Edition), Vol. 45, No. 10, pp. 121-126 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110764730B (en) | Method and device for playing audio data | |
CN111753784B (en) | Video special effect processing method, device, terminal and storage medium | |
CN111127509B (en) | Target tracking method, apparatus and computer readable storage medium | |
CN112992127B (en) | Voice recognition method and device | |
CN111142838B (en) | Audio playing method, device, computer equipment and storage medium | |
CN111857793B (en) | Training method, device, equipment and storage medium of network model | |
CN111276122B (en) | Audio generation method and device and storage medium | |
CN111613213B (en) | Audio classification method, device, equipment and storage medium | |
CN111681655A (en) | Voice control method and device, electronic equipment and storage medium | |
CN109977570B (en) | Vehicle body noise determination method, device and storage medium | |
CN109547847B (en) | Method and device for adding video information and computer readable storage medium | |
CN108495183B (en) | Method and device for displaying album information | |
CN111192072A (en) | User grouping method and device and storage medium | |
CN112133319B (en) | Audio generation method, device, equipment and storage medium | |
CN110737692A (en) | data retrieval method, index database establishment method and device | |
CN111008083B (en) | Page communication method and device, electronic equipment and storage medium | |
CN108831423B (en) | Method, device, terminal and storage medium for extracting main melody tracks from audio data | |
CN110992954A (en) | Method, device, equipment and storage medium for voice recognition | |
CN112365088B (en) | Method, device and equipment for determining travel key points and readable storage medium | |
CN111399797B (en) | Voice message playing method and device, electronic equipment and storage medium | |
CN112364244B (en) | Multimedia data recommendation method, device, server and storage medium | |
CN112163677B (en) | Method, device and equipment for applying machine learning model | |
CN111145723B (en) | Method, device, equipment and storage medium for converting audio | |
CN113592874B (en) | Image display method, device and computer equipment | |
CN110941458B (en) | Method, device, equipment and storage medium for starting application program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |