CN115547337B - Speech recognition method and related product - Google Patents

Speech recognition method and related product

Info

Publication number
CN115547337B
Authority
CN
China
Prior art keywords
scene
target
pinyin
user
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211487069.2A
Other languages
Chinese (zh)
Other versions
CN115547337A (en)
Inventor
祝明
王曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Renma Interactive Technology Co Ltd
Original Assignee
Shenzhen Renma Interactive Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Renma Interactive Technology Co Ltd
Priority to CN202211487069.2A
Publication of CN115547337A
Application granted
Publication of CN115547337B
Legal status: Active

Classifications

    • G: PHYSICS
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/08: Speech classification or search
                        • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
                        • G10L 15/18: Speech classification or search using natural language modelling
                            • G10L 15/1822: Parsing for meaning understanding
                    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223: Execution procedure of a spoken command
                    • G10L 15/26: Speech to text systems
                    • G10L 15/28: Constructional details of speech recognition systems
                        • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application provides a speech recognition method and related products. In the method, a server calls a human-computer interaction engine to interact with a user through a terminal device and obtains target voice information input by the user during the interaction. The server performs character recognition on the target voice information to obtain a first text, then performs scene recognition and scene associated word extraction on the first text to determine the target service scene and the target scene associated word corresponding to the first text. The target scene associated word is compared, by pinyin, with the scene hot words in the target scene hot word set corresponding to the target service scene to obtain difference scores between the target scene associated word and the scene hot words. The target scene associated word in the first text is then replaced with the target scene hot word that has the highest difference score in the set to obtain a second text, and the corresponding service operation is executed according to the user intention expressed by the second text. The accuracy of speech recognition and the user experience can thereby be improved.

Description

Speech recognition method and related product
Technical Field
The application belongs to the technical field of general data processing in the Internet industry, and in particular relates to a speech recognition method and related products.
Background
With the development of the Internet industry, users increasingly interact by voice with devices such as mobile phones, which provide services based on the voice information the user inputs during the interaction; accurate speech recognition is therefore essential to ensuring that the service matches the user's needs. At present, when vendors perform speech recognition, the results are often inaccurate because the user's pronunciation is non-standard or because homophones exist.
Disclosure of Invention
The application provides a speech recognition method and related products to improve the accuracy of speech recognition and enhance the user experience.
In a first aspect, an embodiment of the present application provides a speech recognition method, which is applied to a server in a speech recognition system, where the speech recognition system includes the server and a terminal device for performing speech interaction with a user, the server includes a human-computer interaction engine supporting human-computer speech interaction, and the method includes:
calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
performing scene recognition on the first text, and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing a service type which is expressed by the first text and needs to be provided;
extracting scene associated words from the first text to obtain target scene associated words corresponding to the first text, wherein the target scene associated words are used for representing the service content of the service type required to be provided and expressed by the first text;
performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
performing pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference score between the target scene associated word and each scene hot word in the set, wherein a scene hot word is a word whose popularity is greater than a popularity threshold, the popularity referring to how often the word is queried across all users;
determining a target scene hot word with the highest difference value score in the target scene hot word set;
replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
determining the user intention expressed by the target voice information according to the second text; and
executing a corresponding service operation according to the determined user intention.
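Taken together, these steps form a pipeline from audio to service operation. The following is a minimal sketch of that pipeline only; every name below (process_utterance, asr, detect_scene, and so on) is an illustrative stand-in injected as a callable, not terminology or an implementation from the patent:

```python
# Hedged sketch of the first-aspect method. Each injected callable stands
# in for a component the patent leaves abstract (ASR model, scene
# classifier, hot-word scorer, intent parser, service executor).

def process_utterance(audio, asr, detect_scene, extract_keyword,
                      query_hotwords, best_hotword, parse_intent, execute):
    first_text = asr(audio)                       # character recognition -> first text
    scene = detect_scene(first_text)              # target service scene
    keyword = extract_keyword(first_text, scene)  # target scene associated word
    hotwords = query_hotwords(scene)              # target scene hot word set
    hotword = best_hotword(keyword, hotwords)     # pinyin comparison, highest score
    second_text = first_text.replace(keyword, hotword)  # corrected second text
    intent = parse_intent(second_text)            # user intention from second text
    return execute(intent)                        # corresponding service operation
```

With trivial stand-ins for each component, a mistranscribed keyword in the first text is corrected before intent parsing, which is the point of the claimed method.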
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, which is applied to a server in a speech recognition system, where the speech recognition system includes a server and a terminal device for performing speech interaction with a user, the server includes a human-computer interaction engine supporting human-computer speech interaction, and the apparatus includes:
the acquisition unit is used for calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
a scene recognition unit, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
the scene associated word extracting unit is used for extracting a scene associated word from the first text to obtain a target scene associated word corresponding to the first text, wherein the target scene associated word is used for representing the service content of the service type required to be provided and expressed by the first text;
the scene hot word set query unit is used for carrying out scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
a comparison unit, configured to perform pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference score between the target scene associated word and each scene hot word in the set, wherein a scene hot word is a word whose popularity is greater than a popularity threshold, the popularity referring to how often the word is queried across all users;
the first determining unit is used for determining a target scene hotword with the highest score of the difference value in the target scene hotword set;
the replacing unit is used for replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
a second determining unit, configured to determine, according to the second text, the user intention expressed by the target voice information; and
a service unit, configured to execute a corresponding service operation according to the determined user intention.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and one or more programs, stored in the memory and configured to be executed by the processor, where the program includes instructions for performing the steps in the method according to the first aspect of the embodiment of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program/instructions are stored, which, when executed by a processor, implement the steps of the method according to the first aspect of the embodiments of the present application.
In a fifth aspect, the present application provides a computer program product, which includes a computer program/instruction, and when executed by a processor, implements the steps of the method according to the first aspect of the present application.
In the embodiment of the application, the server first calls the human-computer interaction engine to interact with the user through the terminal device and obtains the target voice information input by the user during the interaction. It performs character recognition on the target voice information to obtain a first text, performs scene recognition and scene associated word extraction on the first text to determine the target service scene and the target scene associated word corresponding to the first text, and performs pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set corresponding to the target service scene to obtain difference scores between the target scene associated word and the scene hot words. The target scene associated word in the first text is then replaced with the target scene hot word that has the highest difference score in the set to obtain a second text, the user intention expressed by the target voice information is determined according to the second text, and finally the corresponding service operation is executed according to that intention. In this way, the server interacts with the user through the terminal device, performs speech recognition on the voice information input during the interaction to obtain the first text, and corrects the first text by successively performing scene recognition, scene associated word extraction, and pinyin comparison between the scene associated word and the scene hot words, obtaining a corrected second text. This avoids inaccurate recognition results caused by non-standard pronunciation or homophones, which helps improve the accuracy of speech recognition and the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a block diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3a is a schematic flowchart of a process for obtaining a difference score between a target scene related word and a scene hotword in a target scene hotword set according to an embodiment of the present application;
fig. 3b is a schematic diagram of a first server interacting with a terminal device according to an embodiment of the present application;
fig. 3c is a schematic diagram of a second server interacting with a terminal device according to an embodiment of the present application;
fig. 3d is a schematic diagram illustrating interaction between a third server and a terminal device according to an embodiment of the present application;
fig. 3e is a schematic diagram of a fourth server interacting with a terminal device according to an embodiment of the present application;
fig. 3f is a schematic diagram of a fifth server interacting with a terminal device according to an embodiment of the present application;
fig. 4 is a block diagram illustrating functional units of a speech recognition apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating functional units of another speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
First, a system architecture according to an embodiment of the present application will be described.
Referring to fig. 1, fig. 1 is a block diagram of a speech recognition system according to an embodiment of the present disclosure. As shown in fig. 1, the voice recognition system 10 includes a server 11 and a terminal device 12 for performing voice interaction with a user, the server 11 is in communication connection with the terminal device 12, the server 11 includes a human-computer interaction engine supporting human-computer voice interaction, the server 11 interacts with the user through the terminal device 12 by calling the human-computer interaction engine to obtain target voice information input by the user in an interaction process, performs character recognition on the target voice information to obtain a first text, and then analyzes a user intention expressed by the target voice information according to the first text; and executing corresponding service operation according to the determined user intention. The server 11 may be a server, a server cluster composed of a plurality of servers, or a cloud computing service center, and the terminal device 12 may be a mobile phone terminal, a tablet computer, a notebook computer, or the like.
Based on this, the embodiments of the present application provide a speech recognition method, and the following describes the embodiments of the present application in detail with reference to the drawings.
Referring to fig. 2, fig. 2 is a flowchart illustrating a speech recognition method according to an embodiment of the present application. The method is applied to the server 11 in the speech recognition system 10 shown in fig. 1; the speech recognition system 10 includes the server 11 and a terminal device 12 for performing voice interaction with the user, and the server 11 includes a human-computer interaction engine supporting human-computer voice interaction. As shown in fig. 2, the method includes:
step 201, calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; and performing character recognition on the target voice information to obtain a first text.
Step 202, performing scene recognition on the first text, and determining a target service scene corresponding to the first text.
The target service scene is used for representing the type of the service which needs to be provided and is expressed by the first text. The service scene may be, but is not limited to, one of a song listening service scene, a reading service scene, a video service scene, and a navigation service scene.
For example, when the service scenario is a song listening service scenario, the service type may be a song playing service, when the service scenario is a novel reading service scenario, the service type may be a novel pushing service, when the service scenario is a shopping service scenario, the service type may be a commodity pushing service, and when the service scenario is a navigation service scenario, the service type may be a navigation service.
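The patent does not disclose how scene recognition itself is implemented. As an illustration only, it could be reduced to keyword matching; the scene labels and trigger words below are invented placeholders:

```python
# Naive keyword-based scene classifier, a hypothetical stand-in for the
# scene recognition of step 202. Labels and keywords are invented.
SCENE_KEYWORDS = {
    "song_listening": ("listen", "song", "play"),
    "novel_reading": ("read", "novel"),
    "shopping": ("buy", "order"),
    "navigation": ("navigate", "directions"),
}

def detect_scene(first_text):
    text = first_text.lower()
    for scene, keywords in SCENE_KEYWORDS.items():
        # First scene with any matching trigger word wins.
        if any(kw in text for kw in keywords):
            return scene
    return None
```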
Step 203, performing scene associated word extraction on the first text to obtain a target scene associated word corresponding to the first text.
The target scene associated word is used for representing the service content of the service type which is expressed by the first text and needs to be provided.
Illustratively, the service type is a song playing service, the scenario associated word is a name of a person, and the service content is that the server pushes a song associated with the name of the person to the terminal device so as to be played by the terminal device, at this time, the name of the person may be, but is not limited to, one of a singer, a writer, and a composer of the song, and is not specifically limited; the service type is a novel push service, the scene associated word is a name of a person, the service content is that the server pushes novel information associated with the name of the person to the terminal device so as to be displayed by the terminal device, and at this time, the name of the person can indicate one of an author, a recommender and an artist of the novel, but is not limited specifically; the service type is commodity pushing service, the scene associated word is a commodity name, and the service content is that the server pushes commodity information associated with the commodity name to the terminal equipment so as to be displayed by the terminal equipment conveniently; the service type is navigation service, the scene associated word is a place name, and the service content is that the server pushes navigation information associated with the place name to the terminal equipment so that the terminal equipment can conveniently navigate to the place name for the user.
Step 204, querying the scene hot word set according to the target service scene to obtain the target scene hot word set corresponding to the target service scene.
The corresponding relation between various service scenes and the scene hot word set is preset.
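Because the scene-to-hot-word-set correspondence is preset, step 204 can be sketched as a dictionary lookup. The names, counts, and threshold below are invented; building a set from all-user query counts follows the popularity-threshold definition of a scene hot word used in this application:

```python
def build_hotword_set(query_counts, threshold):
    """Keep only words whose all-user query count exceeds the threshold."""
    return {word for word, count in query_counts.items() if count > threshold}

# Hypothetical preset correspondence between service scenes and hot word sets.
SCENE_HOTWORD_SETS = {
    "song_listening": build_hotword_set(
        {"Ashan": 120, "Asan": 45, "obscure-name": 2}, threshold=10),
}

def query_hotword_set(scene):
    # Step 204: look up the target scene hot word set for the target scene.
    return SCENE_HOTWORD_SETS.get(scene, set())
```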
Step 205, performing pinyin comparison on the target scene associated word and the scene hotword in the target scene hotword set to obtain a difference value score between the target scene associated word and the scene hotword in the target scene hotword set.
A scene hot word is a word whose popularity is greater than a popularity threshold; in this application, popularity refers to how often a word is queried across all users. The higher a word's popularity, the more times all users have queried it; conversely, the lower its popularity, the fewer times it has been queried.
It should be noted that the larger the difference score between two vocabularies, the higher the similarity between the two vocabularies, i.e. the more similar the two vocabularies are, whereas the smaller the difference score between the two vocabularies, the lower the similarity between the two vocabularies, i.e. the greater the difference between the two vocabularies is.
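The patent specifies pinyin comparison but not the scoring function itself. A plausible stand-in, assuming pinyin strings are compared by sequence similarity, is sketched below; note the convention above that a higher difference score means the two words are more similar:

```python
from difflib import SequenceMatcher

def difference_score(pinyin_a, pinyin_b):
    # Assumption: score = SequenceMatcher similarity over pinyin strings,
    # in [0, 1]. Higher score = more similar, per the patent's convention.
    return SequenceMatcher(None, pinyin_a, pinyin_b).ratio()
```

Identical pinyin scores 1.0, so words whose pinyin exactly matches the target scene associated word naturally rank first, which matches the exact-match handling of process A below.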
Step 206, determining the target scene hotword with the highest difference value score in the target scene hotword set.
Step 207, replacing the target scene associated word in the first text with the target scene hot word to obtain a second text.
Step 208, determining the user intention expressed by the target voice information according to the second text.
Step 209, executing a corresponding service operation according to the determined user intention.
Note that the target scene hot word set may or may not include a scene hot word identical to the target scene associated word.
In the embodiment of the application, the server first calls the human-computer interaction engine to interact with the user through the terminal device and obtains the target voice information input by the user during the interaction; it performs character recognition on the target voice information to obtain a first text, performs scene recognition and scene associated word extraction on the first text to determine the target service scene and target scene associated word, performs pinyin comparison between the target scene associated word and the scene hot words in the corresponding target scene hot word set to obtain difference scores, replaces the target scene associated word in the first text with the hot word that has the highest difference score to obtain a second text, determines the user intention expressed by the target voice information according to the second text, and finally executes the corresponding service operation according to that intention. The server thus corrects the first text by successively performing scene recognition, scene associated word extraction, and pinyin comparison between the scene associated word and the scene hot words, obtaining a corrected second text. This avoids inaccurate recognition results caused by non-standard pronunciation or homophones, which helps improve the accuracy of speech recognition and the user experience.
For convenience of understanding, a process of obtaining a difference value score between a target scene related word and a scene hotword in a target scene hotword set in the embodiment of the present application will be described below.
Referring to fig. 3a, fig. 3a is a schematic flowchart of a process for obtaining a difference score between the target scene associated word and a scene hot word in the target scene hot word set according to an embodiment of the present application. As shown in fig. 3a, process A for obtaining this difference score includes:
step 301, determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set.
If so, go to step 302.
Step 302, determining whether the number of the first vocabulary is greater than 1.
After step 302, if yes, step 303 is performed.
Step 303, determining whether the user has ever queried any of the first vocabulary items.
After step 303, if yes, go to step 304.
Step 304, determining whether the number of second vocabulary items, that is, the first vocabulary items the user has queried, is greater than 1.
After step 304, if yes, step 305 is performed.
Step 305, determining whether, for every second vocabulary item, the interval between the user's query time and the current time is greater than a preset interval.
The preset interval may be, but is not limited to, 10 days, 15 days, 30 days, etc.
After step 305, if yes, step 306 is executed.
Step 306, determining that, among the second vocabulary items, the scene hot word with the largest query count has the highest difference score.
For example, suppose the preset interval is 10 days and the target scene hot word set is scene hot word set B, which contains four first vocabulary items whose pinyin is completely identical to that of the target scene associated word, denoted here W1, W2, W3, and W4. The user queried W2 11 days before the current time and W4 12 days before the current time, queried W2 five times and W4 once, and has no query records for W1 and W3. In the above process of obtaining the difference score, it is first determined that W1 to W4 in set B share the pinyin of the target scene associated word; their number, 4, is greater than 1; the user has queried W2 and W4, so the number of queried items, 2, is also greater than 1; the intervals between the user's query times for W2 and W4 and the current time, 11 and 12 days, are both greater than the 10-day preset interval; therefore the most-queried item, W2, is determined to have the highest difference score.
For example, with reference to a specific application scenario, fig. 3b is a schematic diagram of interaction between a first server and a terminal device provided in an embodiment of the present application. The server asks the user what help is needed. The terminal device acquires target voice information input by the user: a request to read W2's novel, with the name mistranscribed in the first text. Based on the above flow, the server obtains the second text "I want to read W2's novel", determines from it that the user intends to read W2's novel, and pushes a first page containing a website link to that novel to the terminal device. Preferably, the target scene hot word W2 in the page's prompt message can be highlighted, for example in bold or with a deepened color. The terminal device displays the first page, and the user clicks the link in the page to open the novel for reading.
As an optional branch of process A, if the determination in step 305 is negative, step 307 is executed.
Step 307, determining that, among the second vocabulary items, the scene hot word whose query time is closest to the current time has the highest difference score.
For example, the preset time interval is 5 days, the target scene related word is "ashan", the target scene hot word set is a scene hot word set a, 3 first words "ashan", "arsal" and "argy" identical to the pinyin of "ashan" exist in the scene hot word set a, the query time of the user for "ashan" is 3 days from the current time, the query time of the user for "arsal" is 4 days from the current time, and the query time of the user for "argy" is 7 days from the current time. In the above process of obtaining the difference score between the target scene related word and the scene hot word in the target scene hot word set, it is determined that "ashan", "arse" and "arje" which are identical to the pinyin of "ashan" exist in the scene hot word set a, then the number 3 of "ashan", "arse" and "arje" is determined to be greater than 1, then it is determined that the user has performed queries on "ashan", "arse" and "arje", respectively, then it is determined that the number 3 of "ashan", "arse" and "arje" which have been queried is determined to be greater than 1, then it is determined that the query time of the user to ashan is less than 5 days from the current time for 3 days, and the query time of the user to asha is less than 5 days from the current time for 4 days, and then it is determined that the difference score of "ashan" in which the query time interval between the query time of the user to ashan "and the current time is the shortest. 
Illustratively, in conjunction with a specific application scenario, the server asks the user: asking what service is needed. If the terminal device obtains the target voice information input by the user: I want to listen to the songs of Ashan, the server obtains a second text "I want to listen to the songs of Ashan" based on the above flow, and determines the subsequent reply according to the second text: playing the song of Ashan for the user. Meanwhile, the server can push the song of Ashan to the terminal device, so that the terminal device can conveniently play it.
As another optional flow branch, after the step 302, if the number of the first vocabulary is equal to 1, the flow a further includes determining that the difference score of the first vocabulary is the highest.
For example, the target scene related word is "Argy", the target scene hotword set is scene hotword set C, and one first word "Argos" identical to the pinyin of "Argy" exists in set C. Based on the above process of obtaining the difference score between the target scene related word and the scene hotwords in the target scene hotword set, it is first determined that the first word "Argos" identical to the pinyin of "Argy" exists in set C; and then, since the number 1 of "Argos" is equal to 1, it is determined that the difference score of "Argos" is the highest. For example, referring to fig. 3c in combination with a specific application scenario, fig. 3c is a schematic diagram of interaction between a second server and a terminal device provided in an embodiment of the present application, where the server asks the user: asking what service is needed. If the terminal device obtains the target voice information input by the user: I want to listen to the song of Argy, the server obtains a second text "I want to listen to the song of Argos" based on the above flow, determines according to the second text that the intention of the user is to listen to the song of Argos, and pushes a second page to the terminal device, where the second page includes a user prompt message like "the song of Argos will be played for you". Preferably, the target scene hotword "Argos" in the user prompt message can be highlighted, such as bolded, color-deepened or font-enlarged; the terminal device displays the second page; then, the server pushes the song of Argos to the terminal device; and the terminal device plays the song of Argos pushed by the server.
As another optional flow branch, after step 303, if the user has not queried the first vocabulary, the flow A further includes determining that the scene hotword with the highest heat in the first vocabulary has the highest difference score.
For example, the target scene related word is "Assan", the target scene hotword set is scene hotword set D, two first words "Arsal" and "Argy" identical to the pinyin of "Assan" exist in set D, the user has not performed a query on "Arsal" or "Argy", the heat of "Arsal" is 1113, and the heat of "Argy" is 6001. In the above process of obtaining the difference score between the target scene related word and the scene hotwords in the target scene hotword set, it is first determined that "Arsal" and "Argy" identical to the pinyin of "Assan" exist in set D; next, their number 2 is determined to be greater than 1; next, it is determined that the user has not performed a query on "Arsal" or "Argy"; and finally, it is determined that the difference score of "Argy", the first word with the highest heat, is the highest. For example, referring to fig. 3d in combination with a specific application scenario, fig. 3d is a schematic view of interaction between a third server and a terminal device provided in an embodiment of the present application, where the target voice information acquired by the terminal device and input by the user is: I want to see the show of Assan. The server obtains a second text "I want to see the show of Argy" based on the above flow, determines according to the second text that the intention of the user is to watch the show performed by Argy, and pushes a third page to the terminal device, the third page including user query information like "whether to play the show of Argy" together with a first button "yes" and a second button "no", with the target scene hotword "Argy" in the user query information highlighted with an enlarged font; the terminal device displays the third page; thereafter, if the user clicks the first button "yes", the server pushes the TV show performed by Argy to the terminal device.
As another optional flow branch, after the step 304, if the number of the second vocabulary is equal to 1, the flow A further includes determining that the difference score of the second vocabulary is the highest.
For example, the target scene related word is "Alashan", the target scene hotword set is scene hotword set E, five first words "Alashan", "Alasan", "Argos", "Arshan" and "Ashan" identical to the pinyin of "Alashan" exist in set E, and the user has once performed a query only on "Alashan". In the above process of obtaining the difference score between the target scene related word and the scene hotwords in the target scene hotword set, it is first determined that the five first words identical to the pinyin of "Alashan" exist in set E; next, their number 5 is determined to be greater than 1; next, it is determined that the user has performed a query only on "Alashan"; and finally, since the number 1 of the queried second vocabulary "Alashan" is equal to 1, it is determined that the difference score of "Alashan" is the highest. For example, referring to fig. 3e in combination with a specific application scenario, fig. 3e is a schematic view of interaction between a fourth server and a terminal device provided in an embodiment of the present application, where if the terminal device obtains the target voice information input by the user: what is the dialect of Alashan, the server obtains a second text "what is the dialect of Alashan" based on the above flow, determines according to the second text that the intention of the user is to know the dialect of Alashan, and thereby pushes a fourth page including the dialect information of Alashan to the terminal device; the terminal device displays the fourth page, so that the user can conveniently check the dialect of Alashan.
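The branching logic of flow A above (steps 302 through 307) can be sketched as a single selection function. This is a hypothetical illustration, not the patent's implementation: the `Hotword` data shape, the 5-day window default, and the tie-breaking order are assumptions reconstructed from the examples above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hotword:
    word: str
    heat: int                        # query heat of the hotword
    days_since_query: Optional[int]  # None if this user never queried it

def best_hotword(candidates: list[Hotword], window_days: int = 5) -> str:
    # Step 302 branch: a single homophone ("first vocabulary" of size 1) wins.
    if len(candidates) == 1:
        return candidates[0].word
    queried = [c for c in candidates if c.days_since_query is not None]
    # Step 303 branch: nothing ever queried -> highest heat wins.
    if not queried:
        return max(candidates, key=lambda c: c.heat).word
    # Step 304 branch: exactly one queried word ("second vocabulary") wins.
    if len(queried) == 1:
        return queried[0].word
    # Steps 305-307: prefer queries inside the preset window; most recent wins.
    recent = [c for c in queried if c.days_since_query < window_days]
    pool = recent or queried
    return min(pool, key=lambda c: c.days_since_query).word
```

With the numbers from the examples above, the queried-within-5-days case returns "Ashan" (3 days) and the never-queried case returns "Argy" (heat 6001).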
As another optional flow branch, after step 301, if the first vocabulary does not exist, the flow A further includes: performing pinyin replacement on the pinyin of the target scene associated word to obtain a replaced pinyin; and comparing the replaced pinyin with the scene hotwords in the target scene hotword set to obtain the difference score between the target scene associated word and the scene hotwords in the target scene hotword set.
For example, the target scene related word is "A San", the target scene hotword set is scene hotword set F, and no first word identical to the pinyin of "A San" exists in set F. Based on the above process of obtaining the difference score between the target scene associated word and the scene hotwords in the target scene hotword set, it is first determined that no first word identical to the pinyin of "A San" exists in set F; the pinyin "ā sān" of "A San" is then subjected to pinyin replacement to obtain a replaced pinyin, for example "ā shān"; and "ā shān" is compared with the scene hotwords in set F to obtain the difference score between "A San" and the scene hotwords in set F.
In this example, the server can accurately determine the target scene hot word with the highest difference value score in the target scene hot word set by combining the scene hot word in the target scene hot word set and the pinyin of the target scene associated word, and the query time and the query number of the user on the scene hot word in the target scene hot word set, so that the accuracy of the target scene hot word is improved.
In a possible example, the implementation manner of performing pinyin replacement on the pinyin of the target scene related word to obtain the replaced pinyin includes but is not limited to: determining the native place and/or living address of the user; determining the pronunciation features corresponding to the native place and/or the living address; determining, according to the pronunciation features, the number of pinyins that can be replaced among the pinyins corresponding to the target scene associated word; and if the number of pinyins is greater than 1, performing pinyin replacement in sequence according to the order of appearance, in the target scene associated word, of each character requiring pinyin replacement, so as to obtain a plurality of replaced pinyins.
The living address may include an address where the user currently lives and has lived for more than a first preset time, and the first preset time may be one year, two years, half a year, and the like. Illustratively, the first preset time is two years; if the user has currently lived in place A for 4 years, place A is a living address of the user.
In addition, the living address may further include an address where the user has previously lived for more than a second preset time, which may be two years, three years, five years, and so on; the first preset time and the second preset time may be the same. Preferably, the second preset time is longer than the first preset time. Illustratively, the first preset time is one year and the second preset time is five years; if the user has currently lived in place A for two years, once lived in place B for 3 years, and once lived in place C for 6 years, then place A and place C are both living addresses of the user.
It is understood that there may be at least one living address, and the native place and the living addresses may or may not coincide. For example, the native place of the user is place A, and the living addresses of the user may be place A and place B; or the native place of the user is place A, and the living addresses of the user may be place B and place C.
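The two preset times above can be sketched as a small filter over a residence history. This is a minimal illustration; the tuple layout, field names and year thresholds are assumptions, not the patent's data model.

```python
def living_addresses(history, first_preset_years=1, second_preset_years=5):
    """history: list of (place, years_lived, is_current_residence) tuples.

    The current residence qualifies after the first preset time; past
    residences qualify only after the (typically longer) second preset time.
    """
    addresses = []
    for place, years, is_current in history:
        threshold = first_preset_years if is_current else second_preset_years
        if years > threshold:
            addresses.append(place)
    return addresses
```

With the illustrative history above (currently in place A for two years, formerly in place B for 3 years and place C for 6 years), this yields places A and C.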
For example, the target scene related word is "Zou San", the target scene hotword set is scene hotword set G, no first word identical to the pinyin of "Zou San" exists in set G, the native place of the user is place A, and the pronunciation feature of place A is that flat-tongue and retroflex initials are not distinguished. In specific implementation, after determining that no first vocabulary completely identical to the pinyin of "Zou San" exists in scene hotword set G, the native place A of the user is determined; pronunciation feature 1 of place A, namely no distinction between flat-tongue and retroflex initials, is determined; according to pronunciation feature 1, it is determined that the pinyins "zōu" and "sān" corresponding to "Zou San" can both be pinyin-replaced; the number of replaceable pinyins in "Zou San" is 2, and 2 is greater than 1; pinyin replacement is then carried out in sequence according to the order of appearance of the characters "Zou" and "San" requiring pinyin replacement in "Zou San", obtaining a plurality of replaced pinyins "zhōu sān", "zōu shān" and "zhōu shān".
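The sequential replacement above can be sketched as enumerating syllable substitutions in character order. The flat/retroflex merger table and the enumeration strategy are assumptions for illustration, and tone marks are omitted for simplicity.

```python
from itertools import product

# Assumed flat-tongue / retroflex merger pairs (z/zh, c/ch, s/sh).
FLAT_RETROFLEX = {"z": "zh", "c": "ch", "s": "sh"}

def alternatives(syllables):
    """Return every pinyin string with at least one syllable replaced."""
    options = []
    for syl in syllables:
        alts = [syl]
        for flat, retro in FLAT_RETROFLEX.items():
            if syl.startswith(retro):           # retroflex -> flat
                alts.append(flat + syl[len(retro):])
            elif syl.startswith(flat):          # flat -> retroflex
                alts.append(retro + syl[len(flat):])
        options.append(alts)
    combos = {" ".join(c) for c in product(*options)}
    combos.discard(" ".join(syllables))         # drop the unmodified pinyin
    return sorted(combos)
```

For "zou san" this yields the three replaced pinyins "zhou san", "zhou shan" and "zou shan", matching the example above.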
For another example, the target scene related word is "counting-period meal", the target scene hotword set is scene hotword set H, no first word completely identical to the pinyin of "counting-period meal" exists in set H, the living address of the user is place B, and the pronunciation features of place B are that "go" is pronounced "[pinyin image]" and "eating rice" is pronounced "qī fàn". In specific implementation, after determining that no first word completely identical to the pinyin of "counting-period meal" exists in scene hotword set H, the living address B of the user is determined; pronunciation feature 2 of place B is determined; according to pronunciation feature 2, it is determined that the two pinyins "[pinyin image]" and "[pinyin image]" corresponding to "counting-period meal" can both be pinyin-replaced; the number of replaceable pinyins is 2, and 2 is greater than 1; pinyin replacement is then carried out in sequence according to the order of appearance of the characters "counting" and "period" requiring pinyin replacement in "counting-period meal", obtaining a plurality of replaced pinyins "[pinyin image]", "[pinyin image]" and "[pinyin image]".
For another example, the target scene related word is "word booming", the target scene hotword set is scene hotword set I, no first word identical to the pinyin of "word booming" exists in set I, the native place of the user is place A, the living addresses of the user are place A and place B, the pronunciation feature of place A is that flat-tongue and retroflex initials are not distinguished, and the pronunciation feature of place B is that the pinyin "[pinyin image]" of "wild" is pronounced "[pinyin image]". In specific implementation, after determining that no first vocabulary completely identical to the pinyin of "word booming" exists in scene hotword set I, the native place A and the living addresses A and B of the user are determined; pronunciation feature 1 of place A (no distinction between flat-tongue and retroflex initials) and pronunciation feature 3 of place B (the pinyin "[pinyin image]" of "wild" pronounced "[pinyin image]") are determined; according to pronunciation features 1 and 3, it is determined that the pinyins "[pinyin image]" and "[pinyin image]" corresponding to "word booming" can both be pinyin-replaced; the number of replaceable pinyins is 2, and 2 is greater than 1; pinyin replacement is then carried out in sequence according to the order of appearance of the characters "word" and "boom" requiring pinyin replacement in "word booming", obtaining a plurality of replaced pinyins "[pinyin image]", "[pinyin image]" and "[pinyin image]".
In this example, when the server performs pinyin replacement on the pinyin of the target scene associated word, the server can accurately determine the pronunciation characteristics of the user based on the native place and/or the living address of the user, perform pinyin replacement on each pinyin corresponding to the target scene associated word according to the pronunciation characteristics to obtain a replaced pinyin, ensure that the replaced pinyin conforms to the pronunciation habits of the user, and further improve the reliability of the replaced pinyin.
In a possible example, the implementation manner of comparing the replaced pinyin with the scene hotword in the target scene hotword set to obtain a difference value score between the target scene associated word and the scene hotword in the target scene hotword set may include, but is not limited to:
step A1, determining whether a target alternative pinyin which is completely the same as the pinyin of the scene hotword in the target scene hotword set exists in the multiple alternative pinyins.
After step A1, if present, step A2 is performed.
And A2, determining the number of the target alternative pinyins.
After the step A2, if the number of the target alternative pinyins is 1, the step A3 is executed.
And A3, determining that the difference value score of the scene hot word corresponding to the target alternative pinyin is highest.
For example, the target scene related word is "word booming", the target scene hotword set is scene hotword set I, and the plurality of alternative pinyins determined for "word booming" are "[pinyin image 1]", "[pinyin image 2]" and "[pinyin image 3]", where "[pinyin image 2]" is identical to the pinyin of a scene hotword in set I, and the scene hotword corresponding to "[pinyin image 2]" is "Late Wind". In specific implementation, it is first determined that, among the plurality of alternative pinyins, "[pinyin image 2]" is identical to the pinyin of a scene hotword in scene hotword set I; then, based on the number of "[pinyin image 2]" being 1, it is determined that the difference score of "Late Wind" is the highest. For example, referring to fig. 3f in combination with a specific application scenario, fig. 3f is a schematic view of interaction between a fifth server and a terminal device provided in an embodiment of the present application, where the target voice information acquired by the terminal device and input by the user is: play the music "Word Booming". The server obtains a second text "play the music Late Wind", determines according to the second text that the intention of the user is to listen to the song "Late Wind", and pushes a fifth page to the terminal device, where the fifth page includes user query information similar to "whether to play the song Late Wind" together with a first button "yes" and a second button "no", and the target scene hotword "Late Wind" in the user query information is in a bold font relative to other words; the terminal device displays the fifth page; then, if the user clicks the first button "yes", the server pushes the song "Late Wind" to the terminal device; and the terminal device plays the song "Late Wind" pushed by the server.
As an optional branch, after the step A1, if no target alternative pinyin identical to the pinyin of any scene hotword in the target scene hotword set exists among the plurality of alternative pinyins, prompt information is generated and sent to the terminal device to prompt the user that the user's intention has not been recognized.
As an optional branch, after step A2, if the number of the target alternative pinyins is at least two, step A4 is executed.
And A4, calculating the difference score of the scene hotword corresponding to each target alternative pinyin according to the number of replaced pinyins in the target alternative pinyin and the number of times the user has used, or the heat of, the scene hotword corresponding to the target alternative pinyin.
For example, the target scene related word is "Zou San", the target scene hotword set is scene hotword set G, and the plurality of alternative pinyins determined for "Zou San" are "zhōu sān", "zhōu shān" and "zōu shān", of which "zhōu shān" and "zōu shān" are identical to the pinyins of scene hotwords in set G; the scene hotwords corresponding to "zhōu shān" are "Zhou Shan-1" and "Zhou Shan-2" (two different words sharing that pinyin), and the scene hotword corresponding to "zōu shān" is "Zou Shan". In specific implementation, it is first determined that, among the plurality of alternative pinyins "zhōu sān", "zhōu shān" and "zōu shān", the pinyins "zhōu shān" and "zōu shān" are identical to the pinyins of scene hotwords in scene hotword set G; then, based on the number of "zhōu shān" and "zōu shān" being 2, the difference scores of "Zhou Shan-1", "Zhou Shan-2" and "Zou Shan" are calculated respectively according to the number of replaced pinyins in "zhōu shān" and "zōu shān" and the number of times the user has used, or the heat of, "Zhou Shan-1", "Zhou Shan-2" and "Zou Shan".
In this example, the server can determine the difference value score of the scene hotword in the target scene hotword set by combining the alternative pinyin and the pinyin of the scene hotword in the target scene hotword set, so that the convenience and the intelligence for obtaining the difference value score are improved.
Specifically, the implementation manner of step A4 may include, but is not limited to:
and step B1, determining the first pinyin with the least replaced pinyin in the target replaced pinyins, and determining whether the number of the first pinyins is more than 1.
After the step B1, if the number of the first pinyin is larger than 1, the step B2 is executed.
And B2, determining whether the user uses the scene hot word corresponding to the first pinyin.
After the step B2, if the user uses the scene hotword corresponding to the first pinyin, the step B3 is executed.
And B3, determining whether the number of third words used by the user in the scene hot words corresponding to the first pinyin is greater than 1.
After the step B3, if the number of the third vocabulary is greater than 1, the step B4 is executed.
And step B4, determining that the difference value score of the scene hot word with the highest use frequency or the highest heat degree in the third vocabulary is highest.
For example, the target scene related word is "Zou San", the pinyin of "Zou San" is "zōu sān", the target scene hotword set is scene hotword set J, and the plurality of alternative pinyins of "Zou San" are "zhōu sān", "zhōu shān" and "zōu shān", among which the target alternative pinyins "zhōu sān" and "zōu shān" are identical to the pinyins of scene hotwords in set J, and the number of "zhōu sān" and "zōu shān" is determined to be 2; the scene hotword corresponding to "zhōu sān" is "Wednesday", and the scene hotwords corresponding to "zōu shān" are two different words, denoted "Zou Shan-1" and "Zou Shan-2"; the user has used "Wednesday" twice, "Zou Shan-1" 5 times and "Zou Shan-2" 11 times; the heat of "Wednesday" is 301, the heat of "Zou Shan-1" is 8032, and the heat of "Zou Shan-2" is 26. In specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zhōu sān" and "zōu shān" (each with one replaced pinyin); next, the number 2 of the first pinyins is determined to be greater than 1; next, it is determined that the third words, i.e., the scene hotwords corresponding to the first pinyins that the user has used, are "Wednesday", "Zou Shan-1" and "Zou Shan-2"; and, based on the number 3 of the third words being greater than 1, it is determined that the difference score of "Zou Shan-2", the third word with the highest number of uses, is the highest, or it is determined that the difference score of "Zou Shan-1", the third word with the highest heat, is the highest.
As an optional branch, after step B3, if the number of the third vocabulary is equal to 1, step B5 is performed.
And step B5, determining that the difference value score of the third vocabulary is highest.
For example, the target scene related word is "Zou San", the pinyin of "Zou San" is "zōu sān", the target scene hotword set is scene hotword set K, and the plurality of alternative pinyins of "Zou San" are "zhōu sān", "zhōu shān" and "zōu shān", among which the target alternative pinyins "zhōu sān" and "zōu shān" are identical to the pinyins of scene hotwords in set K, and their number is determined to be 2; the scene hotword corresponding to "zhōu sān" is "Wednesday", the scene hotwords corresponding to "zōu shān" are "Zou Shan-1" and "Zou Shan-2", the user has used "Zou Shan-2" 11 times, and the user has not used "Wednesday" or "Zou Shan-1". In specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zhōu sān" and "zōu shān"; next, the number 2 of the first pinyins is determined to be greater than 1; next, it is determined that the third word, i.e., the scene hotword corresponding to the first pinyins that the user has used, is "Zou Shan-2"; and, based on the number 1 of the third word being equal to 1, it is determined that the difference score of "Zou Shan-2" is the highest.
As another optional branch, after step B2, if the user does not use the scene hotword corresponding to the first pinyin, step B6 is executed.
And B6, determining that the difference value score of the scene hot word with the highest heat in the scene hot words corresponding to the first pinyin is the highest.
For example, the target scene related word is "Zou San", the pinyin of "Zou San" is "zōu sān", the target scene hotword set is scene hotword set K, and the plurality of alternative pinyins of "Zou San" are "zhōu sān", "zhōu shān" and "zōu shān", among which the target alternative pinyins "zhōu sān" and "zōu shān" are identical to the pinyins of scene hotwords in set K, and their number is determined to be 2; the scene hotword corresponding to "zhōu sān" is "Wednesday", the scene hotwords corresponding to "zōu shān" are "Zou Shan-1" and "Zou Shan-2", the user has used none of "Wednesday", "Zou Shan-1" and "Zou Shan-2", the heat of "Wednesday" is 301, the heat of "Zou Shan-1" is 8032, and the heat of "Zou Shan-2" is 26. In specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zhōu sān" and "zōu shān"; next, the number 2 of the first pinyins is determined to be greater than 1; and then, based on the user not having used any of the scene hotwords "Wednesday", "Zou Shan-1" and "Zou Shan-2" corresponding to the first pinyins, it is determined that the difference score of "Zou Shan-1", the scene hotword with the highest heat, is the highest.
As still another optional branch, after step B1, if the number of the first pinyins is equal to 1, step B7 is performed.
And B7, determining that the difference value score of the scene hot word corresponding to the first pinyin is highest.
For example, the target scene related word is "Zou San", the pinyin of "Zou San" is "zōu sān", the target scene hotword set is scene hotword set G, and the plurality of alternative pinyins of "Zou San" are "zhōu sān", "zhōu shān" and "zōu shān", among which the target alternative pinyins "zhōu shān" and "zōu shān" are identical to the pinyins of scene hotwords in set G, and their number is determined to be 2; the scene hotwords corresponding to "zhōu shān" are "Zhou Shan-1" and "Zhou Shan-2", and the scene hotword corresponding to "zōu shān" is "Zou Shan". In specific implementation, it is first determined that the first pinyin with the fewest replaced pinyins among "zhōu shān" (two replaced pinyins) and "zōu shān" (one replaced pinyin) is "zōu shān"; and then, based on the number of "zōu shān" being equal to 1, it is determined that the difference score of "Zou Shan", the scene hotword corresponding to "zōu shān", is the highest.
In this example, the server can calculate the difference value score of the scene hot word corresponding to the target replacement pinyin according to the number of the replaced pinyins in the target replacement pinyin and the number of times of use or the popularity of the scene hot word corresponding to the target replacement pinyin by the user when the target replacement pinyin is at least two, so that the comprehensiveness and the accuracy of the determination of the difference value score of the scene hot word are improved.
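The tie-breaking cascade of steps B1 through B7 can be sketched at hotword granularity as follows. The tuple shape is an assumption, and where step B4 allows either use count or heat to decide, this sketch arbitrarily prefers use count.

```python
def pick_hotword(matches):
    """matches: list of (replaced_count, hotword, user_uses, heat) tuples,
    one per scene hotword whose pinyin matched a target alternative pinyin."""
    fewest = min(m[0] for m in matches)          # step B1: fewest replacements
    pool = [m for m in matches if m[0] == fewest]
    if len(pool) == 1:                           # step B7: a single first pinyin
        return pool[0][1]
    used = [m for m in pool if m[2] > 0]         # step B2: any user usage?
    if used:                                     # steps B3-B5: most-used wins
        return max(used, key=lambda m: m[2])[1]
    return max(pool, key=lambda m: m[3])[1]      # step B6: hottest wins
```

With the use counts and heats from the "Zou San" examples above, the used case selects the hotword with 11 uses and the never-used case selects the hotword with heat 8032.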
In one possible example, before the determining of the native place and/or living address of the user, the method further includes: obtaining the Mandarin level of the user; and determining that the Mandarin level does not reach a preset level.
The preset level may be, for example, first-level B, first-level A, or second-level A; the preset level may be set as desired.
Furthermore, the method further includes: after obtaining the Mandarin level of the user, if it is determined that the Mandarin level reaches the preset level, generating prompt information and sending the prompt information to the terminal device to prompt the user that the user's intention has not been recognized.
Therefore, in the example, when the server performs pinyin replacement on the pinyin of the target scene associated word, the server can accurately determine the pronunciation characteristics of the user based on the mandarin level and the native place and/or the living address of the user, perform pinyin replacement on each pinyin corresponding to the target scene associated word according to the pronunciation characteristics to obtain the replaced pinyin, ensure that the replaced pinyin is more in line with the pronunciation habits of the user, and further improve the reliability of the replaced pinyin.
It can be understood that, since the method embodiment and the apparatus embodiment are different presentation forms of the same technical concept, the content of the method embodiment portion in the present application should be synchronously adapted to the apparatus embodiment portion, and is not described herein again.
Consistent with the above embodiments, fig. 4 is a block diagram of the functional units of a speech recognition apparatus according to an embodiment of the present application. In fig. 4, the speech recognition apparatus 400 is applied to a server in a speech recognition system, the speech recognition system includes the server and a terminal device through which a user performs speech interaction, and the server includes a human-computer interaction engine supporting human-computer speech interaction. The speech recognition apparatus 400 includes:
an obtaining unit 401, configured to invoke the human-computer interaction engine to interact with the user through the terminal device, obtain target voice information input by the user during the interaction, and perform character recognition on the target voice information to obtain a first text;
a scene recognition unit 402, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
a scene associated word extracting unit 403, configured to perform scene associated word extraction on the first text to obtain a target scene associated word corresponding to the first text, where the target scene associated word is used to represent service content of the service type that needs to be provided and is expressed by the first text;
a scene hot word set query unit 404, configured to perform scene hot word set query according to the target service scene, to obtain a target scene hot word set corresponding to the target service scene;
a comparison unit 405, configured to perform pinyin comparison on the target scene associated word and the scene hotword in the target scene hotword set to obtain a difference score between the target scene associated word and the scene hotword in the target scene hotword set, where the scene hotword is a word whose query hotness is greater than a hotness threshold;
a first determining unit 406, configured to determine a target scene hotword with a highest difference score in the target scene hotword set;
a replacing unit 407, configured to replace the target scene associated word in the first text with the target scene hotword to obtain a second text;
a second determining unit 408, configured to determine, according to the second text, a user intention expressed by the target speech information;
and the service unit 409 is configured to execute a corresponding service operation according to the determined user intention.
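Taken together, units 401 through 409 form a simple pipeline. The following Python sketch wires placeholder callables in that order; every helper name and signature is invented for illustration, and only the control flow is taken from the text above.

```python
# End-to-end sketch of the unit pipeline above (401-409). Each comment maps
# a line to the unit it stands in for; the helpers are caller-supplied stubs.

def recognize(audio, asr, classify, extract, hotword_sets, score_fn, nlu, services):
    first_text = asr(audio)                          # obtaining unit (401)
    scene = classify(first_text)                     # scene recognition unit (402)
    assoc = extract(first_text)                      # associated word extraction (403)
    hotwords = hotword_sets[scene]                   # hotword set query unit (404)
    # Comparison unit (405): score every scene hotword against the word.
    scores = {w: score_fn(assoc, w) for w in hotwords}
    target = max(scores, key=scores.get)             # first determining unit (406)
    second_text = first_text.replace(assoc, target)  # replacing unit (407)
    intent = nlu(second_text)                        # second determining unit (408)
    return services[intent](second_text)             # service unit (409)
```

In a deployment, `score_fn` would be the pinyin difference value scoring and `nlu` the intention recognition of the human-computer interaction engine; here they are trivial stand-ins.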
In the case of using an integrated unit, as shown in fig. 5, fig. 5 is a block diagram of functional units of another speech recognition apparatus provided in an embodiment of the present application. In fig. 5, a speech recognition apparatus 510 includes: a processing module 512 and a communication module 511.
The processing module 512 is configured to invoke the human-computer interaction engine through the communication module 511 to interact with the user through the terminal device, and acquire target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text; performing scene recognition on the first text, and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing a service type which is expressed by the first text and needs to be provided; extracting scene associated words from the first text to obtain target scene associated words corresponding to the first text, wherein the target scene associated words are used for representing the service content of the service type required to be provided and expressed by the first text; performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene; performing pinyin comparison on the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set, wherein the scene hot words are vocabularies with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the vocabularies in all users; determining a target scene hot word with the highest difference value score in the target scene hot word set; replacing the target scene associated words in the first text with the target scene hot words to obtain a second text; determining the user intention expressed by the target voice information according to the second text; and executing corresponding service operation according to the determined user intention. 
For example, the processing module 512 performs the steps of the obtaining unit 401, the scene recognition unit 402, the scene associated word extracting unit 403, the scene hotword set querying unit 404, the comparing unit 405, the first determining unit 406, the replacing unit 407, the second determining unit 408, and the service unit 409, and/or other processes of the techniques described herein. The communication module 511 is used to support interaction between the speech recognition apparatus 510 and other devices. As shown in fig. 5, the speech recognition apparatus 510 may further include a storage module 513 for storing the program codes and data of the speech recognition apparatus 510.
The processing module 512 may be a processor or a controller, for example, a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or of a DSP and a microprocessor. The communication module 511 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 513 may be a memory.
For all relevant details of the scenarios involved in the above method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here. The speech recognition apparatus 510 can perform the speech recognition method shown in fig. 2.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are generated in whole or in part when a computer instruction or a computer program is loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media. The semiconductor medium may be a solid state disk.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, electronic device 600 may include one or more of the following components: a processor 601, a memory 602 coupled to the processor 601, wherein the memory 602 may store one or more programs, and the one or more programs may be configured to implement the methods described in the embodiments as described above when executed by the one or more processors 601. The electronic device 600 may be a server in the voice recognition system described above.
Processor 601 may include one or more processing cores. The processor 601 connects various parts of the electronic device 600 using various interfaces and lines, and performs the functions of the electronic device 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 602 and by invoking data stored in the memory 602. Alternatively, the processor 601 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 601 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may not be integrated into the processor 601 and may instead be implemented by a separate communication chip.
The Memory 602 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 602 may be used to store instructions, programs, code sets, or instruction sets. The memory 602 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The storage data area may also store data created during use by the electronic device 600, and the like.
It is understood that the electronic device 600 may include more or fewer structural elements than shown in the above block diagram, for example, a power module, physical buttons, a Wireless Fidelity (Wi-Fi) module, a speaker, a Bluetooth module, sensors, and the like, without limitation.
Embodiments of the present application also provide a computer storage medium storing a computer program/instructions which, when executed by a processor, implement some or all of the steps of any of the methods described in the above method embodiments.
An embodiment of the present application further provides a computer program product, which includes a computer program/instruction, and when executed by a processor, the computer program/instruction implements the steps of the method according to the first aspect of the embodiment of the present application.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and system may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical function division, and there may be other division manners in actual implementation; various elements or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, or volatile or non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
Although the present invention is disclosed above, it is not limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and implementations with different functions, combinations of implementation steps, and software and hardware realizations all fall within the scope of the present invention.

Claims (9)

1. A voice recognition method, applied to a server in a voice recognition system, wherein the voice recognition system comprises the server and a terminal device for voice interaction with a user, and the server comprises a human-computer interaction engine supporting human-computer voice interaction, the method comprising the following steps:
calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
performing scene recognition on the first text, and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing a service type which is expressed by the first text and needs to be provided;
extracting scene associated words from the first text to obtain target scene associated words corresponding to the first text, wherein the target scene associated words are used for representing the service content of the service type required to be provided and expressed by the first text;
performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
performing pinyin comparison on the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set, wherein the scene hot words are vocabularies with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the vocabularies in all users;
determining a target scene hot word with the highest difference value score in the target scene hot word set;
replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
determining the user intention expressed by the target voice information according to the second text; and
executing corresponding service operation according to the determined user intention;
the obtaining of the difference value score between the target scene associated word and the scene hotword in the target scene hotword set by performing pinyin comparison between the target scene associated word and the scene hotword in the target scene hotword set includes:
determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set;
if yes, determining whether the number of the first vocabulary is larger than 1;
if yes, determining whether the user has previously queried the first vocabulary;
if yes, determining whether the number of the second words inquired in the first words is larger than 1;
if yes, determining whether the time interval between the query time of the user for each second vocabulary and the current time is greater than a preset interval;
if yes, determining that the difference value score of the scene hot word with the largest query frequency in the second vocabulary is the highest;
if not, determining that the difference value score of the scene hot word with the shortest time interval between the query time and the current time in the second vocabulary is the highest.
2. The method according to claim 1, wherein after determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set, if the first vocabulary does not exist, performing pinyin replacement on the pinyin of the target scene associated word to obtain a replaced pinyin, and comparing the replaced pinyin with the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set; and
after determining whether the number of the first vocabulary is greater than 1, if the number of the first vocabulary is equal to 1, determining that the difference value score of the first vocabulary is the highest; and
after determining whether the user has previously queried the first vocabulary, if the user has not queried the first vocabulary before, determining that the difference value score of the scene hot word with the highest heat degree in the first vocabulary is the highest; and
after determining whether the number of the second vocabulary queried in the first vocabulary is greater than 1, if the number of the second vocabulary is equal to 1, determining that the difference value score of the second vocabulary is the highest.
3. The method of claim 2, wherein the performing pinyin replacement on the pinyin for the target scene associated word to obtain the replaced pinyin comprises:
determining a native and/or living address of the user;
determining the native place and/or the pronunciation characteristics corresponding to the life address;
determining the number of pinyin capable of being subjected to pinyin replacement in each pinyin corresponding to the target scene associated word according to the pronunciation characteristics;
and if the pinyin number is larger than 1, sequentially performing pinyin replacement according to the appearance sequence of each character needing pinyin replacement in the target scene associated word to obtain a plurality of replaced pinyins.
4. The method according to claim 3, wherein the comparing the replaced pinyin with the scene hotwords in the target scene hotword set to obtain a difference score between the target scene associated word and the scene hotwords in the target scene hotword set comprises:
determining whether a target alternative pinyin which is identical to the pinyin of the scene hot word in the target scene hot word set exists in the multiple alternative pinyins;
if yes, determining the number of the target alternative pinyin;
if the number of the target alternative pinyins is 1, determining that the difference value score of the scene hotword corresponding to the target alternative pinyins is the highest;
and if the number of the target alternative pinyins is at least two, calculating the difference value score of the scene hot word corresponding to each target alternative pinyins according to the number of the replaced pinyins in the target alternative pinyins and the use times or the heat degree of the scene hot word corresponding to the target alternative pinyins by the user.
5. The method of claim 4, wherein the calculating a difference score of the scene hotword corresponding to each target replacement pinyin according to the number of the replaced pinyins in the target replacement pinyin and the number of times or the degree of hotness of the user for using the scene hotword corresponding to the target replacement pinyin comprises:
determining a first pinyin with the least replaced pinyin in the target replaced pinyins, and determining whether the number of the first pinyins is more than 1;
if the number of the first pinyin is larger than 1, determining whether the user uses the scene hotword corresponding to the first pinyin;
if the user uses the scene hot word corresponding to the first pinyin, determining whether the number of third words used by the user in the scene hot word corresponding to the first pinyin is greater than 1;
if the number of the third vocabulary is larger than 1, determining that the difference value score of the scene hot word with the highest use frequency or the highest heat degree in the third vocabulary is the highest;
if the number of the third vocabulary is equal to 1, determining that the difference value score of the third vocabulary is the highest;
if the user does not use the scene hot word corresponding to the first pinyin, determining that the difference value score of the scene hot word with the highest heat in the scene hot words corresponding to the first pinyin is the highest;
and if the number of the first pinyin is equal to 1, determining that the difference value score of the scene hotword corresponding to the first pinyin is the highest.
6. The method of claim 3, wherein prior to determining the native and/or living address of the user, the method further comprises:
obtaining the mandarin level of the user;
determining that the Mandarin level does not reach a preset level.
7. A speech recognition apparatus, applied to a server in a speech recognition system, wherein the speech recognition system includes the server and a terminal device for speech interaction with a user, and the server includes a human-computer interaction engine supporting human-computer speech interaction, the apparatus comprising:
the acquisition unit is used for calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
a scene recognition unit, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
the scene related word extracting unit is used for extracting scene related words from the first text to obtain target scene related words corresponding to the first text, and the target scene related words are used for representing the service content of the service type required to be provided and expressed by the first text;
the scene hot word set query unit is used for carrying out scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
the comparison unit is used for performing pinyin comparison on the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set, wherein the scene hot words are words with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the words in all users;
the first determining unit is used for determining a target scene hot word with the highest difference value score in the target scene hot word set;
the replacing unit is used for replacing the target scene associated word in the first text with the target scene hotword to obtain a second text;
a second determining unit, configured to determine, according to the second text, a user intention expressed by the target speech information; and
the service unit is used for executing corresponding service operation according to the determined user intention;
in the aspect that the target scene associated word is pinyin-compared with the scene hotword in the target scene hotword set to obtain a difference score between the target scene associated word and the scene hotword in the target scene hotword set, the comparing unit is specifically configured to:
determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set;
if yes, determining whether the number of the first vocabulary is larger than 1;
if yes, determining whether the user has previously queried the first vocabulary;
if yes, determining whether the number of the second words inquired in the first words is larger than 1;
if yes, determining whether the time interval between the query time of the user for each second vocabulary and the current time is greater than a preset interval;
if yes, determining that the difference value score of the scene hot word with the largest query frequency in the second vocabulary is the highest;
if not, determining that the difference value score of the scene hot word with the shortest time interval between the query time and the current time in the second vocabulary is the highest.
8. An electronic device comprising a processor, memory, and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program/instructions are stored, which, when being executed by a processor, carry out the steps of the method of any one of claims 1 to 6.
CN202211487069.2A 2022-11-25 2022-11-25 Speech recognition method and related product Active CN115547337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211487069.2A CN115547337B (en) 2022-11-25 2022-11-25 Speech recognition method and related product

Publications (2)

Publication Number Publication Date
CN115547337A CN115547337A (en) 2022-12-30
CN115547337B true CN115547337B (en) 2023-03-03

Family

ID=84719741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211487069.2A Active CN115547337B (en) 2022-11-25 2022-11-25 Speech recognition method and related product

Country Status (1)

Country Link
CN (1) CN115547337B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860823B * 2023-03-03 2023-05-16 Shenzhen Renma Interactive Technology Co., Ltd. Data processing method in man-machine interaction questionnaire answer scene and related products

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106030699A (en) * 2014-10-09 2016-10-12 Google Inc. Hotword detection on multiple devices
CN109346060A (en) * 2018-11-28 2019-02-15 Kebote Robot (Tianjin) Co., Ltd. Audio recognition method, device, equipment and storage medium
CN109920432A (en) * 2019-03-05 2019-06-21 Baidu Online Network Technology (Beijing) Co., Ltd. A kind of audio recognition method, device, equipment and storage medium
CN111292745A (en) * 2020-01-23 2020-06-16 Beijing SoundAI Technology Co., Ltd. Method and device for processing voice recognition result and electronic equipment
CN113160822A (en) * 2021-04-30 2021-07-23 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech recognition processing method, speech recognition processing device, electronic equipment and storage medium
CN113223516A (en) * 2021-04-12 2021-08-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech recognition method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190100095A (en) * 2019-08-08 2019-08-28 LG Electronics Inc. Method and device for speech processing
CN114254660A (en) * 2020-09-22 2022-03-29 Beijing Samsung Telecommunication Technology Research Co., Ltd. Multi-modal translation method and device, electronic equipment and computer-readable storage medium
US11749267B2 (en) * 2020-11-20 2023-09-05 Google Llc Adapting hotword recognition based on personalized negatives

Also Published As

Publication number Publication date
CN115547337A (en) 2022-12-30

Similar Documents

Publication Publication Date Title
US20220301566A1 (en) Contextual voice commands
US10417344B2 (en) Exemplar-based natural language processing
CN110069608B (en) Voice interaction method, device, equipment and computer storage medium
US20190147052A1 (en) Method and apparatus for playing multimedia
US9930167B2 (en) Messaging application with in-application search functionality
CN110223695B (en) Task creation method and mobile terminal
US8170537B1 (en) Playing local device information over a telephone connection
US20180286459A1 (en) Audio processing
US20160093298A1 (en) Caching apparatus for serving phonetic pronunciations
US20110167350A1 (en) Assist Features For Content Display Device
US20120265533A1 (en) Voice assignment for text-to-speech output
KR20140047633A (en) Speech recognition repair using contextual information
US20240070217A1 (en) Contextual deep bookmarking
AU2006325555B2 (en) A method and apparatus for accessing a digital file from a collection of digital files
CN115547337B (en) Speech recognition method and related product
CN111383631A (en) Voice interaction method, device and system
KR20220052581A (en) Method and system for providing search results incorporating the intent of search query
CN105912586B (en) Information searching method and electronic equipment
KR20150088564A (en) E-Book Apparatus Capable of Playing Animation on the Basis of Voice Recognition and Method thereof
CN105684012B (en) Providing contextual information
CN110503991B (en) Voice broadcasting method and device, electronic equipment and storage medium
CN112837668B (en) Voice processing method and device for processing voice
CN113360127A (en) Audio playing method and electronic equipment
AU2018250484A1 (en) Contextual voice commands
CN110379413B (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant