CN115547337A - Speech recognition method and related product - Google Patents
- Publication number
- CN115547337A (application number CN202211487069.2A)
- Authority
- CN
- China
- Prior art keywords
- scene
- target
- pinyin
- user
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS › G10—MUSICAL INSTRUMENTS; ACOUSTICS › G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING › G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems › G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/08—Speech classification or search › G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/08—Speech classification or search › G10L15/18—Speech classification or search using natural language modelling › G10L15/1822—Parsing for meaning understanding
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L15/22 › G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Document Processing Apparatus (AREA)
Abstract
The application provides a speech recognition method and a related product, wherein the method comprises the following steps: a server calls a human-computer interaction engine to interact with a user through a terminal device and obtains target speech information input by the user during the interaction; the server performs character recognition on the target speech information to obtain a first text, and performs scene recognition and scene associated word extraction on the first text to determine the target service scene and the target scene associated word corresponding to the first text; pinyin comparison is performed between the target scene associated word and the scene hot words in the target scene hot word set corresponding to the target service scene to obtain a difference score between the target scene associated word and each scene hot word; the target scene associated word in the first text is replaced with the target scene hot word having the highest difference score in the set to obtain a second text; and the corresponding service operation is executed according to the user intention of the second text. In this way, the accuracy of speech recognition can be improved and the user experience enhanced.
Description
Technical Field
The application belongs to the technical field of general data processing in the Internet industry, and in particular relates to a speech recognition method and a related product.
Background
With the development of the Internet industry, voice interaction between users and devices such as mobile phones has become common: corresponding services are provided to a user based on the voice information the user inputs during the interaction, so accurate speech recognition is essential to ensure that the services meet the user's needs. At present, however, when existing systems perform speech recognition, the recognition result is often inaccurate because the user's pronunciation is non-standard or because homophones exist.
Disclosure of Invention
The application provides a speech recognition method and a related product, so as to improve the accuracy of speech recognition and the user experience.
In a first aspect, an embodiment of the present application provides a speech recognition method, which is applied to a server in a speech recognition system, where the speech recognition system includes the server and a terminal device for performing speech interaction with a user, the server includes a human-computer interaction engine supporting human-computer speech interaction, and the method includes:
calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
performing scene recognition on the first text and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing the type of service, expressed by the first text, that needs to be provided;
extracting a scene associated word from the first text to obtain a target scene associated word corresponding to the first text, wherein the target scene associated word is used for representing the service content, expressed by the first text, of the service type that needs to be provided;
performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
performing pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference score between the target scene associated word and each scene hot word in the set, wherein a scene hot word is a word whose popularity is greater than a popularity threshold, and popularity refers to the query popularity of the word across all users;
determining the target scene hot word with the highest difference score in the target scene hot word set;
replacing a target scene associated word in the first text with the target scene hot word to obtain a second text;
determining, according to the second text, the user intention expressed by the target speech information; and,
executing a corresponding service operation according to the determined user intention.
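The steps above can be sketched end-to-end in code. The following Python sketch is illustrative only: the function names and parameters are assumptions, and the recognizer, scene classifier, associated-word extractor, and pinyin scorer are passed in as stand-ins for the components described in this application.

```python
def recognize_and_correct(audio, asr, classify_scene, extract_keyword,
                          hotword_sets, pinyin_score):
    """Hypothetical sketch of the first-aspect method (names are assumptions)."""
    first_text = asr(audio)                    # character recognition -> first text
    scene = classify_scene(first_text)         # target service scene
    keyword = extract_keyword(first_text)      # target scene associated word
    hotwords = hotword_sets[scene]             # target scene hot word set
    # Pinyin comparison: a larger difference score means the words sound more alike.
    best = max(hotwords, key=lambda hw: pinyin_score(keyword, hw))
    # Replace the associated word with the best-scoring hot word -> second text.
    return first_text.replace(keyword, best)
```

The resulting second text would then feed intent determination and service execution, as in the last two steps above.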
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, which is applied to a server in a speech recognition system, where the speech recognition system includes a server and a terminal device for performing speech interaction with a user, the server includes a human-computer interaction engine supporting human-computer speech interaction, and the apparatus includes:
the acquisition unit is used for calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
a scene recognition unit, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
the scene associated word extracting unit is used for extracting a scene associated word from the first text to obtain a target scene associated word corresponding to the first text, wherein the target scene associated word is used for representing the service content of the service type required to be provided and expressed by the first text;
the scene hot word set query unit is used for carrying out scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
a comparison unit, configured to perform pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference score between the target scene associated word and each scene hot word in the set, where a scene hot word is a word whose popularity is greater than a popularity threshold, and popularity refers to the query popularity of the word across all users;
a first determining unit, configured to determine the target scene hot word with the highest difference score in the target scene hot word set;
the replacing unit is used for replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
a second determining unit, configured to determine, according to the second text, the user intention expressed by the target speech information; and,
a service unit, configured to execute a corresponding service operation according to the determined user intention.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and one or more programs, stored in the memory and configured to be executed by the processor, where the program includes instructions for performing the steps in the method according to the first aspect of the embodiment of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program/instructions are stored, which, when executed by a processor, implement the steps of the method according to the first aspect of the embodiments of the present application.
In a fifth aspect, the present application provides a computer program product, which includes a computer program/instruction, and when executed by a processor, implements the steps of the method according to the first aspect of the present application.
In the embodiment of the application, the server first calls the human-computer interaction engine to interact with the user through the terminal device and obtains the target speech information input by the user during the interaction. It performs character recognition on the target speech information to obtain a first text, performs scene recognition and scene associated word extraction on the first text to determine the target service scene and the target scene associated word corresponding to the first text, and performs pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set corresponding to the target service scene to obtain the difference score between the associated word and each hot word. It then replaces the target scene associated word in the first text with the target scene hot word having the highest difference score in the set to obtain a second text, determines the user intention expressed by the target speech information according to the second text, and finally executes the corresponding service operation according to that intention. In this way, the server corrects the first text by successively performing scene recognition, scene associated word extraction, and pinyin comparison between the scene associated word and the scene hot words, obtaining a corrected second text. This avoids inaccurate recognition results caused by non-standard pronunciation or homophones, which is favourable for improving the accuracy of speech recognition and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a block diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3a is a schematic flowchart of a process for obtaining a difference score between a target scene related word and a scene hotword in a target scene hotword set according to an embodiment of the present application;
fig. 3b is a first schematic diagram of the server interacting with the terminal device according to an embodiment of the present application;
fig. 3c is a second schematic diagram of the server interacting with the terminal device according to an embodiment of the present application;
fig. 3d is a third schematic diagram of the server interacting with the terminal device according to an embodiment of the present application;
fig. 3e is a fourth schematic diagram of the server interacting with the terminal device according to an embodiment of the present application;
fig. 3f is a fifth schematic diagram of the server interacting with the terminal device according to an embodiment of the present application;
fig. 4 is a block diagram illustrating functional units of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 5 is a block diagram illustrating functional units of another speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
First, a system architecture according to an embodiment of the present application will be described.
Referring to fig. 1, fig. 1 is a block diagram of a speech recognition system according to an embodiment of the present disclosure. As shown in fig. 1, the speech recognition system 10 includes a server 11 and a terminal device 12 for performing voice interaction with a user, and the server 11 is in communication connection with the terminal device 12. The server 11 includes a human-computer interaction engine supporting human-computer voice interaction: by calling the engine, the server 11 interacts with the user through the terminal device 12 to obtain the target voice information input by the user during the interaction, performs character recognition on the target voice information to obtain a first text, analyzes the user intention expressed by the target voice information according to the first text, and executes the corresponding service operation according to the determined user intention. The server 11 may be a single server, a server cluster composed of a plurality of servers, or a cloud computing service center; the terminal device 12 may be a mobile phone, a tablet computer, a notebook computer, or the like.
Based on this, the embodiment of the present application provides a speech recognition method, and the following describes the embodiment of the present application in detail with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application. The method is applied to the server 11 in the speech recognition system 10 shown in fig. 1; the speech recognition system 10 includes the server 11 and a terminal device 12 for performing voice interaction with a user, and the server 11 includes a human-computer interaction engine supporting human-computer voice interaction. As shown in fig. 2, the method includes:
Here, the target service scene is used for representing the type of service, expressed by the first text, that needs to be provided. The service scene may be, but is not limited to, one of a song listening service scene, a novel reading service scene, a video service scene, and a navigation service scene.
For example, when the service scene is a song listening service scene, the service type may be a song playing service; when the service scene is a novel reading service scene, the service type may be a novel pushing service; when the service scene is a shopping service scene, the service type may be a commodity pushing service; and when the service scene is a navigation service scene, the service type may be a navigation service.
The target scene associated word is used for representing the service content of the service type which is expressed by the first text and needs to be provided.
Illustratively: when the service type is a song playing service, the scene associated word is a person's name, and the service content is that the server pushes songs associated with that name to the terminal device for playback; the name may be, but is not limited to, the singer, lyricist, or composer of the song. When the service type is a novel pushing service, the scene associated word is a person's name, and the service content is that the server pushes novel information associated with that name to the terminal device for display; the name may indicate, but is not limited to, the author, a recommender, or an instructor of the novel. When the service type is a commodity pushing service, the scene associated word is a commodity name, and the service content is that the server pushes commodity information associated with that name to the terminal device for display. When the service type is a navigation service, the scene associated word is a place name, and the service content is that the server pushes navigation information associated with that place name to the terminal device, so that the terminal device can navigate the user to the place.
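The scene-to-service correspondences in the examples above amount to a dispatch table. The sketch below is hypothetical: the scene keys and handler bodies are invented for illustration and are not the claimed implementation.

```python
def push_song(name):   # song playing service: push songs tied to a person's name
    return "push song associated with " + name

def push_novel(name):  # novel pushing service: push novel info tied to a person's name
    return "push novel info associated with " + name

def push_goods(name):  # commodity pushing service: push goods info by commodity name
    return "push commodity info associated with " + name

def navigate(name):    # navigation service: push navigation info for a place name
    return "navigate user to " + name

# Illustrative correspondence between service scene and service content.
SCENE_SERVICES = {
    "listen_song": push_song,
    "read_novel": push_novel,
    "shopping": push_goods,
    "navigation": navigate,
}

def serve(scene, associated_word):
    return SCENE_SERVICES[scene](associated_word)
```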
Step 204, performing a scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene.
The correspondence between the various service scenes and their scene hot word sets is preset.
A scene hot word is a word whose popularity is greater than a popularity threshold; popularity in this application refers to the query popularity of the word across all users. The higher a word's popularity, the more times all users have queried it; conversely, the lower its popularity, the fewer times it has been queried.
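Under this definition, a scene hot word set could be derived from an aggregate query log by keeping only the words whose all-user query count exceeds the popularity threshold. A minimal sketch (the log format and threshold are assumptions):

```python
from collections import Counter

def build_hotword_set(query_log, threshold):
    """query_log: iterable of words queried by all users in one service scene.
    Returns the words whose query count exceeds the popularity threshold."""
    counts = Counter(query_log)
    return {word for word, n in counts.items() if n > threshold}
```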
It should be noted that the larger the difference score between two words, the higher their similarity, i.e. the more alike the two words sound; conversely, the smaller the difference score, the lower their similarity, i.e. the greater the difference between the two words.
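The application does not spell out how the pinyin comparison is turned into a difference score; it states only that a larger score means the two words sound more alike. One plausible realization, assuming the pinyin transcriptions are already available as strings, is edit similarity over the pinyin (an assumption, not the claimed formula):

```python
from difflib import SequenceMatcher

def pinyin_diff_score(pinyin_a, pinyin_b):
    """Return a score in [0, 1]; identical pinyin scores 1.0 (most similar)."""
    return SequenceMatcher(None, pinyin_a, pinyin_b).ratio()
```

With this convention, two words with completely identical pinyin always receive the maximum score.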
Step 207, replacing the target scene associated word in the first text with the target scene hot word to obtain a second text.
Step 208, determining, according to the second text, the user intention expressed by the target speech information.
Step 209, executing a corresponding service operation according to the determined user intention.
The target scene hot word set may or may not include a scene hot word that is identical to the target scene associated word.
It can be seen that, in the embodiment of the application, the server first calls the human-computer interaction engine to interact with the user through the terminal device and obtains the target speech information input by the user during the interaction. It performs character recognition on the target speech information to obtain a first text, performs scene recognition and scene associated word extraction on the first text to determine the target service scene and the target scene associated word corresponding to the first text, and performs pinyin comparison between the target scene associated word and the scene hot words in the target scene hot word set corresponding to the target service scene to obtain the difference score between the associated word and each hot word. It then replaces the target scene associated word in the first text with the target scene hot word having the highest difference score in the set to obtain a second text, determines the user intention expressed by the target speech information according to the second text, and finally executes the corresponding service operation according to that intention. In this way, the server corrects the first text by successively performing scene recognition, scene associated word extraction, and pinyin comparison between the scene associated word and the scene hot words, obtaining a corrected second text. This avoids inaccurate recognition results caused by non-standard pronunciation or homophones, which is favourable for improving the accuracy of speech recognition and the user experience.
For convenience of understanding, a process of obtaining a difference value score between a target scene related word and a scene hotword in a target scene hotword set in the embodiment of the present application will be described below.
Referring to fig. 3a, fig. 3a is a schematic flowchart of a process for obtaining a difference score between the target scene associated word and a scene hot word in the target scene hot word set according to an embodiment of the present disclosure. As shown in fig. 3a, the flow A for obtaining the difference score includes:
If so, go to step 302.
After step 302, if yes, step 303 is executed.
After step 303, if yes, go to step 304.
Step 304, determining whether the number of second words, i.e. first words that the user has queried, is greater than 1.
After step 304, if yes, step 305 is performed.
Step 305, determining whether the time interval between the user's query time for each second word and the current time is greater than a preset interval.
The preset interval may be, for example, 10 days, 15 days, or 30 days, and is not specifically limited.
After step 305, if yes, go to step 306.
For example, suppose the preset interval is 10 days, the target scene associated word is "apore", and the target scene hot word set is scene hot word set B, in which there are 4 first words whose pinyin is identical to that of "apore": "ashan", "arse", "araucaria" and "argy". The user's query on "arse" was 11 days before the current time and the user's query on "argy" was 12 days before the current time; the user queried "arse" 5 times and "argy" once; there are no query records of the user for "ashan" or "araucaria". In the above flow for obtaining the difference scores, it is first determined that first words with pinyin completely identical to that of the associated word exist in set B; next, that their number, 4, is greater than 1; next, that the user has queried "arse" and "argy", so the number of queried words, 2, is greater than 1; next, that both query times are more than 10 days from the current time (11 days and 12 days respectively); and finally, that the difference score of "arse", the queried word with the largest number of queries, is the highest.
For example, in combination with a specific application scenario, refer to fig. 3b, a first schematic diagram of the server interacting with the terminal device according to an embodiment of the present application. The server asks the user what help is needed; the terminal device acquires the target speech information input by the user, asking to read the novels of the above word. Based on the above flow, the server obtains the second text "I want to read the novels of arse", determines from the second text that the user's intention is to read those novels, and pushes a first page to the terminal device, where the first page includes a website link "www." to the novels and a user operation prompt message. Preferably, the target scene hot word in the prompt message can be highlighted, for example in bold or a darker color. The terminal device displays the first page, and the user clicks the "www." link.
As an optional flow branch, flow A further includes: after step 305, if no, executing step 307.
For example, suppose the preset interval is 5 days, the target scene associated word is "ashan", and the target scene hot word set is scene hot word set A, in which there are 3 first words whose pinyin is identical to that of "ashan": "ashan", "arsal" and "argy". The user's query on "ashan" was 3 days before the current time, the query on "arsal" was 4 days before, and the query on "argy" was 7 days before. In the above flow for obtaining the difference scores, it is first determined that first words with pinyin identical to that of "ashan" exist in set A; next, that their number, 3, is greater than 1; next, that the user has queried all of "ashan", "arsal" and "argy", so the number of queried words, 3, is greater than 1; next, that the query times of "ashan" and "arsal" are within 5 days of the current time (3 days and 4 days respectively); and finally, that the difference score of "ashan", whose query time has the shortest interval to the current time, is the highest. Illustratively, in combination with a specific application scenario: the server asks the user what service is needed; the terminal device obtains the target speech information input by the user, asking to hear the songs of ashan; the server obtains the second text "I want to hear the songs of ashan" based on the above flow and determines the subsequent reply according to the second text: the music of ashan will be played for the user. Meanwhile, the server pushes the music of ashan to the terminal device so that the terminal device can play it.
As another optional flow branch, the flow A further includes: after the step 302, if the number of first vocabularies is equal to 1, determining that the difference value score of the first vocabulary is the highest.
For example, the target scene associated word is "Asong", the target scene hotword set is a scene hotword set C, and exactly 1 first vocabulary whose pinyin is identical to that of "Asong" — a homophone written with different characters, denoted "Asong-1" — exists in the scene hotword set C. Based on the above process of obtaining the difference value scores of the target scene associated word against the scene hotwords in the target scene hotword set, it is first determined that the first vocabulary "Asong-1", with pinyin identical to that of "Asong", exists in the scene hotword set C; then, based on the number of first vocabularies, 1, being equal to 1, it is determined that the difference value score of "Asong-1" is the highest. For example, referring to fig. 3c in combination with a specific application scenario, fig. 3c is a schematic diagram of interaction between a second server and a terminal device provided in an embodiment of the present application. The server asks the user: "May I ask what service you need?" If the terminal device acquires the target voice information input by the user, "I want to listen to Asong's song", the server obtains a second text "I want to listen to Asong-1's song" based on the above flow, determines from the second text that the user's intention is to listen to Asong-1's song, and pushes a second page to the terminal device, where the second page includes a user prompt message like "Asong-1's song will be played for you". Preferably, the target scene hotword "Asong-1" in the user prompt message can be highlighted, e.g., in bold, in a deepened color, or in an enlarged font. The terminal device displays the second page; then, the server pushes Asong-1's song to the terminal device, and the terminal device plays the song pushed by the server.
As another optional flow branch, the flow A further includes: after the step 303, if the user has not queried any of the first vocabularies, determining that the difference value score of the scene hotword with the highest heat among the first vocabularies is the highest.
For example, the target scene associated word is "Asha", the target scene hotword set is a scene hotword set D, two first vocabularies whose pinyin is identical to that of "Asha" — denoted "Asha-1" and "Asha-2" — exist in the scene hotword set D, the user has not performed a query on either "Asha-1" or "Asha-2", the heat of "Asha-1" is 1113, and the heat of "Asha-2" is 6001. In the above process of obtaining the difference value scores of the target scene associated word against the scene hotwords in the target scene hotword set, it is first determined that "Asha-1" and "Asha-2", with pinyin identical to that of "Asha", exist in the scene hotword set D; next, it is determined that their number, 2, is greater than 1; next, it is determined that the user has not performed a query on "Asha-1" or "Asha-2"; and it is therefore determined that the difference value score of "Asha-2", the one with the highest heat, is the highest. For example, referring to fig. 3d in combination with a specific application scenario, fig. 3d is a schematic diagram of interaction between a third server and a terminal device provided in an embodiment of the present application, where the target voice information acquired by the terminal device and input by the user is "I want to watch Asha's show". Based on the above flow, the server obtains a second text "I want to watch Asha-2's show", determines from the second text that the user's intention is to watch Asha-2's show, and pushes a third page to the terminal device, where the third page includes user query information like "Play Asha-2's show?" together with a first button "Yes" and a second button "No", and the target scene hotword "Asha-2" in the user prompt information is highlighted with an enlarged font. The terminal device displays the third page; thereafter, if the user clicks the first button "Yes", the server pushes the show performed by Asha-2 to the terminal device.
As another optional flow branch, the flow A further includes: after the step 304, if the number of second vocabularies is equal to 1, determining that the difference value score of the second vocabulary is the highest.
For example, the target scene associated word is "Ashan", the target scene hotword set is a scene hotword set E, 5 first vocabularies whose pinyin is identical to that of "Ashan" — denoted "Ashan-1" through "Ashan-5" — exist in the scene hotword set E, and the user has performed a query only on "Ashan-3". In the above process of obtaining the difference value scores of the target scene associated word against the scene hotwords in the target scene hotword set, it is first determined that "Ashan-1" through "Ashan-5", with pinyin identical to that of "Ashan", exist in the scene hotword set E; next, it is determined that their number, 5, is greater than 1; next, it is determined that the user has performed a query on "Ashan-3"; finally, based on the number of queried second vocabularies, 1, being equal to 1, it is determined that the difference value score of "Ashan-3" is the highest. For example, referring to fig. 3e in combination with a specific application scenario, fig. 3e is a schematic diagram of interaction between a fourth server and a terminal device provided in an embodiment of the present application. If the target voice information acquired by the terminal device and input by the user is "What dialect does Ashan speak?", the server obtains a second text "What dialect does Ashan-3 speak?" based on the above flow, determines from the second text that the user's intention is to learn about Ashan-3's dialect, and accordingly pushes a fourth page including Ashan-3's dialect information to the terminal device. The terminal device displays the fourth page, so that the user can conveniently view Ashan-3's dialect.
As another optional flow branch, the flow A further includes: after the step 301, if no first vocabulary exists, performing pinyin replacement on the pinyin of the target scene associated word to obtain a replaced pinyin; and comparing the replaced pinyin with the scene hotwords in the target scene hotword set to obtain the difference value scores of the target scene associated word against the scene hotwords in the target scene hotword set.
For example, the target scene associated word is "Asan", the target scene hotword set is a scene hotword set F, and no first vocabulary whose pinyin is identical to that of "Asan" exists in the scene hotword set F. Based on the above process of obtaining the difference value scores of the target scene associated word against the scene hotwords in the target scene hotword set, it is first determined that no first vocabulary with pinyin identical to that of "Asan" exists in the scene hotword set F; pinyin replacement is then performed on the pinyin of "Asan" to obtain a replaced pinyin; and the replaced pinyin is compared with the scene hotwords in the scene hotword set F to obtain the difference value scores of "Asan" against the scene hotwords in the scene hotword set F.
In this example, the server can accurately determine the target scene hot word with the highest difference value score in the target scene hot word set by combining the scene hot word in the target scene hot word set and the pinyin of the target scene associated word, and the query time and the query number of the user on the scene hot word in the target scene hot word set, so that the accuracy of the target scene hot word is improved.
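The branch logic of flow A described above — match homophones by pinyin, fall back on the user's query status, then prefer the hotword most recently queried within the preset interval — can be sketched as follows. The data layout, field names and the helper function itself are assumptions made for illustration, not part of the claimed method:

```python
from datetime import datetime, timedelta

def pick_target_hotword(associated_pinyin, hotwords, now, preset_interval_days=5):
    """Sketch of flow A: pick the homophone scene hotword with the highest
    difference value score.  `hotwords` maps each scene hotword to a dict
    {'pinyin': str, 'heat': int, 'last_query': datetime or None}; this
    layout and the field names are illustrative assumptions."""
    # Step ~301: the "first vocabularies" are hotwords with identical pinyin.
    first = {w: m for w, m in hotwords.items() if m['pinyin'] == associated_pinyin}
    if not first:
        return None                      # fall back to pinyin replacement
    if len(first) == 1:                  # branch after step 302
        return next(iter(first))
    # Step ~303: restrict to the hotwords the user has actually queried.
    queried = {w: m for w, m in first.items() if m['last_query'] is not None}
    if not queried:                      # branch after step 303: highest heat wins
        return max(first, key=lambda w: first[w]['heat'])
    if len(queried) == 1:                # branch after step 304
        return next(iter(queried))
    # Steps ~305-306: prefer queries inside the preset interval, and among
    # them the one whose query time is closest to the current time.
    window = timedelta(days=preset_interval_days)
    recent = {w: m for w, m in queried.items() if now - m['last_query'] < window}
    pool = recent or queried
    return min(pool, key=lambda w: now - pool[w]['last_query'])
```

With the "Ashan" example above (queries 3, 4 and 7 days ago, 5-day interval), the sketch selects the homophone queried 3 days ago.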
In a possible example, the implementation manner of performing pinyin replacement on the pinyin of the target scene associated word to obtain the replaced pinyin includes, but is not limited to: determining the native place and/or living address of the user; determining the pronunciation feature corresponding to the native place and/or living address; determining, according to the pronunciation feature, the number of pinyins that can be replaced among the pinyins corresponding to the target scene associated word; and if the number of replaceable pinyins is greater than 1, performing pinyin replacement sequentially according to the order of occurrence, in the target scene associated word, of the characters that need pinyin replacement, so as to obtain a plurality of replaced pinyins.
Wherein the living address may include an address where the user currently lives and has lived for more than a first preset time, and the first preset time may be one year, two years, half a year, and the like. Illustratively, the first preset time is two years; if the user currently lives in place A and has lived in place A for 4 years, place A is a living address of the user.
In addition, the living address may further include an address where the user has previously lived for more than a second preset time, which may be two years, three years, five years, and the like; the first preset time and the second preset time may be the same or different, and preferably the second preset time is longer than the first preset time. Illustratively, the first preset time is one year and the second preset time is five years; if the user has currently lived in place A for two years, once lived in place B for 3 years, and once lived in place C for 6 years, then place A and place C are both living addresses of the user.
It can be understood that there may be at least one living address, and the native place and the living address may or may not coincide. For example, the native place of the user is place A, and the living addresses of the user may be place A and place B; or the native place of the user is place A, and the living addresses of the user may be place B and place C.
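The two preset-time rules above admit a compact sketch. The residence-history representation below (a list of place/start/end triples, with `end=None` meaning the current residence) is an assumed encoding, not drawn from the application:

```python
from datetime import date

def living_addresses(residences, today, first_years=1, second_years=5):
    """Sketch: a living address is (a) the current residence held longer
    than the first preset time, or (b) any past residence held longer
    than the second preset time.  `residences` is a list of
    (place, start_date, end_date_or_None) triples (assumed encoding)."""
    out = []
    for place, start, end in residences:
        years = ((end or today) - start).days / 365.25
        if end is None and years > first_years:        # current residence rule
            out.append(place)
        elif end is not None and years > second_years:  # past residence rule
            out.append(place)
    return out
```

Run against the illustrative figures above (place A current for two years, place B for 3 past years, place C for 6 past years, first preset time one year, second preset time five years), the sketch returns places A and C.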
For example, the target scene associated word is "Zhou San", the target scene hotword set is a scene hotword set G, no first vocabulary whose pinyin is completely identical to that of "Zhou San" exists in the scene hotword set G, the native place of the user is place A, and the pronunciation feature of place A is that flat-tongue and retroflex initials are not distinguished. In the specific implementation, after it is determined that no first vocabulary completely identical in pinyin to "Zhou San" exists in the scene hotword set G, the native place of the user, place A, is determined; the pronunciation feature 1 of place A, namely that flat-tongue and retroflex initials are not distinguished, is determined; according to the pronunciation feature 1, it is determined that both pinyins corresponding to "Zhou San", "zhou" and "san", can be replaced by pinyin ("zhou" by "zou", and "san" by "shan"); the number of replaceable pinyins, 2, is greater than 1; and pinyin replacement is performed sequentially according to the order of occurrence of the characters "Zhou" and "San" that need pinyin replacement in "Zhou San", so as to obtain a plurality of replaced pinyins "zou san", "zhou shan" and "zou shan".
For another example, the target scene associated word is "Daodian Chifan" ("eat when it is time"), the target scene hotword set is a scene hotword set H, no first vocabulary whose pinyin is identical to that of "Daodian Chifan" exists in the scene hotword set H, the living address of the user is place B, and the pronunciation feature of place B is that certain initials are pronounced differently from standard Mandarin, e.g., "chi" is pronounced "qi". In the specific implementation, after it is determined that no first vocabulary completely identical in pinyin to "Daodian Chifan" exists in the scene hotword set H, the living address of the user, place B, is determined; the pronunciation feature 2 of place B is determined; according to the pronunciation feature 2, it is determined that two of the pinyins corresponding to "Daodian Chifan" can be replaced; the number of replaceable pinyins, 2, is greater than 1; and pinyin replacement is performed sequentially according to the order of occurrence of the characters that need pinyin replacement in "Daodian Chifan", so as to obtain a plurality of replaced pinyins.
For another example, the target scene associated word is "Wan Hong", the target scene hotword set is a scene hotword set I, no first vocabulary whose pinyin is identical to that of "Wan Hong" exists in the scene hotword set I, the native place of the user is place A, and the living addresses of the user are place A and place B; the pronunciation feature of place A is that flat-tongue and retroflex initials are not distinguished, and the pronunciation feature of place B is that the initial "h" is pronounced as "f", so that "hong" is pronounced "feng". In the specific implementation, after it is determined that no first vocabulary completely identical in pinyin to "Wan Hong" exists in the scene hotword set I, the native place, place A, and the living addresses, place A and place B, of the user are determined; the pronunciation feature 1 of place A and the pronunciation feature 3 of place B are determined; according to the pronunciation features, it is determined that two of the pinyins corresponding to "Wan Hong" can be replaced; the number of replaceable pinyins, 2, is greater than 1; and pinyin replacement is performed sequentially according to the order of occurrence of the characters that need pinyin replacement in "Wan Hong", so as to obtain a plurality of replaced pinyins, among them "wan feng".
In this example, when the server performs pinyin replacement on the pinyin of the target scene associated word, the server can accurately determine the pronunciation characteristics of the user based on the native place and/or the living address of the user, perform pinyin replacement on each pinyin corresponding to the target scene associated word according to the pronunciation characteristics to obtain a replaced pinyin, ensure that the replaced pinyin conforms to the pronunciation habits of the user, and further improve the reliability of the replaced pinyin.
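Under the assumptions used above (the word's pinyin split per character, and a dialect substitution table standing in for a "pronunciation feature"), the sequential replacement can be sketched as follows; the rule contents are illustrative:

```python
from itertools import product

def replaced_pinyins(syllables, rules):
    """Sketch: generate the replaced pinyins of a target scene associated
    word.  `syllables` is the word's pinyin split per character, e.g.
    ['zhou', 'san']; `rules` maps a syllable to its dialect variant, e.g.
    {'zhou': 'zou', 'san': 'shan'} for the 'flat-tongue vs retroflex not
    distinguished' feature (illustrative assumption)."""
    # Each character keeps its original pinyin and, if a rule applies,
    # gains a variant; character order is preserved.
    options = [(s, rules[s]) if s in rules else (s,) for s in syllables]
    variants = []
    for combo in product(*options):
        if list(combo) != syllables:    # skip the unreplaced original
            variants.append(' '.join(combo))
    return variants
```

For the "Zhou San" example with the flat-tongue/retroflex rule, two characters are replaceable, which yields three replaced pinyins, matching the plurality of replaced pinyins described above.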
In a possible example, the implementation manner of comparing the replaced pinyin with the scene hotword in the target scene hotword set to obtain the difference value score between the target scene associated word and the scene hotword in the target scene hotword set may include, but is not limited to:
step A1, determining whether a target alternative pinyin which is completely the same as the pinyin of the scene hotword in the target scene hotword set exists in the multiple alternative pinyins.
After step A1, if present, step A2 is performed.
And A2, determining the number of the target alternative pinyins.
After the step A2, if the number of the target alternative pinyins is 1, the step A3 is executed.
And A3, determining that the difference value score of the scene hot word corresponding to the target alternative pinyin is highest.
For example, the target scene associated word is "Wan Hong", the target scene hotword set is a scene hotword set I, and the determined plurality of alternative pinyins of "Wan Hong" includes "wan feng", among others. Among these alternative pinyins, "wan feng" is completely identical to the pinyin of a scene hotword in the scene hotword set I, and the scene hotword corresponding to "wan feng" is "Wanfeng" ("evening wind"). In the specific implementation, it is first determined that, among the plurality of alternative pinyins, the target alternative pinyin "wan feng", identical to the pinyin of a scene hotword in the scene hotword set I, exists; then, based on the number of target alternative pinyins being 1, it is determined that the difference value score of "Wanfeng" is the highest. For example, referring to fig. 3f in combination with a specific application scenario, fig. 3f is a schematic diagram of interaction between a fifth server and a terminal device provided in an embodiment of the present application, where the target voice information acquired by the terminal device and input by the user is "Play the music Wan Hong". The server obtains a second text "Play the music Wanfeng", determines from the second text that the user's intention is to listen to the song "Wanfeng", and pushes a fifth page to the terminal device, where the fifth page includes user query information like "Play the song Wanfeng?" together with a first button "Yes" and a second button "No", and the target scene hotword "Wanfeng" in the user prompt information is shown in bold relative to the other words. The terminal device displays the fifth page; then, if the user clicks the first button "Yes", the server pushes the song "Wanfeng" to the terminal device, and the terminal device plays the song pushed by the server.
As an optional branch, after the step A1, if no target alternative pinyin completely identical to the pinyin of a scene hotword in the target scene hotword set exists among the plurality of alternative pinyins, prompt information is generated and sent to the terminal device to prompt the user that the user intention has not been recognized.
As an optional branch, after the step A2, if the number of the target alternative pinyins is at least two, the step A4 is executed.
And step A4, calculating the difference value score of the scene hotword corresponding to each target alternative pinyin according to the number of replaced pinyins in the target alternative pinyin and the user's number of uses, or the heat, of the scene hotword corresponding to the target alternative pinyin.
For example, the target scene associated word is "Zhou San", the target scene hotword set is a scene hotword set G, and the determined plurality of alternative pinyins of "Zhou San" are "zou san", "zhou shan" and "zou shan". Among them, "zou san" and "zhou shan" are completely identical to pinyins of scene hotwords in the scene hotword set G: the scene hotwords corresponding to "zhou shan" are "Zhoushan-1" and "Zhoushan-2", and the scene hotword corresponding to "zou san" is "Zousan". In the specific implementation, it is first determined that, among the plurality of alternative pinyins "zou san", "zhou shan" and "zou shan", the target alternative pinyins "zou san" and "zhou shan", identical to pinyins of scene hotwords in the scene hotword set G, exist; then, based on the number of target alternative pinyins being 2, the difference value scores of "Zousan", "Zhoushan-1" and "Zhoushan-2" are calculated according to the number of replaced pinyins in each of "zou san" and "zhou shan" and the user's number of uses, or the heat, of "Zousan", "Zhoushan-1" and "Zhoushan-2".
In this example, the server can determine the difference value score of the scene hotword in the target scene hotword set by combining the alternative pinyin and the pinyin of the scene hotword in the target scene hotword set, so that the convenience and the intelligence for obtaining the difference value score are improved.
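The matching in steps A1-A3 amounts to looking up each alternative (replaced) pinyin in an index of the hotword pinyins. A sketch under assumed data shapes follows; the pinyin index and the return convention are illustrative, not from the application:

```python
def match_alternative_pinyins(candidates, hotword_pinyin):
    """Sketch of steps A1-A3: find the target alternative pinyins, i.e.
    the candidates identical to the pinyin of some scene hotword.
    `hotword_pinyin` maps each scene hotword to its pinyin (assumed
    representation).  Returns (targets, winner): `winner` is the single
    matched hotword when exactly one candidate matches one hotword,
    else None (step A4 then breaks the tie)."""
    pinyin_to_words = {}
    for word, py in hotword_pinyin.items():
        pinyin_to_words.setdefault(py, []).append(word)
    targets = [c for c in candidates if c in pinyin_to_words]  # step A1
    if not targets:
        return [], None          # A1 "no" branch: prompt the user
    if len(targets) == 1 and len(pinyin_to_words[targets[0]]) == 1:
        return targets, pinyin_to_words[targets[0]][0]         # steps A2-A3
    return targets, None         # at least two targets: defer to step A4
```

For the "Wan Hong" example, the single match "wan feng" directly yields the hotword "Wanfeng"; with two matches, the function defers the decision to the step A4 scoring.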
Specifically, the implementation manner of step A4 may include, but is not limited to:
And step B1, determining, among the target alternative pinyins, the first pinyins in which the fewest pinyins have been replaced, and determining whether the number of the first pinyins is greater than 1.
After the step B1, if the number of the first pinyin is larger than 1, the step B2 is executed.
And step B2, determining whether the user has used a scene hotword corresponding to the first pinyin.
After the step B2, if the user uses the scene hotword corresponding to the first pinyin, the step B3 is executed.
And step B3, determining whether the number of third vocabularies, namely the scene hotwords corresponding to the first pinyins that the user has used, is greater than 1.
After the step B3, if the number of the third vocabulary is greater than 1, the step B4 is executed.
And step B4, determining that the difference value score of the scene hotword with the highest number of uses, or the highest heat, among the third vocabularies is the highest.
For example, the target scene associated word is "Zhou San", whose pinyin is "zhou san"; the target scene hotword set is a scene hotword set J; the plurality of alternative pinyins of "Zhou San" are "zou san", "zhou shan" and "zou shan", among which the target alternative pinyins "zou san" and "zhou shan", identical to pinyins of scene hotwords in the scene hotword set J, exist, so that the number of target alternative pinyins is determined to be 2. The scene hotword corresponding to "zou san" is "Zousan", and the scene hotwords corresponding to "zhou shan" are "Zhoushan-1" and "Zhoushan-2". The user has used "Zousan" twice, "Zhoushan-1" 5 times and "Zhoushan-2" 11 times; the heat of "Zousan" is 301, the heat of "Zhoushan-1" is 8032, and the heat of "Zhoushan-2" is 26. In the specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zou san" and "zhou shan" (one replaced pinyin each); next, it is determined that the number of first pinyins, 2, is greater than 1; thereafter, it is determined that the third vocabularies, i.e. the scene hotwords corresponding to the first pinyins that the user has used, are "Zousan", "Zhoushan-1" and "Zhoushan-2"; finally, based on the number of third vocabularies, 3, being greater than 1, it is determined that the difference value score of the most frequently used "Zhoushan-2" is the highest, or it is determined that the difference value score of "Zhoushan-1", with the highest heat, is the highest.
As an optional branch, after step B3, if the number of the third vocabulary is equal to 1, step B5 is executed.
And step B5, determining that the difference value score of the third vocabulary is highest.
For example, the target scene associated word is "Zhou San", whose pinyin is "zhou san"; the target scene hotword set is a scene hotword set K; the plurality of alternative pinyins of "Zhou San" are "zou san", "zhou shan" and "zou shan", among which the target alternative pinyins "zou san" and "zhou shan", identical to pinyins of scene hotwords in the scene hotword set K, exist, so that the number of target alternative pinyins is determined to be 2. The scene hotword corresponding to "zou san" is "Zousan", and the scene hotwords corresponding to "zhou shan" are "Zhoushan-1" and "Zhoushan-2". The user has used "Zhoushan-1" 11 times, and has not used "Zousan" or "Zhoushan-2". In the specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zou san" and "zhou shan"; next, it is determined that the number of first pinyins, 2, is greater than 1; then, it is determined that the third vocabulary, i.e. the scene hotword corresponding to the first pinyins that the user has used, is "Zhoushan-1"; finally, based on the number of third vocabularies, 1, being equal to 1, it is determined that the difference value score of "Zhoushan-1" is the highest among "Zousan", "Zhoushan-1" and "Zhoushan-2".
As another optional branch, after step B2, if the user does not use the scene hotword corresponding to the first pinyin, step B6 is executed.
And step B6, determining that the difference value score of the scene hotword with the highest heat among the scene hotwords corresponding to the first pinyins is the highest.
For example, the target scene associated word is "Zhou San", whose pinyin is "zhou san"; the target scene hotword set is a scene hotword set K; the plurality of alternative pinyins of "Zhou San" are "zou san", "zhou shan" and "zou shan", among which the target alternative pinyins "zou san" and "zhou shan", identical to pinyins of scene hotwords in the scene hotword set K, exist, so that the number of target alternative pinyins is determined to be 2. The scene hotword corresponding to "zou san" is "Zousan", and the scene hotwords corresponding to "zhou shan" are "Zhoushan-1" and "Zhoushan-2". The user has used none of "Zousan", "Zhoushan-1" and "Zhoushan-2"; the heat of "Zousan" is 301, the heat of "Zhoushan-1" is 8032, and the heat of "Zhoushan-2" is 26. In the specific implementation, it is first determined that the first pinyins with the fewest replaced pinyins among the target alternative pinyins are "zou san" and "zhou shan"; next, it is determined that the number of first pinyins, 2, is greater than 1; then, based on the determination that the user has not used any of the scene hotwords "Zousan", "Zhoushan-1" and "Zhoushan-2" corresponding to the first pinyins, it is determined that the difference value score of "Zhoushan-1", with the highest heat, is the highest.
As another optional branch, after step B1, if the number of the first pinyins is equal to 1, step B7 is executed.
And B7, determining that the difference value score of the scene hot word corresponding to the first pinyin is highest.
For example, the target scene associated word is "Zhou San", whose pinyin is "zhou san"; the target scene hotword set is a scene hotword set G; the plurality of alternative pinyins of "Zhou San" are "zou san", "zhou shan" and "zou shan", among which the target alternative pinyins "zhou shan" and "zou shan", identical to pinyins of scene hotwords in the scene hotword set G, exist, so that the number of target alternative pinyins is determined to be 2; the scene hotword corresponding to "zhou shan" is "Zhoushan", and the scene hotword corresponding to "zou shan" is "Zoushan". In the specific implementation, it is first determined that the first pinyin with the fewest replaced pinyins among "zhou shan" (one replaced pinyin) and "zou shan" (two replaced pinyins) is "zhou shan"; then, based on the number of first pinyins being equal to 1, it is determined that the difference value score of the scene hotword "Zhoushan" corresponding to "zhou shan" is the highest.
In this example, the server can calculate the difference value score of the scene hot word corresponding to the target replacement pinyin according to the number of the replaced pinyins in the target replacement pinyin and the number of times of use or the popularity of the scene hot word corresponding to the target replacement pinyin by the user when the target replacement pinyin is at least two, so that the comprehensiveness and the accuracy of the determination of the difference value score of the scene hot word are improved.
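Steps B1-B7 above can be sketched as one tie-breaking routine. All of the data structures below (replacement counts per pinyin, a pinyin-to-hotwords map, use-count and heat tables) are assumed representations chosen for illustration:

```python
def break_tie(targets, replace_counts, hotwords_for, use_count, heat, by='uses'):
    """Sketch of steps B1-B7: pick the scene hotword when at least two
    target alternative pinyins matched.  `replace_counts[p]` is how many
    pinyins were replaced to obtain p; `hotwords_for[p]` lists the scene
    hotwords whose pinyin equals p; `use_count`/`heat` map a hotword to
    the user's number of uses / its heat (assumed representations)."""
    fewest = min(replace_counts[p] for p in targets)
    first = [p for p in targets if replace_counts[p] == fewest]   # step B1
    if len(first) == 1 and len(hotwords_for[first[0]]) == 1:      # step B7
        return hotwords_for[first[0]][0]
    cands = [w for p in first for w in hotwords_for[p]]
    used = [w for w in cands if use_count.get(w, 0) > 0]          # step B2
    if not used:                                                  # step B6
        return max(cands, key=lambda w: heat[w])
    if len(used) == 1:                                            # step B5
        return used[0]
    key = use_count if by == 'uses' else heat                     # steps B3-B4
    return max(used, key=lambda w: key[w])
```

With the illustrative "Zhou San" figures above, breaking the tie by use count selects the hotword used 11 times, breaking it by heat selects the one with heat 8032, and with no usage history at all the highest-heat hotword wins.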
In one possible example, before the determining the native and/or living address of the user, the method further comprises: obtaining the mandarin level of the user; determining that the Mandarin level does not reach a preset level.
The preset level may be Level 1-B, Level 1-A, Level 2-A, or the like, and may be set as required.
Furthermore, the method further comprises: after the Mandarin level of the user is obtained, if it is determined that the Mandarin level reaches the preset level, generating prompt information; and sending the prompt information to the terminal device to prompt the user that the user intention has not been recognized.
In this example, when the server performs pinyin replacement on the pinyin of the target scene associated word, the server can accurately determine the pronunciation characteristics of the user based on the mandarin level, the native place and/or the living address of the user, perform pinyin replacement on each pinyin corresponding to the target scene associated word according to the pronunciation characteristics to obtain a replaced pinyin, ensure that the replaced pinyin better conforms to the pronunciation habit of the user, and further improve the reliability of the replaced pinyin.
It can be understood that, since the method embodiment and the apparatus embodiment are different presentation forms of the same technical concept, the content of the method embodiment portion in the present application should be synchronously adapted to the apparatus embodiment portion, and is not described herein again.
Consistent with the above-described embodiments, as shown in fig. 4, fig. 4 is a block diagram of functional units of a speech recognition apparatus according to an embodiment of the present application. In fig. 4, the speech recognition apparatus 400 is applied to a server in a speech recognition system, the speech recognition system includes a terminal device for performing speech interaction between the server and a user, the server includes a human-computer interaction engine for supporting human-computer speech interaction, and the speech recognition apparatus 400 includes:
an obtaining unit 401, configured to invoke the human-machine interaction engine to interact with the user through the terminal device, acquire the target voice information input by the user in the interaction process, and perform character recognition on the target voice information to obtain a first text;
a scene recognition unit 402, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
a scene related word extracting unit 403, configured to perform scene related word extraction on the first text to obtain a target scene related word corresponding to the first text, where the target scene related word is used to represent service content of the service type that needs to be provided and is expressed by the first text;
a scene hotword set query unit 404, configured to perform scene hotword set query according to the target service scene to obtain a target scene hotword set corresponding to the target service scene;
a comparison unit 405, configured to perform pinyin comparison on the target scene associated word and the scene hotword in the target scene hotword set to obtain a difference score between the target scene associated word and the scene hotword in the target scene hotword set, where the scene hotword is a word whose query hotness is greater than a hotness threshold;
a first determining unit 406, configured to determine a target scene hotword with a highest difference score in the target scene hotword set;
a replacing unit 407, configured to replace the target scene associated word in the first text with the target scene hotword to obtain a second text;
a second determining unit 408, configured to determine, according to the second text, a user intention expressed by the target speech information;
and the service unit 409 is configured to execute a corresponding service operation according to the determined user intention.
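The cooperation of units 401-409 can be sketched as a minimal pipeline up to the second text. The callables injected below (ASR, scene classifier, associated-word extractor, scorer) are placeholders assumed for illustration, not implementations claimed by the application:

```python
class SpeechRecognizer:
    """Sketch of the apparatus of fig. 4: each step of recognize() stands
    in for one functional unit; the injected helpers are assumptions."""
    def __init__(self, asr, scene_of, associated_word_of, hotword_sets, score):
        self.asr = asr                          # unit 401: speech -> first text
        self.scene_of = scene_of                # unit 402: text -> service scene
        self.associated_word_of = associated_word_of   # unit 403
        self.hotword_sets = hotword_sets        # unit 404: scene -> hotword set
        self.score = score                      # unit 405: (word, hotword) -> score

    def recognize(self, target_voice):
        first_text = self.asr(target_voice)                        # 401
        scene = self.scene_of(first_text)                          # 402
        word = self.associated_word_of(first_text)                 # 403
        hotwords = self.hotword_sets[scene]                        # 404
        best = max(hotwords, key=lambda h: self.score(word, h))    # 405-406
        # Units 408-409 (intent determination and service execution)
        # would consume the second text downstream.
        return first_text.replace(word, best)                      # 407: second text
```

A usage sketch: with a pass-through ASR stub and a scorer that prefers one homophone, recognizing "play Ashan's song" yields the second text with "Ashan" replaced by the winning hotword.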
It can be understood that, since the method embodiment and the apparatus embodiment are different presentation forms of the same technical concept, the content of the method embodiment portion in the present application should be synchronously adapted to the apparatus embodiment portion, and is not described herein again.
In the case of using an integrated unit, as shown in fig. 5, fig. 5 is a block diagram of functional units of another speech recognition apparatus provided in the embodiment of the present application. In fig. 5, a speech recognition apparatus 510 includes: a processing module 512 and a communication module 511.
The processing module 512 is configured to invoke the human-computer interaction engine through the communication module 511 to interact with the user through the terminal device, and acquire target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text; performing scene recognition on the first text, and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing a service type which is expressed by the first text and needs to be provided; extracting scene associated words from the first text to obtain target scene associated words corresponding to the first text, wherein the target scene associated words are used for representing the service content of the service type required to be provided and expressed by the first text; performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene; performing pinyin comparison on the target scene associated word and the scene hot word in the target scene hot word set to obtain a difference value score of the target scene associated word and the scene hot word in the target scene hot word set, wherein the scene hot word is a vocabulary with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the vocabulary in all users; determining a target scene hot word with the highest difference value score in the target scene hot word set; replacing a target scene associated word in the first text with the target scene hot word to obtain a second text; determining the user intention expressed by the target voice information according to the second text; and executing corresponding service operation according to the determined user intention. 
For example, the processing module 512 performs some of the steps of the acquiring unit 401, the scene identifying unit 402, the scene related word extracting unit 403, the scene hotword set querying unit 404, the comparing unit 405, the first determining unit 406, the replacing unit 407, the second determining unit 408, and the service unit 409, and/or other processes of the techniques described herein. The communication module 511 supports interaction between the speech recognition apparatus 510 and other devices. As shown in fig. 5, the speech recognition apparatus 510 may further include a storage module 513 for storing the program code and data of the speech recognition apparatus 510.
The processing module 512 may be a processor or a controller, for example a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure, and may also be a combination of computing devices, e.g., one or more microprocessors, or a DSP combined with a microprocessor. The communication module 511 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 513 may be a memory.
For relevant details of each scene involved in the method embodiment, reference may be made to the functional description of the corresponding functional module, which is not repeated here. The speech recognition apparatus 510 can perform the speech recognition method shown in fig. 2.
The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are generated in whole or in part when a computer instruction or a computer program is loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. Computer-readable storage media can be any available media that can be accessed by a computer or a data storage device, such as a server, data center, etc., that contains one or more collections of available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media. The semiconductor medium may be a solid state disk.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, electronic device 600 may include one or more of the following components: a processor 601, a memory 602 coupled to the processor 601, wherein the memory 602 may store one or more programs, and the one or more programs may be configured to implement the methods described in the embodiments as described above when executed by the one or more processors 601. The electronic device 600 may be a server in the voice recognition system.
Processor 601 may include one or more processing cores. The processor 601 connects the various parts of the electronic device 600 using various interfaces and lines, and performs the various functions of the electronic device 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 602 and by calling data stored in the memory 602. Alternatively, the processor 601 may be implemented in at least one of the hardware forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 601 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU renders and draws display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 601 and may instead be implemented by a separate communication chip.
The Memory 602 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 602 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 602 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored data area may also store data created during use by the electronic device 600, and the like.
It is understood that the electronic device 600 may include more or fewer components than shown in the above block diagram, for example a power module, physical buttons, a Wireless Fidelity (WiFi) module, a speaker, a Bluetooth module, sensors, etc., which are not limited herein.
Embodiments of the present application also provide a computer storage medium storing a computer program/instructions which, when executed by a processor, implement some or all of the steps of any method described in the above method embodiments.
An embodiment of the present application further provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the method according to the first aspect of the embodiments of the present application.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply any order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and system may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard drive, a diskette, an optical disk, volatile memory, or non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
Although the present invention is disclosed above, it is not limited thereto. Any person skilled in the art can make changes or substitutions without departing from the spirit and scope of the invention, including different combinations of functions, implementation steps, and software and hardware implementations, all of which fall within the scope of the invention.
Claims (10)
1. A voice recognition method is applied to a server in a voice recognition system, the voice recognition system comprises the server and terminal equipment for voice interaction of a user, the server comprises a man-machine interaction engine supporting man-machine voice interaction, and the method comprises the following steps:
calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
performing scene recognition on the first text, and determining a target service scene corresponding to the first text, wherein the target service scene is used for representing a service type which is expressed by the first text and needs to be provided;
extracting scene associated words from the first text to obtain target scene associated words corresponding to the first text, wherein the target scene associated words are used for representing the service content of the service type required to be provided and expressed by the first text;
performing scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
performing pinyin comparison on the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set, wherein the scene hot words are vocabularies with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the vocabularies in all users;
determining a target scene hot word with the highest difference value score in the target scene hot word set;
replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
determining, according to the second text, the user intention expressed by the target voice information; and
executing a corresponding service operation according to the determined user intention.
2. The method according to claim 1, wherein the obtaining a difference score between the target scene associated word and the scene hotword in the target scene hotword set by performing pinyin comparison between the target scene associated word and the scene hotword in the target scene hotword set comprises:
determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set;
if yes, determining whether the number of the first vocabulary is larger than 1;
if yes, determining whether the user has queried the first vocabulary;
if yes, determining whether the number of the second words inquired in the first words is larger than 1;
if yes, determining whether the time interval between the query time of the user for each second vocabulary and the current time is greater than a preset interval;
if yes, determining that the difference value score of the scene hot word with the largest query frequency in the second vocabulary is the highest;
if not, determining that the difference value score of the scene hot word with the shortest time interval between the query time and the current time in the second vocabulary is the highest.
3. The method according to claim 2, wherein after determining whether a first vocabulary completely identical to the pinyin of the target scene associated word exists in the target scene hot word set, if the first vocabulary does not exist, performing pinyin replacement on the pinyin of the target scene associated word to obtain a replaced pinyin, and comparing the replaced pinyin with the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set; and,
after determining whether the number of the first vocabulary is greater than 1, if the number of the first vocabulary is equal to 1, determining that the difference value score of the first vocabulary is the highest; and,
after determining whether the user has queried the first vocabulary, if the user has never queried the first vocabulary, determining that the difference value score of the scene hot word with the highest heat degree in the first vocabulary is the highest; and,
after determining whether the number of the second vocabulary inquired in the first vocabulary is greater than 1, if the number of the second vocabulary is equal to 1, determining that the difference score of the second vocabulary is the highest.
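The exact-pinyin-match cascade of claims 2-3 can be sketched as follows. This is a hedged illustration: the hotword records and the user's per-word query history are modeled as plain dicts, and the field names (`word`, `pinyin`, `heat`, `count`, `last_time`) are assumptions, not the patent's data model. A `None` return stands for the fall-through to pinyin replacement in claim 3.

```python
def pick_exact_match(assoc_pinyin, hotwords, user_history, now, preset_interval):
    """hotwords: list of {"word", "pinyin", "heat"};
    user_history: {word: {"count", "last_time"}} of this user's past queries."""
    # "first vocabulary": hotwords whose pinyin is identical to the associated word's
    first = [h for h in hotwords if h["pinyin"] == assoc_pinyin]
    if not first:
        return None                      # no exact match: fall through to pinyin replacement
    if len(first) == 1:
        return first[0]["word"]
    # "second vocabulary": the matches this user has actually queried before
    queried = [h for h in first if h["word"] in user_history]
    if not queried:                      # user queried none: pick the hottest match
        return max(first, key=lambda h: h["heat"])["word"]
    if len(queried) == 1:
        return queried[0]["word"]
    # all queries older than the preset interval -> prefer query frequency;
    # otherwise prefer the most recently queried word
    if all(now - user_history[h["word"]]["last_time"] > preset_interval
           for h in queried):
        return max(queried, key=lambda h: user_history[h["word"]]["count"])["word"]
    return max(queried, key=lambda h: user_history[h["word"]]["last_time"])["word"]
```

The effect is a personalization ladder: exact pinyin match first, then this user's own history, then global heat, with recency overriding frequency when a recent query exists.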
4. The method as claimed in claim 3, wherein the performing pinyin replacement on the pinyin of the target scene associated word to obtain the replaced pinyin comprises:
determining a native and/or living address of the user;
determining pronunciation characteristics corresponding to the native place and/or the life address;
determining, according to the pronunciation characteristics, the number of pinyins that can undergo pinyin replacement among the pinyins corresponding to the target scene associated word;
and if that number is greater than 1, performing pinyin replacement sequentially according to the order of appearance of each character requiring pinyin replacement in the target scene associated word, to obtain multiple replaced pinyins.
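Claim 4's accent-driven replacement can be illustrated with a small confusion table. The flat/retroflex pairs (s/sh, z/zh, c/ch) and n/l used here are a common example of regional pronunciation features; the table, the syllable splitter, and the enumeration order are all assumptions standing in for the pronunciation characteristics the patent derives from the user's native place or living address.

```python
from itertools import product

# Hypothetical accent-confusion table: initials a regional speaker may merge.
ACCENT_SWAPS = {"s": "sh", "sh": "s", "z": "zh", "zh": "z",
                "c": "ch", "ch": "c", "n": "l", "l": "n"}

def split_initial(syllable):
    """Split a pinyin syllable into (initial, final); two-letter initials first."""
    for ini in ("zh", "ch", "sh"):
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    if syllable and syllable[0] in "bpmfdtnlgkhjqxzcsryw":
        return syllable[0], syllable[1:]
    return "", syllable

def replacement_pinyins(syllables):
    """All variants reachable by swapping confusable initials, generated in the
    order the replaceable syllables appear (cf. claim 4)."""
    options = []
    for syl in syllables:
        ini, fin = split_initial(syl)
        opts = [syl]
        if ini in ACCENT_SWAPS:
            opts.append(ACCENT_SWAPS[ini] + fin)
        options.append(opts)
    return [" ".join(combo) for combo in product(*options)][1:]  # drop original

print(replacement_pinyins(["si", "ji"]))  # prints: ['shi ji']
```

With two replaceable syllables, the Cartesian enumeration yields each single swap and the combined swap, giving the "multiple replaced pinyins" that claim 5 then matches against the hotword set.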
5. The method according to claim 4, wherein the comparing the replaced pinyin with the scene hotwords in the target scene hotword set to obtain a difference score between the target scene associated word and the scene hotwords in the target scene hotword set includes:
determining whether a target replaced pinyin identical to the pinyin of a scene hot word in the target scene hot word set exists among the multiple replaced pinyins;
if yes, determining the number of the target replaced pinyins;
if the number of the target replaced pinyins is 1, determining that the difference value score of the scene hot word corresponding to the target replaced pinyin is the highest;
if the number of the target replaced pinyins is at least two, calculating the difference value score of the scene hot word corresponding to each target replaced pinyin according to the number of substituted pinyins in the target replaced pinyin and the number of times or the heat with which the user uses the scene hot word corresponding to the target replaced pinyin.
6. The method of claim 5, wherein the calculating the difference value score of the scene hot word corresponding to each target replaced pinyin according to the number of substituted pinyins in the target replaced pinyin and the number of times or the heat with which the user uses the scene hot word corresponding to the target replaced pinyin comprises:
determining the first pinyins, namely the target replaced pinyins with the fewest substituted pinyins, and determining whether the number of the first pinyins is greater than 1;
if the number of the first pinyin is larger than 1, determining whether the user uses the scene hotword corresponding to the first pinyin;
if the user uses the scene hot word corresponding to the first pinyin, determining whether the number of third words used by the user in the scene hot word corresponding to the first pinyin is greater than 1;
if the number of the third vocabulary is larger than 1, determining that the difference value score of the scene hot word with the highest use frequency or the highest heat degree in the third vocabulary is the highest;
if the number of the third vocabulary is equal to 1, determining that the difference value score of the third vocabulary is the highest;
if the user does not use the scene hot word corresponding to the first pinyin, determining that the difference value score of the scene hot word with the highest heat in the scene hot words corresponding to the first pinyin is the highest;
and if the number of the first Pinyin is equal to 1, determining that the difference value score of the scene hotword corresponding to the first Pinyin is the highest.
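Claims 5-6 then rank the hotwords matched by a replaced pinyin: fewest substitutions first, then this user's own usage, then global heat. Below is a sketch under the assumption that each matched candidate carries its substitution count; the record shape (`word`, `swaps`, `heat`) and the choice of usage count over heat for the final tie-break are illustrative, not the claimed implementation.

```python
def pick_replaced_match(candidates, user_use_counts):
    """candidates: list of {"word", "swaps", "heat"} - scene hotwords matched by
    some replaced pinyin, where "swaps" is the number of substituted syllables.
    user_use_counts: {word: times this user has used it}."""
    if not candidates:
        return None
    if len(candidates) == 1:             # a single target replaced pinyin (claim 5)
        return candidates[0]["word"]
    # "first pinyins": candidates reached with the fewest substitutions
    fewest = min(c["swaps"] for c in candidates)
    first = [c for c in candidates if c["swaps"] == fewest]
    if len(first) == 1:
        return first[0]["word"]
    # "third vocabulary": the ones this user has actually used before
    third = [c for c in first if user_use_counts.get(c["word"], 0) > 0]
    if not third:                        # user used none: fall back to global heat
        return max(first, key=lambda c: c["heat"])["word"]
    if len(third) == 1:
        return third[0]["word"]
    # several used: prefer the one this user used most often
    return max(third, key=lambda c: user_use_counts[c["word"]])["word"]
```

As in the exact-match branch, the fewer syllables that had to be "corrected" for accent, the more trustworthy the candidate, with personal history breaking the remaining ties.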
7. The method of claim 4, wherein prior to determining the user's native and/or living address, the method further comprises:
obtaining the mandarin level of the user;
determining that the Mandarin level does not reach a preset level.
8. A speech recognition apparatus, applied to a server in a speech recognition system, wherein the speech recognition system comprises the server and a terminal device for voice interaction of a user, the server comprises a human-machine interaction engine supporting human-machine voice interaction, and the apparatus comprises:
the acquisition unit is used for calling the human-computer interaction engine to interact with the user through the terminal equipment, and acquiring target voice information input by the user in the interaction process; performing character recognition on the target voice information to obtain a first text;
a scene recognition unit, configured to perform scene recognition on the first text, and determine a target service scene corresponding to the first text, where the target service scene is used to represent a service type that needs to be provided and is expressed by the first text;
the scene associated word extracting unit is used for extracting a scene associated word from the first text to obtain a target scene associated word corresponding to the first text, wherein the target scene associated word is used for representing the service content of the service type required to be provided and expressed by the first text;
the scene hot word set query unit is used for carrying out scene hot word set query according to the target service scene to obtain a target scene hot word set corresponding to the target service scene;
the comparison unit is used for performing pinyin comparison on the target scene associated word and the scene hot words in the target scene hot word set to obtain a difference value score between the target scene associated word and the scene hot words in the target scene hot word set, wherein the scene hot words are words with the heat degree greater than a heat degree threshold value, and the heat degree refers to the query heat degree of the words in all users;
the first determining unit is used for determining a target scene hotword with the highest score of the difference value in the target scene hotword set;
the replacing unit is used for replacing the target scene associated words in the first text with the target scene hot words to obtain a second text;
a second determining unit, configured to determine, according to the second text, the user intention expressed by the target voice information; and,
and the service unit is used for executing corresponding service operation according to the determined user intention.
9. An electronic device comprising a processor, memory, and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program/instructions is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211487069.2A CN115547337B (en) | 2022-11-25 | 2022-11-25 | Speech recognition method and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115547337A true CN115547337A (en) | 2022-12-30 |
CN115547337B CN115547337B (en) | 2023-03-03 |
Family
ID=84719741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211487069.2A Active CN115547337B (en) | 2022-11-25 | 2022-11-25 | Speech recognition method and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115547337B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115860823A (en) * | 2023-03-03 | 2023-03-28 | 深圳市人马互动科技有限公司 | Data processing method in human-computer interaction questionnaire answering scene and related product |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106030699A (en) * | 2014-10-09 | 2016-10-12 | 谷歌公司 | Hotword detection on multiple devices |
CN109346060A (en) * | 2018-11-28 | 2019-02-15 | 珂伯特机器人(天津)有限公司 | Audio recognition method, device, equipment and storage medium |
CN109920432A (en) * | 2019-03-05 | 2019-06-21 | 百度在线网络技术(北京)有限公司 | A kind of audio recognition method, device, equipment and storage medium |
US20200035217A1 (en) * | 2019-08-08 | 2020-01-30 | Lg Electronics Inc. | Method and device for speech processing |
CN111292745A (en) * | 2020-01-23 | 2020-06-16 | 北京声智科技有限公司 | Method and device for processing voice recognition result and electronic equipment |
CN113160822A (en) * | 2021-04-30 | 2021-07-23 | 北京百度网讯科技有限公司 | Speech recognition processing method, speech recognition processing device, electronic equipment and storage medium |
CN113223516A (en) * | 2021-04-12 | 2021-08-06 | 北京百度网讯科技有限公司 | Speech recognition method and device |
US20220092276A1 (en) * | 2020-09-22 | 2022-03-24 | Samsung Electronics Co., Ltd. | Multimodal translation method, apparatus, electronic device and computer-readable storage medium |
US20220165277A1 (en) * | 2020-11-20 | 2022-05-26 | Google Llc | Adapting Hotword Recognition Based On Personalized Negatives |
Also Published As
Publication number | Publication date |
---|---|
CN115547337B (en) | 2023-03-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10417344B2 (en) | Exemplar-based natural language processing | |
CN110069608B (en) | Voice interaction method, device, equipment and computer storage medium | |
US9930167B2 (en) | Messaging application with in-application search functionality | |
US20190147052A1 (en) | Method and apparatus for playing multimedia | |
US20180286459A1 (en) | Audio processing | |
US8170537B1 (en) | Playing local device information over a telephone connection | |
US20110167350A1 (en) | Assist Features For Content Display Device | |
US10313713B2 (en) | Methods, systems, and media for identifying and presenting users with multi-lingual media content items | |
US20130268826A1 (en) | Synchronizing progress in audio and text versions of electronic books | |
US20240070217A1 (en) | Contextual deep bookmarking | |
CN112102841B (en) | Audio editing method and device for audio editing | |
AU2006325555B2 (en) | A method and apparatus for accessing a digital file from a collection of digital files | |
CN105912586B (en) | Information searching method and electronic equipment | |
KR101567449B1 (en) | E-Book Apparatus Capable of Playing Animation on the Basis of Voice Recognition and Method thereof | |
CN115547337B (en) | Speech recognition method and related product | |
KR102353797B1 (en) | Method and system for suppoting content editing based on real time generation of synthesized sound for video content | |
CN105684012B (en) | Providing contextual information | |
CN112825088A (en) | Information display method, device, equipment and storage medium | |
CN113360127B (en) | Audio playing method and electronic equipment | |
JP7229296B2 (en) | Related information provision method and system | |
US20140297285A1 (en) | Automatic page content reading-aloud method and device thereof | |
CN115687807A (en) | Information display method, device, terminal and storage medium | |
CN112837668B (en) | Voice processing method and device for processing voice | |
CN114630179A (en) | Audio extraction method and electronic equipment | |
JP7562610B2 (en) | Content editing support method and system based on real-time generation of synthetic sound for video content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||