CN114964300A - Voice recognition method and navigation device

Voice recognition method and navigation device

Info

Publication number
CN114964300A
Authority
CN
China
Prior art keywords
voice
character
characters
matching
word
Prior art date
Legal status
Granted
Application number
CN202210713257.6A
Other languages
Chinese (zh)
Other versions
CN114964300B (en)
Inventor
王晨光
周帅
杨国荣
晏承彬
Current Assignee
Shenzhen Zhiyuanlian Technology Co., Ltd.
Original Assignee
Shenzhen Zhiyuanlian Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhiyuanlian Technology Co., Ltd.
Priority to CN202210713257.6A
Publication of CN114964300A
Application granted
Publication of CN114964300B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/34 Route searching; Route guidance
    • G01C 21/36 Input/output arrangements for on-board computers
    • G01C 21/3605 Destination input or retrieval
    • G01C 21/3608 Destination input or retrieval using speech input, e.g. using speech recognition
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/34 Route searching; Route guidance
    • G01C 21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/34 Route searching; Route guidance
    • G01C 21/3453 Special cost functions, i.e. other than distance or default speed limit of road segments
    • G01C 21/3476 Special cost functions, i.e. other than distance or default speed limit of road segments using point of interest [POI] information, e.g. a route passing visible POIs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Navigation (AREA)

Abstract

The invention provides a voice recognition method and a navigation device. The method enters a voice monitoring state upon receiving a voice wake-up instruction and detects the endpoint positions of a voice signal, the endpoint positions comprising a start position and an end position of the voice signal. The voice monitoring state is ended after the end position of the voice signal is detected, and the voice signal between the start position and the end position is processed to obtain an effective voice signal. The effective voice signal is input into a voice recognition model and converted into corresponding text information, the text information being a character string consisting of one or more characters. The text information is input into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database, and interest point lists corresponding to the matching results of the skip-word, full-word and/or near-sound matching are output respectively. The method can improve the success rate of voice recognition for non-popular interest points.

Description

Voice recognition method and navigation device
Technical Field
The present invention relates to the field of electronic information technologies, and in particular, to a speech recognition method and a navigation device.
Background
With the development of mapping, positioning, and navigation technologies, the navigation functions of devices such as mobile phones and vehicle navigation units have become increasingly powerful, greatly facilitating people's travel. Speech recognition and voice control technologies in particular free the driver's hands: by issuing voice control instructions, the user can control the navigation device while staying focused on driving, ensuring driving safety. Interest point search is one of the most commonly used voice control functions on a navigation device; for example, a user searches for a specific place such as a mall, a restaurant, or a park and sets it as the navigation destination, thereby obtaining information such as the distance to the destination, the travel route, road conditions along the way, and the arrival time. Speech recognition technology is based on artificial intelligence learning algorithms: a speech recognition model for interest points can be built by collecting the voice control data of a large number of users as learning samples, which greatly improves the success rate of interest point recognition. For popular interest points in particular, even if the input voice information is unclear due to environmental noise, fast speech, or slurred pronunciation, the interest point can usually still be accurately recognized and shown in a recommendation list through fuzzy matching. However, the low success rate of speech recognition for non-popular interest points remains a difficult problem in the prior art. Because every person has different speaking habits and varies in speed, intonation, accent, and so on, an artificial intelligence learning algorithm can do little to improve the recognition success rate for non-popular interest points when a large number of learning samples from the same user cannot be obtained.
Disclosure of Invention
In view of the above problems, the invention provides a voice recognition method and a navigation device that can improve the success rate of voice recognition for non-popular interest points.
In view of the above, a first aspect of the present invention provides a speech recognition method, including:
receiving a voice wake-up instruction to enter a voice monitoring state;
detecting endpoint positions of a voice signal, wherein the endpoint positions comprise a start position and an end position of the voice signal;
ending the voice monitoring state after the end position of the voice signal is detected;
processing the voice signal between the start position and the end position to obtain an effective voice signal;
inputting the effective voice signal into a voice recognition model to convert it into corresponding text information, wherein the text information is a character string formed of one or more characters;
inputting the text information into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database;
and outputting interest point lists corresponding to the matching results of the skip-word matching, the full-word matching and/or the near-sound matching respectively.
Further, in the above speech recognition method, the step of inputting the text information into the skip-word matching function to query the matched interest point names in the interest point database specifically includes:
determining whether a skip-word interval exists in the text information;
if so, inserting, at each position in the text information where a skip-word interval exists, a number of single-character wildcards corresponding to the number of skipped characters, wherein each single-character wildcard represents exactly 1 arbitrary character;
inserting generic wildcards at the head and the tail of the text information respectively to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
Further, in the above speech recognition method, the step of determining whether a skip-word interval exists in the text information specifically includes:
performing a time-domain segmentation operation on the effective voice signal to obtain a plurality of voice segments of equal duration, wherein the number of voice segments is greater than or equal to the number of characters in the text information;
establishing, according to the time-domain correspondence, an association between each character in the text information and those voice segments whose short-time average energy is greater than a preset first threshold, wherein each character is associated with one or more voice segments;
counting the average number of associated voice segments per character;
counting the number of voice segments not associated with any character between adjacent characters in the text information;
and when the number of unassociated voice segments between any two adjacent characters is greater than a preset second threshold, determining that a skip-word interval exists between the two adjacent characters, wherein the second threshold is greater than the average number of voice segments.
Further, in the above speech recognition method, the step of counting the number of unassociated voice segments between adjacent characters in the text information specifically includes counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold, and the step of determining that a skip-word interval exists between two adjacent characters specifically includes:
determining that a skip-word interval exists between any two adjacent characters when the number of unassociated voice segments between them is greater than the preset second threshold and the number of voice segments between them whose short-time average energy is greater than the preset first threshold is greater than a preset third threshold.
Further, in the above speech recognition method, after the step of determining that a skip-word interval exists between two adjacent characters, the method further includes calculating the number of skip-word intervals between the two adjacent characters according to the number of unassociated voice segments between them.
Further, in the above speech recognition method, when the number of skip-word intervals between the two adjacent characters is greater than 2, a speech recognition failure is prompted and the voice monitoring state is re-entered.
Further, in the above speech recognition method, after the step of counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold, the method further includes:
counting the number of consecutive unassociated voice segments between adjacent characters in the text information whose short-time average energy is greater than the preset first threshold;
and when the number of consecutive voice segments is smaller than a preset fourth threshold, executing the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database and outputting the interest point list corresponding to the full-word matching result; otherwise, not executing that step.
Further, in the above speech recognition method, the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database specifically includes:
inserting generic wildcards at the head and the tail of the text information respectively to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
Further, in the above speech recognition method, the step of inputting the text information into the near-sound matching function to query the matched interest point names in the interest point database specifically includes:
acquiring homophonic characters and near-sound characters of each character in the text information;
replacing the corresponding characters in the text information with the homophonic characters and near-sound characters respectively to generate near-sound text information;
inserting generic wildcards at the head and the tail of the text information and of each piece of near-sound text information respectively, and splicing the results in an OR relationship to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
A second aspect of the present invention provides a navigation device, including a display unit, a positioning unit, a storage unit, and a processing unit, where the processing unit is configured to execute a computer program stored in the storage unit to implement any of the voice recognition methods provided in the first aspect of the present invention.
The invention thus provides a voice recognition method and a navigation device: a voice monitoring state is entered upon receiving a voice wake-up instruction; the endpoint positions of a voice signal, comprising a start position and an end position, are detected; the voice monitoring state is ended after the end position of the voice signal is detected; the voice signal between the start position and the end position is processed to obtain an effective voice signal; the effective voice signal is input into a voice recognition model and converted into corresponding text information, the text information being a character string consisting of one or more characters; the text information is input into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database; and interest point lists corresponding to the matching results of the skip-word, full-word and/or near-sound matching are output respectively, which can improve the success rate of voice recognition for non-popular interest points.
Drawings
FIG. 1 is a schematic flow chart of a speech recognition method provided by one embodiment of the present invention;
FIG. 2 is a schematic flow chart of a skip-word matching method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for determining a skip-word interval according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
In the description of the present invention, the terms "plurality" and "a plurality" mean two or more unless otherwise specified. Terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience of description and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the invention. The terms "connected", "mounted", "secured", and the like are to be construed broadly and include, for example, fixed, removable, or integral connections; connections may be direct or indirect through an intermediary. The specific meanings of these terms in the present invention can be understood by those skilled in the art according to the specific situation. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
In the description herein, reference to the term "one embodiment," "some embodiments," "specific examples," or the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
A speech recognition method and a navigation device according to some embodiments of the present invention are described below with reference to fig. 1 to 3.
As shown in fig. 1, a first aspect of the present invention provides a speech recognition method, including:
S100: receiving a voice wake-up instruction to enter a voice monitoring state. The voice recognition method provided by the invention is applied to a navigation device. A user puts the navigation device into the voice monitoring state through a default or user-defined voice wake-up instruction; in this state the navigation device continuously monitors surrounding sound, recognizes control instructions from the voice information, and executes the corresponding operations. In the voice monitoring state the navigation device displays a monitoring icon on the screen to prompt the user that a voice control instruction can now be issued. When no voice information is detected within a preset time, for example 10 seconds, the navigation device exits the voice monitoring state, so that ambient voices or irrelevant conversation do not trigger the control system into erroneous operations.
S200: detecting the endpoint positions of a voice signal, wherein the endpoint positions comprise a start position and an end position of the voice signal. In the voice monitoring state, the navigation device continuously analyzes the frequency-domain features and short-time energy features of the ambient sound. When the ambient sound contains sound information matching the frequency-domain and short-time energy characteristics of a human voice, the device determines that a voice signal has been detected and takes the time at which the voice signal begins as its start position. When the ambient sound contains no voice signal for longer than a preset duration, for example 3 seconds, the device takes the time point at which the voice signal ended as its end position.
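As a rough illustration of the energy-based part of this endpoint detection, a minimal sketch follows; the 3-second silence rule mirrors the example above, while the frame size and energy threshold are assumed values, and the frequency-domain voice check described in the text is omitted:

```python
import numpy as np

def detect_endpoints(samples, sr=16000, frame_ms=50,
                     energy_thresh=1e-4, max_silence_s=3.0):
    """Illustrative energy-based endpoint detection (a sketch, not the
    patent's exact algorithm, which also uses frequency-domain features).

    samples: mono waveform normalized to [-1, 1].
    Returns (start, end) sample indices, or (None, None) if no speech.
    """
    frame_len = int(sr * frame_ms / 1000)
    max_silent_frames = int(max_silence_s * 1000 / frame_ms)
    start = end = None
    silent = 0
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))   # short-time average energy
        if energy > energy_thresh:
            if start is None:
                start = i * frame_len         # start position of the voice
            end = (i + 1) * frame_len         # extend the end position
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= max_silent_frames:   # >3 s without speech: stop
                break
    return start, end
```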
S300: ending the voice monitoring state after the end position of the voice signal is detected. Detection of the end position indicates that no further voice signal has been detected for the preset duration; that is, if the voice signal is a voice control instruction issued by the user, detecting its end position means the user has finished speaking the instruction, and the navigation device may therefore end the voice monitoring state.
S400: processing the voice signal between the start position and the end position to obtain an effective voice signal. The navigation device samples and analog-to-digital converts the raw sound signal through a sound acquisition unit such as a microphone. Besides the voice signal, the raw sound signal contains a large amount of noise, such as noises produced by vibration, friction, and collision of surrounding objects or by airflow, and noises produced by the breathing, movement, and conversation of nearby people; the voice signal between the start position and the end position is therefore extracted and denoised to obtain the effective voice signal.
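A minimal sketch of this step, assuming the endpoints from S200 and using a crude amplitude gate in place of a real noise-suppression algorithm (which the patent does not specify):

```python
import numpy as np

def extract_effective_signal(samples, start, end, gate_ratio=0.05):
    """Trim the waveform to the detected endpoints and apply a crude
    amplitude gate as a stand-in for proper noise suppression.

    gate_ratio is an illustrative parameter, not a value from the patent.
    """
    segment = np.asarray(samples[start:end], dtype=np.float64)
    if segment.size == 0:
        return segment
    segment -= segment.mean()                       # remove DC offset
    gate = gate_ratio * np.max(np.abs(segment))     # crude noise floor
    segment[np.abs(segment) < gate] = 0.0           # silence low-level noise
    return segment
```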
S500: inputting the effective voice signal into a voice recognition model to convert it into corresponding text information, wherein the text information is a character string formed of one or more characters. The effective voice signal is input into a pre-trained voice recognition model that converts it into corresponding text. In some embodiments of the invention the speech recognition model is integrated in the navigation device, which performs the recognition locally. In other embodiments, the navigation device performs recognition through a speech recognition model on a cloud server; the powerful computing resources of the cloud allow the recognition to be completed more efficiently and accurately, and the large number of voice samples collected in the cloud continuously improves the model. After conversion, the result may be a single character or a character string of several characters. When the converted text information contains only one character, the voice input was likely erroneous or misrecognized; in this case the navigation device ends the processing flow and waits for the user's next voice wake-up instruction. Furthermore, an interest point name is generally a geographical name such as a street, residential-compound, or building name, or a business name such as a restaurant, shop, or supermarket. Such a name usually has no inherent meaning and may be a string of two or more semantically unrelated characters, so interest point names cannot be recognized by semantic analysis, nor can the character string be semantically verified against a dictionary or vocabulary. When the voice recognition model yields multiple possible conversions of the effective voice information, the character string with the highest matching degree is selected as the text information, where matching degree covers identical or closest pronunciation; when several same-pronunciation candidate strings each correspond to a name in the interest point database, the candidate closest to the current position is selected.
S600: inputting the text information into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database. The skip-word matching function searches the interest point database, using a keyword generated from the text information, for interest point names that contain the characters of the text information in a non-contiguous arrangement. The full-word matching function searches the interest point database for interest point names that contain the text information as a contiguous substring. The near-sound matching function searches the interest point database for interest point names whose pronunciation is identical or similar to that of the text information.
S700: outputting interest point lists corresponding to the matching results of the skip-word, full-word and/or near-sound matching respectively. The number of interest point names in the list output by any of the three matching functions may be 0 or any number greater than or equal to 1. Further, when the number of interest point names matched by any one of the three matching functions exceeds 5, the top 5 names are selected for output according to a preset sorting rule. Furthermore, the navigation device displays three matching-result output areas side by side on the screen, showing the interest point lists output by the skip-word, full-word, and near-sound matching functions respectively; when the number of names matched by a function is 0, the corresponding output area is not displayed.
By adopting the technical solution of this embodiment, the text obtained by voice recognition is matched against the interest point names in the interest point database by skip-word matching, full-word matching, and near-sound matching, which improves the success rate of voice recognition for non-popular interest points.
As shown in fig. 2, in the above speech recognition method, the step of inputting the text information into the skip-word matching function to query the matched interest point names in the interest point database specifically includes:
S610: determining whether a skip-word interval exists in the text information. For example, when the user speaks the interest point name "new time-generation garden" (a five-character Chinese name), the "new" and "generation" characters are generally spoken at a higher volume, while the "time" character may be spoken at a lower volume; if the user also speaks quickly, the "time" character may be omitted or its pronunciation masked by noise, so the effective voice signal received by the voice recognition model lacks that character and only the four remaining characters, "new generation garden", are left after conversion into text information. By analyzing the effective voice signal, it can be determined whether a skip-word interval exists between any two adjacent characters of the text information.
S620: if so, inserting, at each position in the text information where a skip-word interval exists, a number of single-character wildcards corresponding to the number of skipped characters, wherein each single-character wildcard represents exactly 1 arbitrary character. Typically the symbol "?" is used as the single-character wildcard; each "?" stands for exactly one arbitrary character inserted at its position, which may be one Chinese character, one English letter, or one character of another language. Continuing the example, after one single-character wildcard is inserted the text information becomes "new?generation garden", which matches any interest point name having exactly one arbitrary character between the "new" character and the "generation" character, such as "new era garden", "new modern garden", or "new E generation garden". When two single-character wildcards are inserted, giving "new??generation garden", it matches interest point names having two arbitrary characters between the "new" and "generation" characters, such as "New City Times Garden". In the technical solution of the invention, the single-character wildcard does not match any symbol that is silent in spoken expression, such as a hyphen "-", quotation marks, parentheses "()", square brackets "[]", braces "{}", angle brackets "<>", or book-title marks "《》".
S630: inserting generic wildcards at the head and the tail of the text information respectively to generate the query keyword, wherein a generic wildcard represents any number of arbitrary characters. Typically the symbol "*" is used as the generic wildcard in query syntax; each "*" stands for zero or more arbitrary characters at its position, which may be Chinese characters, English letters, or characters of other languages. Taking "new?generation garden" as an example, after the generic wildcards are inserted it becomes "*new?generation garden*", which matches any interest point name having one arbitrary character between the "new" and "generation" characters and any number of arbitrary characters before and after, such as "new era garden", "new generation of south mountain", or "latest E generation restaurant". As with the single-character wildcard, in the technical solution of the invention the generic wildcard does not match symbols that are silent in spoken expression.
S640: querying the matched interest point names in the interest point database using the query keyword. The text information with the single-character and generic wildcards inserted is used as the query keyword to query the interest point database, and the interest point names obtained by the query are output.
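Putting S620-S640 together, a sketch of the keyword construction and database query follows. It assumes a SQLite table poi(name TEXT) and translates the "?"/"*" wildcards into the SQL LIKE wildcards "_"/"%"; the table name, column name, and database path are illustrative, not from the patent:

```python
import sqlite3

def skip_word_query(text, skip_positions, db_path="poi.db"):
    """Build a '*new?generation garden*'-style keyword and query the POI table.

    skip_positions maps a character index i to the number of skipped
    characters detected between character i and character i + 1.
    """
    parts = []
    for i, ch in enumerate(text):
        parts.append(ch)
        parts.append("?" * skip_positions.get(i, 0))  # single-char wildcards
    keyword = "*" + "".join(parts) + "*"              # generic wildcards
    # Translate to SQL LIKE syntax: '?' -> '_' (one char), '*' -> '%' (any).
    pattern = keyword.replace("*", "%").replace("?", "_")
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT name FROM poi WHERE name LIKE ?",
                            (pattern,)).fetchall()
    return [name for (name,) in rows]
```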
As shown in fig. 3, in the above speech recognition method, the step of determining whether a skip-word interval exists in the text information specifically includes:
S611: performing a time-domain segmentation operation on the effective voice signal to obtain a plurality of voice segments of equal duration, wherein the number of voice segments is greater than or equal to the number of characters in the text information. The effective voice signal is divided into segments of equal duration using a preset duration as the unit. At a normal speaking rate, one second generally contains three to five characters; so that each segment carries enough voice information, so that the number of segments is at least the number of characters in the text information, and allowing for a certain pause between characters, the segment duration is preferably 50 to 100 milliseconds.
S612: establishing, according to the time-domain correspondence, an association between each character in the text information and those voice segments whose short-time average energy is greater than a preset first threshold, wherein each character is associated with one or more voice segments. Each character in the text information was converted from voice information in the effective voice signal, so each character corresponds to the voice segments of a specific time period within the signal. Among the segmented voice segments, those whose short-time average energy is greater than the preset first threshold carry effective voice information, while segments whose short-time average energy is below the first threshold typically lie at pauses between characters or at the beginning or end of the effective voice signal. Associating each character with the segments carrying effective voice information yields the time-domain boundaries corresponding to each character.
S613: counting the average number of associated voice segments per character. The sum of the numbers of voice segments corresponding to all characters in the text information is divided by the number of characters to obtain the average number of voice segments; from this average and the segment duration, the effective duration occupied by an average character in the user's spoken instruction, excluding pauses, can be calculated.
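A sketch of S611-S613, with the segment duration and the preset first threshold as assumed values:

```python
import numpy as np

def voiced_segments_and_average(signal, n_chars, sr=16000,
                                seg_ms=50, first_thresh=1e-4):
    """Sketch of S611-S613: equal-duration segmentation, voiced-segment
    marking by short-time average energy, and the per-character average.

    first_thresh stands in for the preset first threshold.
    Returns (boolean voiced mask per segment, average voiced segments
    per character).
    """
    seg_len = int(sr * seg_ms / 1000)
    n_segs = len(signal) // seg_len
    energy = np.array([np.mean(signal[k * seg_len:(k + 1) * seg_len] ** 2)
                       for k in range(n_segs)])
    voiced = energy > first_thresh            # segments carrying speech
    avg_segments = voiced.sum() / max(n_chars, 1)
    return voiced, avg_segments
```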
S614: counting the number of voice segments not associated with any character between adjacent characters in the text information. Unassociated voice segments include segments covering pauses and segments corresponding to character voice information that was too quiet or too slurred to be successfully recognized and converted.
S615: when the number of unassociated voice segments between any two adjacent characters is greater than a preset second threshold, determining that a skip-word interval exists between the two adjacent characters, wherein the second threshold is greater than the average number of voice segments. When a user speaks an interest point name without interference, the name is generally spoken steadily and coherently; in that case the pause before a character is generally shorter than the effective duration the character itself occupies. When the effective voice signal contains, between two adjacent characters, a span longer than the duration corresponding to the average number of voice segments, there is a relatively high probability that a character was skipped between them, and a skip-word interval is therefore determined to exist between the two characters.
S616: when the number of unassociated voice segments between any two adjacent characters is smaller than the preset second threshold, determining that no skip-word interval exists between the two adjacent characters; in this case the wildcard-insertion step of the skip-word matching function and its subsequent steps are not executed, and the method proceeds directly to the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database and its subsequent steps.
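A sketch of the gap test in S614-S616, assuming each recognized character has already been mapped to a contiguous (first, last) range of voiced-segment indices; the concrete value of the second threshold is an assumption, since the patent only requires it to exceed the average segment count:

```python
def find_skip_intervals(char_ranges, avg_segments, second_thresh=None):
    """Sketch of S614-S616. char_ranges holds, per character in order,
    the (first, last) indices of its associated voice segments.

    Returns the list of indices i with a skip-word interval between
    character i and character i + 1.
    """
    if second_thresh is None:
        second_thresh = 1.5 * avg_segments   # assumed: > average count
    skips = []
    for i in range(len(char_ranges) - 1):
        gap = char_ranges[i + 1][0] - char_ranges[i][1] - 1
        if gap > second_thresh:              # long gap: likely skipped char
            skips.append(i)
    return skips
```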
Further, in the above speech recognition method, the step of counting the number of unassociated voice segments between adjacent characters in the text information specifically includes counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold, and the step of determining that a skip-word interval exists between two adjacent characters specifically includes:
determining that a skip-word interval exists between any two adjacent characters when the number of unassociated voice segments between them is greater than the preset second threshold and the number of voice segments between them whose short-time average energy is greater than the preset first threshold is greater than a preset third threshold. In this embodiment, when the voice segments between two adjacent characters include segments whose short-time average energy exceeds the preset first threshold and their number exceeds the preset third threshold, it is highly likely that the corresponding position of the effective voice signal contains valid voice information that was not successfully converted into a character; in this case a skip-word interval is determined to exist between the two adjacent characters. Further, the third threshold is greater than the average number of voice segments.
Further, in the above speech recognition method, after the step of determining that a skip-word interval exists between two adjacent characters, the method further includes calculating the number of skip-word intervals between the two adjacent characters according to the number of unassociated voice segments between them. In this embodiment, when the number of unassociated voice segments between two adjacent characters is large, for example more than two or three times the average number of voice segments, the number of skip-word intervals between them is correspondingly large. Preferably, when the number of voice segments between any two adjacent characters whose short-time average energy is greater than the preset first threshold exceeds twice the average number of voice segments, the number of skip-word intervals is taken as the number of unassociated voice segments between the two characters divided by the average number of voice segments, rounded down.
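A sketch of this preferred counting rule, where gap_segments is the number of unassociated voice segments between the two adjacent characters:

```python
import math

def skip_interval_count(gap_segments, avg_segments):
    """Preferred rule described above: once the gap exceeds twice the
    average segment count per character, the number of skip-word
    intervals is floor(gap / average); otherwise a detected skip
    interval counts as one."""
    if gap_segments > 2 * avg_segments:
        return math.floor(gap_segments / avg_segments)
    return 1
```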
Further, in the above speech recognition method, when the number of skip-word intervals between the two adjacent characters is greater than 2, a speech recognition failure is prompted and the voice monitoring state is re-entered. When more than two skip-word intervals exist between two adjacent characters, the effective voice signal has not been recognized completely and the voice recognition is deemed to have failed; the user needs to input the voice control instruction again.
With reference to fig. 3, in the above speech recognition method, after the step of counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold, the method further includes:
S617: counting the number of consecutive unassociated voice segments between adjacent characters in the text information whose short-time average energy is greater than the preset first threshold. When such segments are not consecutive, the voice information they carry is more likely to be impulse noise that could not be removed in the denoising stage.
S618: when the number of consecutive voice segments is smaller than a preset fourth threshold, executing the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database and outputting the interest point list corresponding to the full-word matching result; otherwise, not executing that step. In this embodiment, a skip-word interval is determined to exist between two adjacent characters only when the unassociated voice segments between them whose short-time average energy exceeds the preset first threshold are consecutive and their number is greater than the preset fourth threshold. When a skip-word interval is determined to exist in the text information, the full-word matching query and the corresponding output are not performed, so that the full-word matching results do not distract the user.
Further, in the above speech recognition method, the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database specifically includes:
inserting generic wildcards at the head and the tail of the text information respectively to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
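As a sketch, the full-word keyword is the degenerate case of the skip-word keyword above, with no single-character wildcards:

```python
def full_word_keyword(text):
    """Full-word matching keyword: generic wildcards only, e.g.
    'new generation garden' -> '*new generation garden*'
    (translated to '%new generation garden%' for SQL LIKE)."""
    return "*" + text + "*"
```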
Further, in the above speech recognition method, the step of inputting the text information into the near-sound matching function to query the matched interest point names in the interest point database specifically includes:
acquiring homophonic characters and near-sound characters of each character in the text information;
replacing the corresponding characters in the text information with the homophonic characters and near-sound characters respectively to generate near-sound text information;
inserting generic wildcards at the head and the tail of the text information and of each piece of near-sound text information respectively, and splicing the results in an OR relationship to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
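A sketch of the near-sound keyword generation; the homophone/near-sound lookup table is a hypothetical resource not specified by the patent (in practice it could be built from a pinyin dictionary), and the returned keywords are what a query layer would splice together with OR:

```python
HOMOPHONES = {}  # hypothetical map: character -> same/near-pronunciation chars

def near_sound_keywords(text, max_variants=50):
    """Generate '*variant*' keywords for the original text and every
    single-character homophone/near-sound substitution."""
    variants = [text]
    for i, ch in enumerate(text):
        for alt in HOMOPHONES.get(ch, []):
            candidate = text[:i] + alt + text[i + 1:]
            if candidate not in variants:
                variants.append(candidate)
            if len(variants) >= max_variants:   # cap combinatorial growth
                return ["*" + v + "*" for v in variants]
    return ["*" + v + "*" for v in variants]
```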
Further, in the above speech recognition method, when the number of interest point names in the skip-word matching result is less than 5, the near-sound text information is also input into the skip-word matching function to query the matched interest point names in the interest point database.
A second aspect of the present invention provides a navigation device, including a display unit, a positioning unit, a storage unit, and a processing unit, where the processing unit is configured to execute a computer program stored in the storage unit to implement any of the voice recognition methods provided in the first aspect of the present invention.
The invention provides a voice recognition method and a navigation device: entering a voice monitoring state upon receiving a voice wake-up instruction; detecting the endpoint positions of a voice signal, the endpoint positions comprising a start position and an end position of the voice signal; ending the voice monitoring state after the end position of the voice signal is detected; processing the voice signal between the start position and the end position to obtain an effective voice signal; inputting the effective voice signal into a voice recognition model to convert it into corresponding text information, the text information being a character string formed of one or more characters; inputting the text information into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database; and outputting interest point lists corresponding to the matching results of the skip-word, full-word and/or near-sound matching respectively, thereby improving the success rate of voice recognition for non-popular interest points.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A speech recognition method, comprising:
receiving a voice wake-up instruction to enter a voice monitoring state;
detecting endpoint positions of a voice signal, wherein the endpoint positions comprise a start position and an end position of the voice signal;
ending the voice monitoring state after the end position of the voice signal is detected;
processing the voice signal between the start position and the end position to obtain an effective voice signal;
inputting the effective voice signal into a voice recognition model to convert it into corresponding text information, wherein the text information is a character string formed of one or more characters;
inputting the text information into a skip-word matching function, a full-word matching function and/or a near-sound matching function respectively to query the matched interest point names in an interest point database;
and outputting interest point lists corresponding to the matching results of the skip-word matching, the full-word matching and/or the near-sound matching respectively.
2. The speech recognition method of claim 1, wherein the step of inputting the text information into the skip-word matching function to query the matched interest point names in the interest point database specifically comprises:
determining whether a skip-word interval exists in the text information;
if so, inserting, at each position in the text information where a skip-word interval exists, a number of single-character wildcards corresponding to the number of skipped characters, wherein each single-character wildcard represents exactly 1 arbitrary character;
inserting generic wildcards at the head and the tail of the text information respectively to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
3. The speech recognition method of claim 2, wherein the step of determining whether a skip-word interval exists in the text information specifically comprises:
performing a time-domain segmentation operation on the effective voice signal to obtain a plurality of voice segments of equal duration, wherein the number of voice segments is greater than or equal to the number of characters in the text information;
establishing, according to the time-domain correspondence, an association between each character in the text information and those voice segments whose short-time average energy is greater than a preset first threshold, wherein each character is associated with one or more voice segments;
counting the average number of associated voice segments per character;
counting the number of voice segments not associated with any character between adjacent characters in the text information;
and when the number of unassociated voice segments between any two adjacent characters is greater than a preset second threshold, determining that a skip-word interval exists between the two adjacent characters, wherein the second threshold is greater than the average number of voice segments.
4. The speech recognition method of claim 3, wherein the step of counting the number of unassociated voice segments between adjacent characters in the text information specifically comprises counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold, and the step of determining that a skip-word interval exists between two adjacent characters specifically comprises:
determining that a skip-word interval exists between any two adjacent characters when the number of unassociated voice segments between them is greater than the preset second threshold and the number of voice segments between them whose short-time average energy is greater than the preset first threshold is greater than a preset third threshold.
5. The speech recognition method of claim 4, further comprising, after the step of determining that a skip-word interval exists between two adjacent characters, calculating the number of skip-word intervals between any two adjacent characters according to the number of unassociated voice segments between the two adjacent characters.
6. The speech recognition method of claim 5, wherein when the number of skip-word intervals between the two adjacent characters is greater than 2, a speech recognition failure is prompted and the voice monitoring state is re-entered.
7. The speech recognition method according to any one of claims 4 to 6, further comprising, after the step of counting the number of unassociated voice segments between adjacent characters whose short-time average energy is greater than the preset first threshold:
counting the number of consecutive unassociated voice segments between adjacent characters in the text information whose short-time average energy is greater than the preset first threshold;
and when the number of consecutive voice segments is smaller than a preset fourth threshold, executing the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database and outputting the interest point list corresponding to the full-word matching result; otherwise, not executing that step.
8. The speech recognition method of claim 1, wherein the step of inputting the text information into the full-word matching function to query the matched interest point names in the interest point database specifically comprises:
inserting generic wildcards at the head and the tail of the text information respectively to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
9. The speech recognition method of claim 1, wherein the step of inputting the text information into the near-sound matching function to query the matched interest point names in the interest point database specifically comprises:
acquiring homophonic characters and near-sound characters of each character in the text information;
replacing the corresponding characters in the text information with the homophonic characters and near-sound characters respectively to generate near-sound text information;
inserting generic wildcards at the head and the tail of the text information and of each piece of near-sound text information respectively, and splicing the results in an OR relationship to generate a query keyword, wherein a generic wildcard represents any number of arbitrary characters;
and querying the matched interest point names in the interest point database using the query keyword.
10. A navigation device comprising a display unit, a positioning unit, a storage unit, and a processing unit, wherein the processing unit is configured to execute a computer program stored in the storage unit to implement the speech recognition method according to any one of claims 1 to 9.
Application CN202210713257.6A, filed 2022-06-22 (priority date 2022-06-22): Voice recognition method and navigation device; status Active; granted as CN114964300B.

Priority Applications (1)

Application number: CN202210713257.6A (granted as CN114964300B); priority date: 2022-06-22; filing date: 2022-06-22; title: Voice recognition method and navigation device

Applications Claiming Priority (1)

Application number: CN202210713257.6A (granted as CN114964300B); priority date: 2022-06-22; filing date: 2022-06-22; title: Voice recognition method and navigation device

Publications (2)

CN114964300A, published 2022-08-30
CN114964300B, published 2023-03-28

Family

ID=82965190

Family Applications (1)

Application number: CN202210713257.6A; title: Voice recognition method and navigation device; priority date: 2022-06-22; filing date: 2022-06-22; status: Active

Country Status (1)

Country: CN; published as CN114964300B

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090150156A1 (en) * 2007-12-11 2009-06-11 Kennewick Michael R System and method for providing a natural language voice user interface in an integrated voice navigation services environment
CN104535071A (en) * 2014-12-05 2015-04-22 百度在线网络技术(北京)有限公司 Voice navigation method and device
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN108871370A (en) * 2018-07-03 2018-11-23 北京百度网讯科技有限公司 Air navigation aid, device, equipment and medium
CN109918485A (en) * 2019-01-07 2019-06-21 口碑(上海)信息技术有限公司 The method and device of speech recognition vegetable, storage medium, electronic device

Also Published As

CN114964300B, published 2023-03-28

Similar Documents

Publication Publication Date Title
KR100998566B1 (en) Method And Apparatus Of Translating Language Using Voice Recognition
CN108288467B (en) Voice recognition method and device and voice recognition engine
EP1617409B1 (en) Multimodal method to provide input to a computing device
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US8380505B2 (en) System for recognizing speech for searching a database
US9020819B2 (en) Recognition dictionary system and recognition dictionary system updating method
JP5835197B2 (en) Information processing system
US20080177541A1 (en) Voice recognition device, voice recognition method, and voice recognition program
US20060100871A1 (en) Speech recognition method, apparatus and navigation system
KR20210103002A (en) Speech synthesis method and apparatus based on emotion information
JP3476008B2 (en) A method for registering voice information, a method for specifying a recognition character string, a voice recognition device, a storage medium storing a software product for registering voice information, and a software product for specifying a recognition character string are stored. Storage media
CN111292751B (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN101162153A (en) Voice controlled vehicle mounted GPS guidance system and method for realizing same
JP2002524806A (en) Interactive user interface for networks using speech recognition and natural language processing
JPWO2005122144A1 (en) Speech recognition apparatus, speech recognition method, and program
JP2009237750A (en) Information search support device and information search support method
KR20200098079A (en) Dialogue system, and dialogue processing method
US11615787B2 (en) Dialogue system and method of controlling the same
CN114964300B (en) Voice recognition method and navigation device
JP2017151578A (en) Language setting system and language setting program
JP4808763B2 (en) Audio information collecting apparatus, method and program thereof
CN111798842B (en) Dialogue system and dialogue processing method
US10832675B2 (en) Speech recognition system with interactive spelling function
JP2005267092A (en) Correspondence analyzing device and navigation device
US20230178071A1 (en) Method for determining a vehicle domain and a speech recognition system for a vehicle

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant