WO2024067471A1

WO2024067471A1 - Speech recognition method, and server, speech recognition system and readable storage medium

Info

Publication number: WO2024067471A1
Application number: PCT/CN2023/121063
Authority: WO
Inventors: 李明洋
Original assignee: 广州小鹏汽车科技有限公司
Priority date: 2022-09-26
Filing date: 2023-09-25
Publication date: 2024-04-04
Also published as: CN115294964A; CN115294964B

Abstract

A speech recognition method, and a server, a speech recognition system and a readable storage medium. The speech recognition method comprises: acquiring a recognition text, which is obtained by means of performing speech recognition on a speech request (01); recognizing the recognition text according to a preset model, so as to obtain a first label text (02); when it is determined that the first label text does not satisfy a preset condition, performing correction processing on the first label text, and then generating a second label text (03); and generating a navigation result according to the recognition text and the second label text (04).

Description

Speech recognition method, server, speech recognition system and readable storage medium

This application claims priority to Chinese patent application No. 202211170954.8 filed on September 26, 2022, the entire contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the field of vehicle navigation technology, and in particular to a speech recognition method, a server, a speech recognition system and a readable storage medium.

Background technique

When performing voice navigation, there may be situations where mixed languages involving minority languages need to be recognized. Language recognition of a single language is prone to the OOV (Out-of-vocabulary) problem. In related technologies, the OOV problem can be solved to a certain extent, but there is still a situation where sub-words contain incomplete information, resulting in a low effect of speech recognition for mixed languages.

Technical Solutions

The present application provides a speech recognition method, a server, a speech recognition system and a readable storage medium.

A speech recognition method of the present application includes:

Acquire a recognition text, where the recognition text is obtained by performing voice recognition on the voice request;

Recognize the recognition text according to a preset model to obtain a first label text;

When it is determined that the first label text does not meet the preset condition, the first label text is corrected and a second label text is generated;

A navigation result is generated according to the recognized text and the second label text.

In the above-mentioned speech recognition method, the first label text does not meet the preset conditions, which means that the result recognized by the preset model cannot represent the correct navigation intention. Therefore, correction processing is performed on the basis of the first label text and a second label text is generated, so that the second label text can represent the correct navigation intention, and the navigation result is generated by recognizing the text and the second label text, which is beneficial to improving the recognition effect of mixed languages.

Recognizing the recognition text according to a preset model to obtain a first label text includes:

According to the preset model, embedding processing is performed on the recognized text to obtain embedded text;

According to the preset model, encoding the recognized text to obtain an encoded text;

The first label text is generated according to the embedded text and the encoded text.

This can help improve the accuracy of text recognition.

The speech recognition method comprises:

Determine a navigation word label and an interest point label in the first label text, wherein the navigation word label corresponds to the navigation word text in the recognition text, and the interest point label corresponds to the interest point text in the recognition text;

When a text portion between the navigation word text and the point of interest text in the recognized text does not correspond to any one of the navigation word label or the point of interest label, it is determined that the first label text does not meet the preset condition.

In this way, it is convenient to determine whether a specific solution is needed to correct the first label text.

The speech recognition method comprises:

In the case where a text portion between the navigation word text and the interest point text in the recognition text corresponds to the navigation word label or the interest point label, the navigation result is generated according to the recognition text and the first label text.

In this way, when it is determined that no correction is needed, the navigation result can be obtained directly.

Correcting the first label text and generating a second label text includes:

Obtain at least two interest point combinations according to the text portion and the interest point text;

Determine the score of each interest point combination according to a preset vocabulary and word frequency features, each word of the recognition text having a corresponding word frequency feature;

The interest point label is modified to correspond to an interest point combination with the highest score, and the second label text is generated according to the navigation word label and the modified interest point label.

In this way, a specific solution for correcting the first label text can be implemented.

The speech recognition method comprises:

determining at least one point of interest entity in the recognized text;

Splitting the recognized text into a plurality of text segments, each of the interest point entities being located in one of the text segments;

According to a preset tag tree, a tag corresponding to each of the text segments is obtained, wherein the tag includes a navigation word tag and a point of interest tag;

Fill the multiple text fragments with corresponding labels, wherein, when the number of the interest point entities is at least two, one of the interest point entities is filled with an interest point type label, and at least one interest point entity is filled with an interest point limited name label, the interest point labels include the interest point type label and the interest point limited name label, and the interest point type label and the interest point limited name label have a corresponding dependency relationship in the number of labels.

This will help improve the ability to understand the semantics of complex sentences.

The speech recognition method comprises:

Generate the mapping relationship between core text and tags based on the intent vocabulary and tag mapping table;

Combining the plurality of labels to generate a plurality of label sentences, wherein different labels have a dependency relationship in the label sentences;

A tag tree is constructed according to the multiple tag sentence patterns.

In this way, the tag tree can be constructed.

A server of the present application includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of any one of the above-mentioned speech recognition methods are implemented.

For the above-mentioned server, the first label text does not meet the preset conditions, indicating that the result recognized by the preset model cannot represent the correct navigation intention, so correction processing is performed on the basis of the first label text and a second label text is generated, so that the second label text can represent the correct navigation intention, and the navigation result is generated by recognizing the text and the second label text, which is conducive to improving the recognition effect of mixed languages.

A speech recognition system of the present application includes a server and a vehicle, wherein the server is used to:

receiving a voice request;

If it is determined that the first label text does not meet the preset condition, modifying the first label text and generating a second label text; and

generating a navigation result according to the recognized text and the second label text;

The vehicle is used for:

sending the voice request; and

The navigation result is received.

In the above-mentioned speech recognition system, if the first label text does not meet the preset conditions, it means that the result recognized by the preset model cannot represent the correct navigation intention. Therefore, correction processing is performed on the basis of the first label text and a second label text is generated so that the second label text can represent the correct navigation intention. The navigation result is generated by recognizing the text and the second label text, which is beneficial to improving the recognition effect of mixed languages.

The present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of any one of the above-mentioned speech recognition methods are implemented.

Beneficial Effects

For the above-mentioned computer-readable storage medium, the first label text does not meet the preset conditions, indicating that the result recognized by the preset model cannot represent the correct navigation intention, so that correction processing is performed on the basis of the first label text and a second label text is generated, so that the second label text can represent the correct navigation intention, and the navigation result is generated by recognizing the text and the second label text, which is conducive to improving the recognition effect of mixed languages.

Additional aspects and advantages of the present application will be given in part in the description below, and in part will become apparent from the description below, or will be learned through the practice of the present application.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and/or additional aspects and advantages of the present application will become apparent and easily understood from the description of the present application in conjunction with the following drawings, in which:

FIG1 is a flow chart of the speech recognition method of the present application;

FIG2 is a schematic diagram of a module of a server of the present application;

FIG3 is a schematic diagram of recognizing text by using a preset model in the present application;

FIG4 is a schematic diagram of a tag tree of the present application;

FIG5 is a schematic diagram of a speech recognition system of the present application.

Description of main component symbols:

Server 10, memory 11, processor 12;

Vehicle 20, vehicle-mounted terminal 21;

Speech recognition system 30.

Embodiments of the present invention

The embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals throughout represent the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present application, and cannot be understood as limiting the present application.

Please refer to FIG1 , a speech recognition method of the present application includes:

01: Get the recognized text, which is obtained by performing voice recognition on the voice request;

02: Recognize the recognition text according to the preset model to obtain the first label text;

03: When it is determined that the first label text does not meet the preset condition, the first label text is corrected and a second label text is generated;

04: Generate navigation results based on the recognized text and the second label text.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, please refer to Figure 2, the server 10 includes a memory 11 and a processor 12. The memory 11 stores a computer program. The processor 12 can execute the computer program to implement the steps of the speech recognition method of the present application. Specifically, the processor 12 is used to: obtain the recognition text, the recognition text is obtained by performing speech recognition on the voice request; recognize the recognition text according to a preset model to obtain a first label text; when it is determined that the first label text does not meet the preset conditions, correct the first label text and generate a second label text; generate a navigation result based on the recognition text and the second label text.

In the above-mentioned speech recognition method and server 10, the first label text does not meet the preset conditions, indicating that the result recognized by the preset model cannot represent the correct navigation intention, so that correction processing is performed on the basis of the first label text and a second label text is generated, so that the second label text can represent the correct navigation intention, and the navigation result is generated by recognizing the text and the second label text, which is beneficial to improving the recognition effect of mixed languages.

The voice request corresponds to the voice information received from the user. The voice request may be in a mixed language. The mixed language may include different languages.

The recognized text is a text obtained by performing speech recognition on the speech request. The speech recognition on the speech request may be implemented by using ASR (Automatic Speech Recognition).

In addition, after obtaining the recognition text, semantic recognition can be performed on the recognition text first, so as to determine the field to which the user's voice request belongs. In an application scenario, the voice request may be "How is the weather today?", then the corresponding recognition text can be determined to belong to the weather-related field. In an application scenario, the voice request may be "Change a song", then the corresponding recognition text can be determined to belong to the music-related field. In an application scenario, the voice request may be "Navigate to the train station", then the corresponding recognition text can be determined to belong to the navigation-related field. After determining the field to which the recognition text belongs, the recognition text can be distributed to the processing module of the corresponding field through the central control.

When the recognized text is distributed to the processing module of the corresponding field, the POI (Point of Interest) in the recognized text can be recognized through a preset model. For example, when the recognized text is in a mixed language, the generalization of the POI in the recognized text can be improved through a preset model, and the corresponding POI can be recognized even if part of the text corresponding to the POI in the recognized text is outside the training set. After the corresponding POI is recognized, the recognition result can be uploaded to the rule engine to determine whether the recognition result needs to be corrected.

Specifically, the recognition result includes a first label text. The first label text is text information formed by multiple labels obtained by generating corresponding labels according to the type of corresponding content of the recognized text during the process of performing POI recognition on the recognized text. The main navigation-related information in the recognized text can be determined based on the first label text.

After obtaining the first label text, it can be determined whether the first label text meets the preset conditions, and whether the recognition result is sufficiently accurate can be determined based on the judgment result. If it is determined that the first label text does not meet the preset conditions, the first label text will be corrected to generate a second label text, and the second label text can correctly represent the navigation information in the recognition text, and then the navigation result is generated based on the recognition text and the second label text, and the user can determine the relevant information of the POI in the recognition text based on the navigation result.

Step 02 (recognizing the recognition text according to the preset model to obtain the first label text) includes:

According to the preset model, the recognized text is embedded to obtain the embedded text;

According to the preset model, the recognized text is encoded to obtain the encoded text;

A first label text is generated according to the embedded text and the encoded text.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, referring to FIG2 , the processor 12 is used to: embed the recognized text according to a preset model to obtain an embedded text; encode the recognized text according to the preset model to obtain an encoded text; and generate a first label text according to the embedded text and the encoded text.

This can help improve the accuracy of text recognition.

Specifically, please refer to Figure 3. When the recognition text is recognized according to the preset model, the recognition text will be embedded and encoded respectively. In the embedding process, the recognition text will be language embedded (language embedding), position embedded (position embedding) and token embedded (token embedding). In the encoding process, the recognition text will be feature encoded (character encoder). After completing the embedding process and the encoding process, the processing result can be input into the converter for conversion to generate the first label text. The embedding process of the recognition text can be implemented by the BERT model (Multilingual Bidirectional Encoder Representations from Transformer, multilingual BERT). The encoding process of the recognition text can be implemented by the charbert model.

On the basis of the above, by combining the Charbert model and the Bert model for POI recognition, it can have good generalization for POIs with low word frequency in small languages. Compared with traditional pre-training models (such as the Bert model), it has better recognition effect on POIs with blurred boundaries in mixed languages, and can improve the overall accuracy by 3% in the scenario of mixed language recognition.

In an application scenario, the recognized text is "go to Sykehus near Det juridiske fakultet". After recognizing the recognized text according to the preset model, multiple text information can be obtained: "go—O", "to—O", "Sykehus—S-POI", "near—O", "Det—B-POI", "juridiske—I-POI", "fakultet—E-POI". Among them, "go—O", "to—O", and "near—O" respectively indicate that "go", "to", and "near" in the recognition text are recognized as other types of entity words, "Sykehus—S-POI" indicates that "Sykehus" in the recognition text is recognized as a single word representing POI, "Det—B-POI" indicates that "Det" in the recognition text is recognized as the starting part of the compound word representing POI, "juridiske—I-POI" indicates that "juridiske" in the recognition text is recognized as the middle part of the compound word representing POI, and "fakultet—E-POI" indicates that "fakultet" in the recognition text is recognized as the ending part of the compound word representing POI. The first label text finally obtained can be:

"entities": [{"word": "Sykehus","start": 2,"end": 3,"type": "POI"}, {"word": "Det juridiske fakultet","start": 4,"end": 7,"type": "POI"}].

Speech recognition methods include:

When the text portion between the navigation word text and the point of interest text in the recognition text does not correspond to any one of the navigation word label or the point of interest label, it is determined that the first label text does not meet the preset condition.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, referring to FIG. 2 , the processor 12 is used to: determine the navigation word label and the point of interest label in the first label text, the navigation word label corresponds to the navigation word text in the recognition text, and the point of interest label corresponds to the point of interest text in the recognition text; if the text portion between the navigation word text and the point of interest text in the recognition text does not correspond to any one of the navigation word label or the point of interest label, determine that the first label text does not meet the preset condition.

Specifically, in some application scenarios, there may be recognition errors when recognizing the recognition text according to the preset model. For the scenario where the voice request is "i want to go Too Good To Go Norge", the actual POI is "Too Good To Go Norge", while the POI recognized in the recognition text may be "Norge". For the scenario where the voice request is "find me the quickest route to The Big 5 AS", the actual POI is "The Big 5 AS", but "The" may be omitted when recognizing in the recognition text, so that the obtained POI is "Big 5 AS". For the scenario where the voice request is "search for A 2 Pas Quadris", the actual POI is "A 2 Pas Quadris", but "A" may be omitted when recognizing in the recognition text, so that the obtained POI is "2 Pas Quadris". The above scenarios are more likely to occur when performing POI recognition on mixed languages.

For example, in the case of a voice request of "i want to go Too Good To Go Norge", when the corresponding first label text is obtained, the navigation word label "i want to go" will be filled in, and the point of interest label "Norge" will be filled in. Since the text part between the navigation word text and the point of interest text in the recognition text, i.e. "Too Good To Go", cannot be recognized as the corresponding type of entity word, the corresponding label cannot be filled in, and it can be considered that there is a recognition error problem, and the recognition result needs to be corrected to determine that the first label text does not meet the preset conditions.

Speech recognition methods include:

In the case where a text portion between the navigation word text and the point of interest text in the recognition text corresponds to a navigation word label or a point of interest label, a navigation result is generated according to the recognition text and the first label text.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, referring to FIG. 2 , the processor 12 is used to: generate a navigation result according to the recognition text and the first label text when the text portion between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label.

Specifically, when the text portion between the navigation word text and the point of interest text in the recognition text corresponds to a navigation word label or a point of interest label, it can be determined that the first label text meets the preset conditions, and the recognition text can be correctly identified through the first label text, so that the navigation result can be generated directly based on the recognition text combined with the first label text.

Based on the above, the preset condition can be understood as being used to determine whether the first label text can be used to directly generate a navigation result.

Step 03 (correcting the first label text and generating the second label text) includes:

Obtain at least two interest point combinations according to the text part and the interest point text;

According to the preset word list and word frequency features, the score of each interest point combination is determined, and each word of the recognition text has a corresponding word frequency feature;

The interest point label is modified to correspond to an interest point combination with the highest score, and a second label text is generated according to the navigation word label and the modified interest point label.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, please refer to FIG2, the processor 12 is used to: obtain at least two interest point combinations according to the text part and the interest point text; determine the score of each interest point combination according to the preset word list and word frequency characteristics, and each word of the recognition text has a corresponding word frequency feature; correct the interest point label to correspond to an interest point combination with the highest score, and generate a second label text according to the navigation word label and the corrected interest point label.

Specifically, in an application scenario, the voice request is "I want to go Too Good To Go Norge". When it is determined that the first label text does not meet the preset condition, multiple interest point combinations can be obtained according to "Too Good To Go" and "Norge" according to the longest first match calculation, such as "Too Good To Go Norge", "Good To Go Norge", "To Go Norge", "Go Norge", and "Norge".

After determining a corresponding POI combination, the number of times all words in the POI combination co-occur can be found according to the preset word list. The preset word list can be a multi-country POI word list. The preset word list can be obtained through open source data and corresponding partners. The number of times all words co-occur can correspond to the co-occurrence frequency of bigrams and trigrams, so as to determine the number of times all words in the POI combination appear in the same POI at the same time (weighted co-occurrence frequency feature). In an application scenario, the number of times the POI combination "Too Good" appears in the same POI at the same time is 21, so the result (Too, Good, 21) can be obtained, and the number of times the POI combination "Too Good To" appears in the same POI at the same time is 17, so the result (Too, Good, To, 17) can be obtained.

Each word (entity word) in the interest point combination has a corresponding word frequency feature. The word frequency feature can be calculated by word frequency (tf, term frequency) and inverse document frequency index (idf, inverse document frequency). After the word frequency feature is calculated, the calculation result can be stored and directly obtained when needed.

For each combination of interest points, the corresponding score can be calculated by the following formula:

S=sum（tf*idf）*Fp*Fw

Among them, tf*idf represents the frequency feature of a specific word in the interest point combination, sum(tf*idf) represents the sum of the frequency features of all words in the interest point combination, Fp represents the penalty factor, and Fw represents the weighted co-occurrence frequency feature. The penalty factor can be determined according to the specific form of the specific interest point combination and the current POI recognition business. Different interest point combinations can have different penalty factors.

After obtaining the scores of all interest point combinations, the corresponding interest point label can be determined according to the interest point combination with the highest score. In an application scenario, when it is determined that the interest point combination with the highest score is "Too Good To Go Norge", "Too Good To Go Norge" will be filled as the interest point label, and the navigation word label is still "i want to go", so as to obtain the second label text according to the corrected result.

Speech recognition methods include:

Determining at least one point of interest entity in the recognized text;

The recognized text is split into multiple text segments, and each point of interest entity is located in one text segment;

According to the preset tag tree, obtain the tag corresponding to each text segment, which includes navigation word tags and point of interest tags;

Corresponding labels are filled in multiple text fragments, wherein, when the number of interest point entities is at least two, one of the interest point entities is filled with an interest point type label, and at least one interest point entity is filled with an interest point limited name label, the interest point labels include interest point type labels and interest point limited name labels, and the interest point type labels and interest point limited name labels have corresponding dependency relationships in the label tree.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, please refer to Figure 2, the processor 12 is used to: determine at least one point of interest entity in the recognized text; split the recognized text into multiple text segments, each point of interest entity is located in a text segment; according to the preset tag tree, obtain the label corresponding to each text segment, the label includes a navigation word label and a point of interest label; fill the corresponding labels for the multiple text segments, wherein, when the number of point of interest entities is at least two, one of the point of interest entities is filled with a point of interest type label, and at least one point of interest entity is filled with a point of interest limited name label, the point of interest label includes a point of interest type label and a point of interest limited name label, and the point of interest type label and the point of interest limited name label have a corresponding dependency relationship in the tag tree.

In an application scenario, the recognized text is "Please go to Sykehus on my way Det juridiske fakultet go highways". Through POI recognition, it can be determined that the points of interest in the recognized text include "Sykehus" and "Det juridiske fakultet". Then the recognized text is split into multiple text segments: "Please go to", "Sykehus", "on my way", "Det juridiske fakultet", "go highways". The point of interest "Sykehus" is located in the second text segment. The point of interest "Det juridiske fakultet" is located in the fourth text segment.

After all the text fragments are obtained, regular expression recognition will be performed according to the entity vocabulary corresponding to navigation, so that the text fragments can be as close as possible to the text mainly mapped by the tag in the tag tree. Specifically, the text with similar meanings of "Please go to" in the entity vocabulary may include "navigate to", so "Please go to" can be recognized as "navigate to". "On my way" and "go highways" can be found in the entity vocabulary, and thus they will be recognized as "on my way" and "go highways" respectively.

After regular expression recognition is completed, the recognized text is mapped to the corresponding label according to the mapping relationship between the entity vocabulary and the label tree. Specifically, according to the above mapping relationship, the label corresponding to "go to" is "kw_navigate (knowledge: navigation)", the label corresponding to "on my way" is "on_my_way", and the label corresponding to "go highways" is "route_preference".

In the case where the number of point of interest entities is at least two, it can be understood that if all point of interest entities are filled with point of interest tags, the actual locations corresponding to all point of interest entities may be used as navigation destinations. On the basis of the foregoing, according to the dependency relationship between the point of interest type tags and the point of interest limited name tags in the tag tree, the logical relationship between multiple point of interest entities in the text can be clearly identified. Specifically, the point of interest "Sykehus" is the actual navigation destination to be traveled to, and the point of interest "Det juridiske fakultet" represents the positional relationship with the navigation destination, which can be used to determine the location of the navigation destination. That is to say, the label corresponding to the point of interest "Sykehus" should be the point of interest type label, and the label corresponding to the point of interest "Det juridiske fakultet" should be the point of interest limited name label. Therefore, in the process of label filling, the label of the point of interest "Sykehus" is filled with the point of interest type label (POI_type), and the label of the point of interest "Det juridiske fakultet" is filled with the point of interest limited name label (limit_name), thereby achieving the effect of transcribing the semantic labels of the recognized text. In the actual semantic recognition scenario, it is beneficial to improve the semantic understanding ability of complex sentences and reduce the recognition errors caused by the inability to distinguish multiple recognized points of interest.

Speech recognition methods include:

Combining multiple labels into multiple label sentences, in which different labels have dependency relationships;

Construct a tag tree based on multiple tag sentences.

The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, referring to FIG. 2 , the processor 12 is used to: generate a mapping relationship between core text and tags according to the intention vocabulary and the tag mapping table; perform sentence combination on multiple tags to generate multiple tag sentences, in which different tags have a dependency relationship; and construct a tag tree according to multiple tag sentences.

In this way, the tag tree can be constructed.

Please refer to Figure 4, which shows an achievable label tree. Specifically, according to the core intent vocabulary and the label mapping table, the core text located in the core intent vocabulary can be mapped to the label mapping table. In the process of recognizing the recognized text, if the corresponding text segment is identified as the core text, or the text with similar meaning is the core text, the text segment can be mapped to the corresponding label in the label mapping table. In an application scenario, the core text "navigate to" can be mapped to the label "K: navigate" (corresponding to "Knowledge: Navigation" shown in Figure 4).

In the case of determining the above mapping relationship, according to the semantic relationship between the core texts, multiple tags can be combined into sentences to obtain the corresponding label sentence. In an application scenario, the recognized text is "go to KFC", where "go to" and "navigate to" have similar meanings, and "KFC" is recognized as a point of interest, so that the corresponding label sentence can be obtained as "K: navigate POI_type", where "POI_type" can correspond to "point of interest_type" in Figure 4, and the label "K: navigate" and the label "POI_type" in the label sentence form a dependent relationship. Among them, the label "POI_type" corresponds to the point of interest type label.

On the basis of the above, when multiple label sentences are obtained according to specific sentence combinations, multiple label sentences can be integrated to finally obtain a label tree. In an application scenario, the original file format of the label tree is: "template":"[D:POI_NAME@poi_name][K:nearby][D:POI_ADDRESS|DISTRICT@limit_address]"

In addition, in FIG4 , the core text mapped by the tag “K: nearby” may include “close to” and “near.” The tag “limit_name” corresponds to the interest point limit name tag.

In summary, the speech recognition method of the present application can achieve the following effects:

1. A mixed-language navigation semantic understanding solution is proposed, which can be extended to multiple language scenarios in a country;

2. The POI extraction algorithm based on char+mbert can help extract points of interest in mixed languages;

3. Through correction processing, the degree to which POIs in mixed languages are affected by English expressions can be reduced;

4. Support semantic understanding of selectable route preferences.

Please refer to FIG5 , a speech recognition system 30 of the present application includes a server 10 and a vehicle 20. The server 10 is used to: receive a speech request; obtain a recognition text, the recognition text is obtained by performing speech recognition on the speech request; recognize the recognition text according to a preset model to obtain a first label text; if it is determined that the first label text does not meet the preset conditions, correct the first label text and generate a second label text; and generate a navigation result according to the recognition text and the second label text. The vehicle 20 is used to: send a speech request; and receive a navigation result.

In the above-mentioned speech recognition system 30, the first label text does not meet the preset conditions, indicating that the result recognized by the preset model cannot represent the correct navigation intention, so that correction processing is performed on the basis of the first label text and a second label text is generated, so that the second label text can represent the correct navigation intention, and the navigation result is generated by recognizing the text and the second label text, which is beneficial to improving the recognition effect of mixed languages.

Specifically, please refer to FIG. 2 and FIG. 5. In FIG. 5, the vehicle 20 may include an on-board terminal 21. The vehicle 20 may obtain a voice request issued by a user through the on-board terminal 21, and send the obtained voice request to the server 10. In FIG. 2, the server 10 may receive a voice request sent by the on-board terminal 21. The voice request is transmitted to the processor 12, so that the processor 12 finally generates a navigation result according to the voice request. The server 10 may transmit the navigation result to the vehicle 20, and the vehicle 20 may receive the navigation result through the on-board terminal 21, and may feed back the navigation result to the user (such as displaying it to the user, or informing the user through voice broadcast).

A computer-readable storage medium stores a computer program, which, when executed by a processor, implements the steps of any of the above-mentioned speech recognition methods.

For example, when the computer program is executed, the following steps may be implemented:

The computer-readable storage medium may be provided in the server 10 or in other terminals. The server 10 may communicate with other terminals to obtain the corresponding program.

It can be understood that computer-readable storage media may include: any entity or device capable of carrying a computer program, recording media, USB flash drives, mobile hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), and software distribution media, etc. A computer program includes computer program code. The computer program code may be in source code form, object code form, executable file, or some intermediate form, etc. A computer-readable storage medium may include: any entity or device capable of carrying a computer program code, recording media, USB flash drives, mobile hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), and software distribution media.

Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the present application includes additional implementations in which the functions may not be performed in the order shown or discussed, including performing the functions in a substantially simultaneous manner or in the reverse order depending on the functions involved, which should be understood by technicians in the technical field to which the embodiments of the present application belong.

The logic and/or steps represented in the flowchart or otherwise described herein, for example, can be considered as an ordered list of executable instructions for implementing logical functions, and can be specifically implemented in any computer-readable medium for use by an instruction execution system, device or apparatus (such as a computer-based system, a system including a processing module, or other system that can fetch instructions from an instruction execution system, device or apparatus and execute instructions), or used in combination with these instruction execution systems, devices or apparatuses.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, the features defined as "first" and "second" may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of "plurality" is two or more, unless otherwise clearly and specifically defined.

Although the present application has been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and variations may be made to the present application without departing from the principles and spirit of the present application, and the scope of the present application is defined by the claims and their equivalents.

Claims

A speech recognition method, wherein the speech recognition method comprises:

Acquire a recognition text, where the recognition text is obtained by performing voice recognition on the voice request;

Recognize the recognition text according to a preset model to obtain a first label text;

When it is determined that the first label text does not meet the preset condition, the first label text is corrected and a second label text is generated;

A navigation result is generated according to the recognized text and the second label text.
The speech recognition method according to claim 1, wherein the recognition text is recognized according to a preset model to obtain the first label text, comprising:

According to the preset model, embedding processing is performed on the recognized text to obtain embedded text;

According to the preset model, encoding the recognized text to obtain an encoded text;

The first label text is generated according to the embedded text and the encoded text.
The speech recognition method according to claim 1, wherein the speech recognition method comprises:

Determine a navigation word label and an interest point label in the first label text, wherein the navigation word label corresponds to the navigation word text in the recognition text, and the interest point label corresponds to the interest point text in the recognition text;

When a text portion between the navigation word text and the point of interest text in the recognized text does not correspond to any one of the navigation word label or the point of interest label, it is determined that the first label text does not meet the preset condition.
The speech recognition method according to claim 3, wherein the speech recognition method comprises:

In the case where a text portion between the navigation word text and the interest point text in the recognition text corresponds to the navigation word label or the interest point label, the navigation result is generated according to the recognition text and the first label text.
The speech recognition method according to claim 3, wherein the step of performing correction processing on the first label text and generating the second label text comprises:

Obtain at least two interest point combinations according to the text portion and the interest point text;

Determine the score of each interest point combination according to a preset vocabulary and word frequency features, each word of the recognition text having a corresponding word frequency feature;

The interest point label is modified to correspond to an interest point combination with the highest score, and the second label text is generated according to the navigation word label and the modified interest point label.
The speech recognition method according to claim 1, wherein the speech recognition method comprises:

determining at least one point of interest entity in the recognized text;

Splitting the recognized text into a plurality of text segments, each of the interest point entities being located in one of the text segments;

According to a preset tag tree, a tag corresponding to each of the text segments is obtained, wherein the tag includes a navigation word tag and a point of interest tag;

Fill the multiple text fragments with corresponding labels, wherein, when the number of the interest point entities is at least two, one of the interest point entities is filled with an interest point type label, and at least one interest point entity is filled with an interest point limited name label, the interest point labels include the interest point type label and the interest point limited name label, and the interest point type label and the interest point limited name label have a corresponding dependency relationship in the number of labels.
The speech recognition method according to claim 1, wherein the speech recognition method comprises:

Generate the mapping relationship between core text and tags based on the intent vocabulary and tag mapping table;

Combining the plurality of labels to generate a plurality of label sentences, wherein different labels have a dependency relationship in the label sentences;

A tag tree is constructed according to the multiple tag sentence patterns.
A server, wherein the server comprises a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of the speech recognition method described in any one of claims 1 to 7 are implemented.
A speech recognition system, wherein the speech recognition system comprises a server and a vehicle, wherein the server is used for:

receiving a voice request;

Acquire a recognition text, where the recognition text is obtained by performing voice recognition on the voice request;

Recognize the recognition text according to a preset model to obtain a first label text;

If it is determined that the first label text does not meet the preset condition, modifying the first label text and generating a second label text; and

generating a navigation result according to the recognized text and the second label text;

The vehicle is used for:

sending the voice request; and

The navigation result is received.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.