WO2024067471A1 - Speech recognition method, and server, speech recognition system and readable storage medium - Google Patents

Speech recognition method, and server, speech recognition system and readable storage medium Download PDF

Info

Publication number
WO2024067471A1
WO2024067471A1 (PCT/CN2023/121063, CN2023121063W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
label
speech recognition
interest
interest point
Prior art date
Application number
PCT/CN2023/121063
Other languages
French (fr)
Chinese (zh)
Inventor
李明洋
Original Assignee
广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Publication of WO2024067471A1 publication Critical patent/WO2024067471A1/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00, specially adapted for navigation in a road network
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/322: Trees
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Definitions

  • the present application relates to the field of vehicle navigation technology, and in particular to a speech recognition method, a server, a speech recognition system and a readable storage medium.
  • the present application provides a speech recognition method, a server, a speech recognition system and a readable storage medium.
  • a speech recognition method of the present application includes:
  • the first label text is corrected and a second label text is generated;
  • a navigation result is generated according to the recognized text and the second label text.
  • if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention, and the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • the first label text is generated according to the embedded text and the encoded text.
  • the speech recognition method comprises:
  • the navigation result is generated according to the recognition text and the first label text.
  • the interest point label is modified to correspond to an interest point combination with the highest score, and the second label text is generated according to the navigation word label and the modified interest point label.
  • the speech recognition method comprises:
  • a tag corresponding to each of the text segments is obtained, wherein the tag includes a navigation word tag and a point of interest tag;
  • the interest point labels include the interest point type label and the interest point limited name label, and the interest point type label and the interest point limited name label have a corresponding dependency relationship in the label tree.
  • the speech recognition method comprises:
  • a tag tree is constructed according to the multiple tag sentence patterns.
  • the tag tree can be constructed.
  • a server of the present application includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of any one of the above-mentioned speech recognition methods are implemented.
  • if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention, so correction processing is performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention, and the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • a speech recognition system of the present application includes a server and a vehicle, wherein the server is used to:
  • the vehicle is used for:
  • the navigation result is received.
  • in the speech recognition system, if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention.
  • the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the steps of any one of the above-mentioned speech recognition methods are implemented.
  • if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention, so correction processing is performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention, and the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • FIG1 is a flow chart of the speech recognition method of the present application.
  • FIG2 is a schematic diagram of a module of a server of the present application.
  • FIG3 is a schematic diagram of recognizing text by using a preset model in the present application.
  • FIG4 is a schematic diagram of a tag tree of the present application.
  • FIG5 is a schematic diagram of a speech recognition system of the present application.
  • Server 10: memory 11, processor 12;
  • Vehicle 20: vehicle-mounted terminal 21;
  • a speech recognition method of the present application includes:
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the server 10 includes a memory 11 and a processor 12.
  • the memory 11 stores a computer program.
  • the processor 12 can execute the computer program to implement the steps of the speech recognition method of the present application.
  • the processor 12 is used to: obtain the recognition text, the recognition text is obtained by performing speech recognition on the voice request; recognize the recognition text according to a preset model to obtain a first label text; when it is determined that the first label text does not meet the preset conditions, correct the first label text and generate a second label text; generate a navigation result based on the recognition text and the second label text.
  • if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention, so correction processing is performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention, and the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • the voice request corresponds to the voice information received from the user.
  • the voice request may be in a mixed language.
  • the mixed language may include different languages.
  • the recognized text is a text obtained by performing speech recognition on the speech request.
  • the speech recognition on the speech request may be implemented by using ASR (Automatic Speech Recognition).
  • semantic recognition can be performed on the recognition text first, so as to determine the field to which the user's voice request belongs.
  • the voice request may be "How is the weather today?", then the corresponding recognition text can be determined to belong to the weather-related field.
  • the voice request may be "Change a song”, then the corresponding recognition text can be determined to belong to the music-related field.
  • the voice request may be "Navigate to the train station”, then the corresponding recognition text can be determined to belong to the navigation-related field.
  • the recognition text can be distributed to the processing module of the corresponding field through the central control.
  • POI (Point of Interest) recognition can then be performed on the recognition text in the navigation field.
  • the recognition result can be uploaded to the rule engine to determine whether the recognition result needs to be corrected.
  • the recognition result includes a first label text.
  • the first label text is the text information formed by multiple labels, each label being generated according to the type of the corresponding content of the recognized text during POI recognition.
  • the main navigation-related information in the recognized text can be determined based on the first label text.
  • After obtaining the first label text, whether it meets the preset condition can be determined, and the judgment result indicates whether the recognition result is sufficiently accurate. If the first label text does not meet the preset condition, it is corrected to generate a second label text that correctly represents the navigation information in the recognition text; the navigation result is then generated according to the recognition text and the second label text, and the user can determine the relevant POI information from the navigation result.
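The control flow of Steps 01-04 described above can be sketched as follows; `model`, `meets_condition` and `correct` are hypothetical stand-ins for the preset model, the rule-engine check and the correction step, none of which are specified as code in the patent:

```python
def handle_voice_request(recognition_text, model, meets_condition, correct):
    """Sketch of Steps 02-04: recognize, check the preset condition,
    correct the first label text when the condition is not met."""
    first_label_text = model(recognition_text)                    # Step 02
    if meets_condition(first_label_text):
        label_text = first_label_text                             # use as-is
    else:
        label_text = correct(first_label_text, recognition_text)  # Step 03
    # Step 04 generates the navigation result from these two inputs.
    return (recognition_text, label_text)

result = handle_voice_request(
    "i want to go Too Good To Go Norge",
    model=lambda text: "first label text",
    meets_condition=lambda label_text: False,   # forces the correction path
    correct=lambda label_text, text: "second label text",
)
print(result[1])  # 'second label text'
```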
  • Step 02 recognizing the recognition text according to the preset model to obtain the first label text.
  • the recognized text is embedded to obtain the embedded text
  • the recognized text is encoded to obtain the encoded text
  • a first label text is generated according to the embedded text and the encoded text.
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: embed the recognized text according to a preset model to obtain an embedded text; encode the recognized text according to the preset model to obtain an encoded text; and generate a first label text according to the embedded text and the encoded text.
  • When the recognition text is recognized according to the preset model, it is embedded and encoded respectively.
  • In the embedding process, the recognition text undergoes language embedding, position embedding and token embedding.
  • In the encoding process, the recognition text undergoes character-level encoding (character encoder).
  • After the embedding process and the encoding process are completed, the processing results can be input into the transformer for conversion to generate the first label text.
  • the embedding process of the recognition text can be implemented by the mBERT model (Multilingual Bidirectional Encoder Representations from Transformers).
  • the encoding process of the recognition text can be implemented by the CharBERT model.
  • the recognized text is "go to Sykehus near Det juridiske fakultet”.
  • multiple text information can be obtained: “go—O”, “to—O”, “Sykehus—S-POI”, “near—O”, “Det—B-POI”, “juridiske—I-POI”, “fakultet—E-POI”.
  • “go—O”, “to—O”, and “near—O” respectively indicate that "go”, "to”, and “near” in the recognition text are recognized as other types of entity words
  • “Sykehus—S-POI” indicates that "Sykehus” in the recognition text is recognized as a single word representing POI
  • “Det—B-POI” indicates that "Det” in the recognition text is recognized as the starting part of the compound word representing POI
  • "juridiske—I-POI” indicates that "juridiske” in the recognition text is recognized as the middle part of the compound word representing POI
  • “fakultet—E-POI” indicates that "fakultet” in the recognition text is recognized as the ending part of the compound word representing POI.
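The per-token labels in the example above follow a BIOES-style scheme (O, S-POI, B-POI, I-POI, E-POI). A minimal decoder for such label sequences, written for this example only, might look like:

```python
def extract_pois(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs; returns the POI strings.
    S-POI marks a single-word POI; B-POI/I-POI/E-POI mark the start,
    middle and end of a compound POI; O marks other entity words."""
    pois, current = [], []
    for word, tag in tagged_tokens:
        if tag == "S-POI":
            pois.append(word)
        elif tag == "B-POI":
            current = [word]
        elif tag == "I-POI":
            current.append(word)
        elif tag == "E-POI":
            current.append(word)
            pois.append(" ".join(current))
            current = []
        else:  # "O" or any other tag closes no POI span
            current = []
    return pois

tagged = [("go", "O"), ("to", "O"), ("Sykehus", "S-POI"), ("near", "O"),
          ("Det", "B-POI"), ("juridiske", "I-POI"), ("fakultet", "E-POI")]
print(extract_pois(tagged))  # ['Sykehus', 'Det juridiske fakultet']
```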
  • the first label text finally obtained can be:
  • The speech recognition method includes:
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: determine the navigation word label and the point of interest label in the first label text, the navigation word label corresponds to the navigation word text in the recognition text, and the point of interest label corresponds to the point of interest text in the recognition text; if the text portion between the navigation word text and the point of interest text in the recognition text does not correspond to any one of the navigation word label or the point of interest label, determine that the first label text does not meet the preset condition.
  • the actual POI is "Too Good To Go Norge”
  • the POI recognized in the recognition text may be "Norge”.
  • the voice request is "find me the quickest route to The Big 5 AS”
  • the actual POI is "The Big 5 AS”
  • “The” may be omitted when recognizing in the recognition text, so that the obtained POI is "Big 5 AS”.
  • the navigation word label "i want to go" and the point of interest label "Norge" will be filled in. Since the text portion between the navigation word text and the point of interest text in the recognition text, i.e. "Too Good To Go", cannot be recognized as a corresponding type of entity word, no corresponding label can be filled in; it can therefore be considered that a recognition error exists, the recognition result needs to be corrected, and the first label text is determined not to meet the preset condition.
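The preset condition described here can be sketched as a gap check: if any token between the navigation word text and the point of interest text carries no label, the first label text needs correction. The token/tag representation below is an assumption of this sketch:

```python
def needs_correction(tags, nav_end, poi_start):
    """True when any token between the navigation words (ending at index
    nav_end) and the POI (starting at index poi_start) was left unlabeled
    (None), i.e. the first label text fails the preset condition."""
    return any(tags[i] is None for i in range(nav_end, poi_start))

# "i want to go Too Good To Go Norge": navigation words and the POI are
# labeled, but the gap "Too Good To Go" received no label.
tags = ["NAV", "NAV", "NAV", "NAV", None, None, None, None, "POI"]
print(needs_correction(tags, nav_end=4, poi_start=8))  # True
```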
  • The speech recognition method includes:
  • a navigation result is generated according to the recognition text and the first label text.
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: generate a navigation result according to the recognition text and the first label text when the text portion between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label.
  • the text portion between the navigation word text and the point of interest text in the recognition text corresponds to a navigation word label or a point of interest label
  • the preset condition can be understood as being used to determine whether the first label text can be used to directly generate a navigation result.
  • Step 03 correcting the first label text and generating the second label text includes:
  • the score of each interest point combination is determined according to a preset word list and word frequency features, wherein each word of the recognition text has a corresponding word frequency feature;
  • the interest point label is modified to correspond to an interest point combination with the highest score, and a second label text is generated according to the navigation word label and the modified interest point label.
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: obtain at least two interest point combinations according to the text part and the interest point text; determine the score of each interest point combination according to the preset word list and word frequency characteristics, and each word of the recognition text has a corresponding word frequency feature; correct the interest point label to correspond to an interest point combination with the highest score, and generate a second label text according to the navigation word label and the corrected interest point label.
  • the voice request is "I want to go Too Good To Go Norge”.
  • multiple interest point combinations can be obtained from "Too Good To Go" and "Norge" by longest-first matching, such as "Too Good To Go Norge", "Good To Go Norge", "To Go Norge", "Go Norge" and "Norge".
  • the number of times all words in the POI combination co-occur can be found according to the preset word list.
  • the preset word list can be a multi-country POI word list.
  • the preset word list can be obtained through open source data and corresponding partners.
  • the co-occurrence count of all the words can correspond to the co-occurrence frequencies of bigrams and trigrams, which is used to determine how many times all the words in the POI combination appear in the same POI at the same time (the weighted co-occurrence frequency feature).
  • the number of times the POI combination "Too Good” appears in the same POI at the same time is 21, so the result (Too, Good, 21) can be obtained, and the number of times the POI combination "Too Good To” appears in the same POI at the same time is 17, so the result (Too, Good, To, 17) can be obtained.
  • Each word (entity word) in the interest point combination has a corresponding word frequency feature.
  • the word frequency feature can be calculated from term frequency (tf) and inverse document frequency (idf). After the word frequency features are calculated, the results can be stored and retrieved directly when needed.
  • tf*idf represents the frequency feature of a specific word in the interest point combination
  • sum(tf*idf) represents the sum of the frequency features of all words in the interest point combination
  • Fp represents the penalty factor
  • Fw represents the weighted co-occurrence frequency feature.
  • the penalty factor can be determined according to the specific form of the specific interest point combination and the current POI recognition business. Different interest point combinations can have different penalty factors.
  • the corresponding interest point label can be determined according to the interest point combination with the highest score.
  • the interest point combination with the highest score is "Too Good To Go Norge”
  • “Too Good To Go Norge” will be filled as the interest point label, and the navigation word label is still "i want to go”, so as to obtain the second label text according to the corrected result.
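The candidate generation and scoring described above can be sketched as follows. The patent names the ingredients (sum(tf*idf), the weighted co-occurrence frequency feature Fw, and the penalty factor Fp) but not the exact formula, so the combination sum(tf*idf) * Fw / Fp used below is an assumption, as are the toy statistics:

```python
def poi_candidates(gap_words, poi_text):
    """Longest-first match: prepend each suffix of the unlabeled gap
    (e.g. "Too Good To Go") to the recognized POI (e.g. "Norge")."""
    combos = [" ".join(gap_words[i:] + [poi_text])
              for i in range(len(gap_words))]
    return combos + [poi_text]

def score(candidate, tfidf, cooccur, penalty):
    """Score one interest point combination (assumed formula)."""
    words = candidate.split()
    s = sum(tfidf.get(w, 0.0) for w in words)   # sum(tf*idf)
    fw = cooccur.get(tuple(words), 1.0)         # Fw: weighted co-occurrence
    fp = penalty.get(candidate, 1.0)            # Fp: per-combination penalty
    return s * fw / fp

cands = poi_candidates(["Too", "Good", "To", "Go"], "Norge")
tfidf = {w: 1.0 for w in ["Too", "Good", "To", "Go", "Norge"]}  # toy values
cooccur = {("Too", "Good", "To", "Go", "Norge"): 21.0}          # toy values
best = max(cands, key=lambda c: score(c, tfidf, cooccur, {}))
print(best)  # 'Too Good To Go Norge'
```

With these toy statistics the full combination wins, so its text would be filled into the interest point label of the second label text.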
  • The speech recognition method includes:
  • the recognized text is split into multiple text segments, and each point of interest entity is located in one text segment;
  • the preset tag tree obtain the tag corresponding to each text segment, which includes navigation word tags and point of interest tags;
  • Corresponding labels are filled in for the multiple text segments. When the number of interest point entities is at least two, one interest point entity is filled with an interest point type label and at least one interest point entity is filled with an interest point limited name label; the interest point labels include interest point type labels and interest point limited name labels, which have a corresponding dependency relationship in the label tree.
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: determine at least one point of interest entity in the recognized text; split the recognized text into multiple text segments, each point of interest entity is located in a text segment; according to the preset tag tree, obtain the label corresponding to each text segment, the label includes a navigation word label and a point of interest label; fill the corresponding labels for the multiple text segments, wherein, when the number of point of interest entities is at least two, one of the point of interest entities is filled with a point of interest type label, and at least one point of interest entity is filled with a point of interest limited name label, the point of interest label includes a point of interest type label and a point of interest limited name label, and the point of interest type label and the point of interest limited name label have a corresponding dependency relationship in the tag tree.
  • the recognized text is "Please go to Sykehus on my way Det juridiske fakultet go highways”.
  • the points of interest in the recognized text include “Sykehus” and “Det juridiske fakultet”.
  • the recognized text is split into multiple text segments: "Please go to”, “Sykehus”, “on my way”, “Det juridiske fakultet", “go highways”.
  • the point of interest "Sykehus” is located in the second text segment.
  • the point of interest "Det juridiske fakultet” is located in the fourth text segment.
  • the text with similar meanings of "Please go to” in the entity vocabulary may include “navigate to”, so “Please go to” can be recognized as “navigate to”. "On my way” and “go highways” can be found in the entity vocabulary, and thus they will be recognized as “on my way” and “go highways” respectively.
  • the recognized text is mapped to the corresponding label according to the mapping relationship between the entity vocabulary and the label tree.
  • the label corresponding to "go to” is "kw_navigate (knowledge: navigation)”
  • the label corresponding to "on my way” is “on_my_way”
  • the label corresponding to "go highways” is "route_preference”.
  • When the number of point of interest entities is at least two, it can be understood that if all of them were filled with the same point of interest label, the actual locations corresponding to all of them might be used as navigation destinations.
  • the logical relationship between multiple point of interest entities in the text can be clearly identified.
  • the point of interest "Sykehus” is the actual navigation destination to be traveled to
  • the point of interest "Det juridiske fakultet” represents the positional relationship with the navigation destination, which can be used to determine the location of the navigation destination.
  • the label corresponding to the point of interest "Sykehus” should be the point of interest type label
  • the label corresponding to the point of interest "Det juridiske fakultet” should be the point of interest limited name label. Therefore, in the process of label filling, the label of the point of interest "Sykehus” is filled with the point of interest type label (POI_type), and the label of the point of interest "Det juridiske fakultet” is filled with the point of interest limited name label (limit_name), thereby achieving the effect of transcribing the semantic labels of the recognized text.
  • POI_type: the point of interest type label;
  • limit_name: the point of interest limited name label.
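Label filling for the two-POI example can be sketched as below. The rule that the first point of interest entity is the destination (POI_type) and later ones are location constraints (limit_name) is an assumption made to reproduce the example, and the segment-to-label map is likewise hypothetical:

```python
SEGMENT_LABELS = {  # hypothetical entries reached via the entity vocabulary
    "Please go to": "kw_navigate",      # matched via "navigate to"
    "on my way": "on_my_way",
    "go highways": "route_preference",
}

def fill_labels(segments, poi_entities):
    """Fill a label per text segment; the first POI entity gets POI_type
    (the destination), subsequent POI entities get limit_name."""
    labels, poi_seen = [], 0
    for seg in segments:
        if seg in poi_entities:
            labels.append("POI_type" if poi_seen == 0 else "limit_name")
            poi_seen += 1
        else:
            labels.append(SEGMENT_LABELS.get(seg, "O"))
    return labels

segments = ["Please go to", "Sykehus", "on my way",
            "Det juridiske fakultet", "go highways"]
print(fill_labels(segments, {"Sykehus", "Det juridiske fakultet"}))
# ['kw_navigate', 'POI_type', 'on_my_way', 'limit_name', 'route_preference']
```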
  • The speech recognition method includes:
  • the speech recognition method of the present application can be implemented by the server 10 of the present application.
  • the processor 12 is used to: generate a mapping relationship between core text and tags according to the intention vocabulary and the tag mapping table; perform sentence combination on multiple tags to generate multiple tag sentences, in which different tags have a dependency relationship; and construct a tag tree according to multiple tag sentences.
  • the tag tree can be constructed.
  • Figure 4 shows one possible label tree.
  • the core text located in the core intent vocabulary can be mapped to the label mapping table.
  • the text segment can be mapped to the corresponding label in the label mapping table.
  • the core text "navigate to” can be mapped to the label "K: navigate” (corresponding to "Knowledge: Navigation” shown in Figure 4).
  • the recognized text is "go to KFC", where "go to” and “navigate to” have similar meanings
  • KFC is recognized as a point of interest
  • the corresponding label sentence can be obtained as "K: navigate POI_type", where "POI_type” can correspond to "point of interest_type” in Figure 4, and the label "K: navigate” and the label "POI_type” in the label sentence form a dependent relationship.
  • the label "POI_type” corresponds to the point of interest type label.
  • the original file format of the label tree is: "template”:"[D:POI_NAME@poi_name][K:nearby][D:POI_ADDRESS
  • the core text mapped by the tag “K: nearby” may include “close to” and “near.”
  • the tag “limit_name” corresponds to the interest point limit name tag.
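Constructing a tag tree from multiple tag sentence patterns, where consecutive labels form a dependency, can be sketched with a nested dictionary; the patterns below are hypothetical examples in the spirit of Figure 4:

```python
def build_tag_tree(patterns):
    """Merge label sentence patterns into a nested-dict tag tree: each
    label depends on its parent, and shared prefixes share one branch."""
    tree = {}
    for pattern in patterns:
        node = tree
        for label in pattern:
            node = node.setdefault(label, {})
    return tree

patterns = [
    ["K:navigate", "POI_type"],
    ["K:navigate", "POI_type", "on_my_way", "limit_name"],
    ["K:nearby", "limit_name"],
]
tree = build_tag_tree(patterns)
print(sorted(tree))  # ['K:navigate', 'K:nearby']
```

Because the first two patterns share the prefix ["K:navigate", "POI_type"], they occupy a single branch, which is what lets the tree encode the dependency between POI_type and limit_name.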
  • the speech recognition method of the present application can achieve the following effects:
  • a mixed-language navigation semantic understanding solution is proposed, which can be extended to scenarios involving multiple languages within one country;
  • the POI extraction algorithm based on CharBERT + mBERT can help extract points of interest in mixed languages;
  • a speech recognition system 30 of the present application includes a server 10 and a vehicle 20.
  • the server 10 is used to: receive a speech request; obtain a recognition text, the recognition text is obtained by performing speech recognition on the speech request; recognize the recognition text according to a preset model to obtain a first label text; if it is determined that the first label text does not meet the preset conditions, correct the first label text and generate a second label text; and generate a navigation result according to the recognition text and the second label text.
  • the vehicle 20 is used to: send a speech request; and receive a navigation result.
  • if the first label text does not meet the preset condition, the result recognized by the preset model cannot represent the correct navigation intention, so correction processing is performed on the basis of the first label text to generate a second label text that can represent the correct navigation intention, and the navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
  • the vehicle 20 may include an on-board terminal 21.
  • the vehicle 20 may obtain a voice request issued by a user through the on-board terminal 21, and send the obtained voice request to the server 10.
  • the server 10 may receive a voice request sent by the on-board terminal 21.
  • the voice request is transmitted to the processor 12, so that the processor 12 finally generates a navigation result according to the voice request.
  • the server 10 may transmit the navigation result to the vehicle 20, and the vehicle 20 may receive the navigation result through the on-board terminal 21, and may feed back the navigation result to the user (such as displaying it to the user, or informing the user through voice broadcast).
  • a computer-readable storage medium stores a computer program, which, when executed by a processor, implements the steps of any of the above-mentioned speech recognition methods.
  • the computer-readable storage medium may be provided in the server 10 or in other terminals.
  • the server 10 may communicate with other terminals to obtain the corresponding program.
  • computer-readable storage media may include: any entity or device capable of carrying a computer program, recording media, USB flash drives, mobile hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), and software distribution media, etc.
  • a computer program includes computer program code. The computer program code may be in source code form, object code form, executable file, or some intermediate form, etc.
  • Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the present application includes additional implementations in which the functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
  • the logic and/or steps represented in the flowchart or otherwise described herein, for example, can be considered as an ordered list of executable instructions for implementing logical functions, and can be specifically implemented in any computer-readable medium for use by an instruction execution system, device or apparatus (such as a computer-based system, a system including a processing module, or other system that can fetch instructions from an instruction execution system, device or apparatus and execute instructions), or used in combination with these instruction execution systems, devices or apparatuses.
  • first and second are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features.
  • the features defined as “first” and “second” may explicitly or implicitly include one or more of the features.
  • the meaning of “plurality” is two or more, unless otherwise clearly and specifically defined.


Abstract

A speech recognition method, and a server, a speech recognition system and a readable storage medium. The speech recognition method comprises: acquiring a recognition text, which is obtained by means of performing speech recognition on a speech request (01); recognizing the recognition text according to a preset model, so as to obtain a first label text (02); when it is determined that the first label text does not satisfy a preset condition, performing correction processing on the first label text, and then generating a second label text (03); and generating a navigation result according to the recognition text and the second label text (04).

Description

Speech recognition method, server, speech recognition system and readable storage medium
This application claims priority to Chinese Patent Application No. 202211170954.8, filed on September 26, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of vehicle navigation technology, and in particular to a speech recognition method, a server, a speech recognition system and a readable storage medium.
Background
During voice navigation, it may be necessary to recognize mixed-language speech involving low-resource languages. Single-language recognition is prone to the OOV (out-of-vocabulary) problem. Related technologies can mitigate the OOV problem to a certain extent, but sub-words may still carry incomplete information, resulting in poor speech recognition performance on mixed languages.
Technical Solutions
The present application provides a speech recognition method, a server, a speech recognition system and a readable storage medium.
A speech recognition method of the present application includes:
acquiring a recognition text, the recognition text being obtained by performing speech recognition on a voice request;
recognizing the recognition text according to a preset model to obtain a first label text;
when it is determined that the first label text does not satisfy a preset condition, performing correction processing on the first label text and generating a second label text; and
generating a navigation result according to the recognition text and the second label text.
In the above speech recognition method, the first label text failing to satisfy the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and a navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
Recognizing the recognition text according to the preset model to obtain the first label text includes:
performing embedding processing on the recognition text according to the preset model to obtain an embedded text;
performing encoding processing on the recognition text according to the preset model to obtain an encoded text; and
generating the first label text according to the embedded text and the encoded text.
In this way, the accuracy of recognizing the recognition text can be improved.
The speech recognition method includes:
determining a navigation word label and a point of interest label in the first label text, the navigation word label corresponding to the navigation word text in the recognition text, and the point of interest label corresponding to the point of interest text in the recognition text; and
when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to neither the navigation word label nor the point of interest label, determining that the first label text does not satisfy the preset condition.
This provides a concrete way to determine whether the first label text needs to be corrected.
The speech recognition method includes:
when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label, generating the navigation result according to the recognition text and the first label text.
In this way, when it is determined that no correction is needed, the navigation result can be obtained directly.
Performing correction processing on the first label text and generating the second label text includes:
obtaining at least two point of interest combinations according to the text portion and the point of interest text;
determining a score for each point of interest combination according to a preset vocabulary and word frequency features, each word of the recognition text having a corresponding word frequency feature; and
correcting the point of interest label to correspond to the point of interest combination with the highest score, and generating the second label text according to the navigation word label and the corrected point of interest label.
This provides a concrete scheme for correcting the first label text.
The speech recognition method includes:
determining at least one point of interest entity in the recognition text;
splitting the recognition text into a plurality of text segments, each point of interest entity being located within one text segment;
obtaining, according to a preset tag tree, a tag corresponding to each text segment, the tags including navigation word tags and point of interest tags; and
filling the plurality of text segments with the corresponding tags, wherein, when there are at least two point of interest entities, one point of interest entity is filled as a point of interest type tag and at least one point of interest entity is filled as a point of interest qualifier tag, the point of interest tags including the point of interest type tag and the point of interest qualifier tag, which have a corresponding dependency relationship in the tag tree.
This helps improve semantic understanding of complex sentence patterns.
The speech recognition method includes:
generating a mapping relationship between core texts and tags according to an intention vocabulary and a tag mapping table;
combining a plurality of the tags into a plurality of tag sentence patterns, in which different tags have dependency relationships; and
constructing a tag tree according to the plurality of tag sentence patterns.
In this way, the tag tree can be constructed.
A server of the present application includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the steps of any one of the above speech recognition methods are implemented.
In the above server, the first label text failing to satisfy the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and a navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
A speech recognition system of the present application includes a server and a vehicle. The server is configured to:
receive a voice request;
acquire a recognition text, the recognition text being obtained by performing speech recognition on the voice request;
recognize the recognition text according to a preset model to obtain a first label text;
when it is determined that the first label text does not satisfy a preset condition, perform correction processing on the first label text and generate a second label text; and
generate a navigation result according to the recognition text and the second label text.
The vehicle is configured to:
send the voice request; and
receive the navigation result.
In the above speech recognition system, the first label text failing to satisfy the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and a navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
A computer-readable storage medium of the present application stores a computer program which, when executed by a processor, implements the steps of any one of the above speech recognition methods.
Beneficial Effects
In the above computer-readable storage medium, the first label text failing to satisfy the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and a navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
Additional aspects and advantages of the present application will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present application.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the present application taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of the speech recognition method of the present application;
FIG. 2 is a block diagram of the server of the present application;
FIG. 3 is a schematic diagram of recognizing the recognition text through the preset model in the present application;
FIG. 4 is a schematic diagram of the tag tree of the present application;
FIG. 5 is a schematic diagram of the speech recognition system of the present application.
Description of main reference numerals:
server 10, memory 11, processor 12;
vehicle 20, vehicle-mounted terminal 21;
speech recognition system 30.
Embodiments of the Present Invention
The embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are intended only to explain the present application, and should not be construed as limiting it.
Referring to FIG. 1, a speech recognition method of the present application includes:
01: acquiring a recognition text, the recognition text being obtained by performing speech recognition on a voice request;
02: recognizing the recognition text according to a preset model to obtain a first label text;
03: when it is determined that the first label text does not satisfy a preset condition, performing correction processing on the first label text and generating a second label text;
04: generating a navigation result according to the recognition text and the second label text.
The speech recognition method of the present application may be implemented by the server 10 of the present application. Specifically, referring to FIG. 2, the server 10 includes a memory 11 and a processor 12. The memory 11 stores a computer program, and the processor 12 can execute the computer program to implement the steps of the speech recognition method of the present application. Specifically, the processor 12 is configured to: acquire a recognition text obtained by performing speech recognition on a voice request; recognize the recognition text according to a preset model to obtain a first label text; when it is determined that the first label text does not satisfy a preset condition, perform correction processing on the first label text and generate a second label text; and generate a navigation result according to the recognition text and the second label text.
In the above speech recognition method and server 10, the first label text failing to satisfy the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention. Correction processing is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and a navigation result is generated according to the recognition text and the second label text, which helps improve the recognition of mixed languages.
The voice request corresponds to received voice information uttered by a user. The voice request may be in a mixed language, and the mixed language may include different languages.
The recognition text is the text obtained by performing speech recognition on the voice request, which may be implemented through ASR (Automatic Speech Recognition).
In addition, after the recognition text is acquired, semantic recognition may first be performed on it to determine the domain to which the user's voice request belongs. In one application scenario, the voice request may be "What's the weather like today?", and the corresponding recognition text is determined to belong to the weather-related domain. In another application scenario, the voice request may be "Play another song", and the corresponding recognition text is determined to belong to the music-related domain. In another application scenario, the voice request may be "Navigate to the train station", and the corresponding recognition text is determined to belong to the navigation-related domain. After the domain of the recognition text is determined, the recognition text can be distributed to the processing module of the corresponding domain through a central controller.
When the recognition text is distributed to the processing module of the corresponding domain, the POI (Point of Interest) in the recognition text can be recognized through the preset model. For example, when the recognition text is in a mixed language, the preset model can improve generalization over POIs in the recognition text, so that the corresponding POI can be recognized even if the part of the recognition text corresponding to the POI lies outside the training set. After the corresponding POI is recognized, the recognition result can be uploaded to a rule engine to determine whether it needs to be corrected.
Specifically, the recognition result includes the first label text. The first label text is the text information formed by the plurality of labels generated, during POI recognition of the recognition text, according to the types of the corresponding content of the recognition text. The main navigation-related information in the recognition text can be determined from the first label text.
After the first label text is obtained, whether it satisfies the preset condition can be judged, and whether the recognition result is sufficiently accurate can be determined from the judgment result. If it is determined that the first label text does not satisfy the preset condition, correction processing is performed on the first label text to generate a second label text that correctly represents the navigation information in the recognition text; a navigation result is then generated according to the recognition text and the second label text, and the user can determine information about the POI in the recognition text from the navigation result.
Step 02 (recognizing the recognition text according to the preset model to obtain the first label text) includes:
performing embedding processing on the recognition text according to the preset model to obtain an embedded text;
performing encoding processing on the recognition text according to the preset model to obtain an encoded text; and
generating the first label text according to the embedded text and the encoded text.
The speech recognition method of the present application may be implemented by the server 10 of the present application. Specifically, referring to FIG. 2, the processor 12 is configured to: perform embedding processing on the recognition text according to the preset model to obtain an embedded text; perform encoding processing on the recognition text according to the preset model to obtain an encoded text; and generate the first label text according to the embedded text and the encoded text.
In this way, the accuracy of recognizing the recognition text can be improved.
Specifically, referring to FIG. 3, when the recognition text is recognized according to the preset model, embedding processing and encoding processing are performed on the recognition text separately. In the embedding processing, language embedding, position embedding and token embedding are applied to the recognition text. In the encoding processing, a character encoder is applied to the recognition text. After the embedding and encoding processing are completed, the processing results can be input into a transformer for conversion to generate the first label text. The embedding processing of the recognition text can be implemented through an mBERT model (Multilingual Bidirectional Encoder Representations from Transformers), and the encoding processing can be implemented through a CharBERT model.
On this basis, performing POI recognition by combining the CharBERT and mBERT models generalizes well to POIs containing low-frequency words from low-resource languages. Compared with a traditional pre-trained model (such as a BERT model), it recognizes POIs with blurred mixed-language boundaries better, improving overall accuracy by 3% in mixed-language recognition scenarios.
In one application scenario, the recognition text is "go to Sykehus near Det juridiske fakultet". After recognizing it according to the preset model, several pieces of text information are obtained: "go—O", "to—O", "Sykehus—S-POI", "near—O", "Det—B-POI", "juridiske—I-POI", "fakultet—E-POI". Here, "go—O", "to—O" and "near—O" indicate that "go", "to" and "near" in the recognition text are recognized as entity words of other types; "Sykehus—S-POI" indicates that "Sykehus" is recognized as a single word representing a POI; "Det—B-POI" indicates that "Det" is recognized as the beginning of a compound word representing a POI; "juridiske—I-POI" indicates that "juridiske" is recognized as the middle of that compound word; and "fakultet—E-POI" indicates that "fakultet" is recognized as its end. The resulting first label text may be:
"entities": [{"word": "Sykehus","start": 2,"end": 3,"type": "POI"}, {"word": "Det juridiske fakultet","start": 4,"end": 7,"type": "POI"}]
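As a minimal structural sketch of the pipeline shown in FIG. 3, the per-token word-level embeddings (language, position and token embeddings) can be fused with the character-encoder output before the transformer. Fusion by concatenation and the toy sizes below are assumptions made purely for illustration, not the claimed model:

```python
# Illustrative sketch: fuse mBERT-style word-level embeddings with a
# CharBERT-style character-encoder output, one vector per token.
# Concatenation and the dimensions are assumptions, not the claimed design.

seq_len, d_word, d_char = 7, 8, 4      # one vector per token of the example

word_emb = [[0.0] * d_word for _ in range(seq_len)]   # embedding branch
char_emb = [[0.0] * d_char for _ in range(seq_len)]   # character-encoder branch

# Per-token concatenation; each fused vector would feed the transformer.
fused = [w + c for w, c in zip(word_emb, char_emb)]

assert len(fused) == seq_len and len(fused[0]) == d_word + d_char
```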
The speech recognition method includes:
determining a navigation word label and a point of interest label in the first label text, the navigation word label corresponding to the navigation word text in the recognition text, and the point of interest label corresponding to the point of interest text in the recognition text; and
when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to neither the navigation word label nor the point of interest label, determining that the first label text does not satisfy the preset condition.
The speech recognition method of the present application may be implemented by the server 10 of the present application. Specifically, referring to FIG. 2, the processor 12 is configured to: determine a navigation word label and a point of interest label in the first label text, the navigation word label corresponding to the navigation word text in the recognition text, and the point of interest label corresponding to the point of interest text in the recognition text; and, when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to neither the navigation word label nor the point of interest label, determine that the first label text does not satisfy the preset condition.
This provides a concrete way to determine whether the first label text needs to be corrected.
Specifically, in some application scenarios, recognizing the recognition text according to the preset model may produce recognition errors. For the voice request "i want to go Too Good To Go Norge", the actual POI is "Too Good To Go Norge", but the POI recognized from the recognition text may be "Norge". For the voice request "find me the quickest route to The Big 5 AS", the actual POI is "The Big 5 AS", but "The" may be missed during recognition, yielding "Big 5 AS". For the voice request "search for A 2 Pas Quadris", the actual POI is "A 2 Pas Quadris", but "A" may be missed, yielding "2 Pas Quadris". Such scenarios are more likely to occur when performing POI recognition on mixed languages.
Taking the voice request "i want to go Too Good To Go Norge" as an example, when the corresponding first label text is obtained, "i want to go" is filled with the navigation word label and "Norge" is filled with the point of interest label. Since the text portion between the navigation word text and the point of interest text, namely "Too Good To Go", cannot be recognized as an entity word of a corresponding type, no corresponding label can be filled; a recognition error can therefore be considered to exist, the recognition result needs to be corrected, and it is determined that the first label text does not satisfy the preset condition.
The speech recognition method includes:
when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label, generating the navigation result according to the recognition text and the first label text.
The speech recognition method of the present application may be implemented by the server 10 of the present application. Specifically, referring to FIG. 2, the processor 12 is configured to: when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label, generate the navigation result according to the recognition text and the first label text.
In this way, when it is determined that no correction is needed, the navigation result can be obtained directly.
Specifically, when the text portion located between the navigation word text and the point of interest text in the recognition text corresponds to the navigation word label or the point of interest label, it can be determined that the first label text satisfies the preset condition; the recognition text can then be correctly interpreted through the first label text, and the navigation result can be generated directly from the recognition text combined with the first label text.
On this basis, the preset condition can be understood as being used to determine whether the first label text can be used to directly generate the navigation result.
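The preset-condition check in this scenario can be sketched as follows. This is a minimal illustration only: the (start, end, label) span representation and all names are assumptions, not the claimed implementation.

```python
# Illustrative sketch: the first label text fails the preset condition when
# some token between the labelled spans carries neither the navigation word
# label nor the point of interest label.

def needs_correction(tokens, labeled_spans):
    """Return True when some token index is covered by no label span."""
    covered = set()
    for start, end, _label in labeled_spans:   # `end` is exclusive
        covered.update(range(start, end))
    return any(i not in covered for i in range(len(tokens)))

tokens = ["i", "want", "to", "go", "Too", "Good", "To", "Go", "Norge"]
spans = [(0, 4, "NAV"), (8, 9, "POI")]       # "Too Good To Go" is unlabelled
print(needs_correction(tokens, spans))        # True: correction is required
```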
步骤03(对第一标签文本进行修正处理并生成第二标签文本),包括:Step 03 (correcting the first label text and generating the second label text) includes:
根据文本部分和兴趣点文本,得到至少两个兴趣点组合;Obtain at least two interest point combinations according to the text part and the interest point text;
根据预设词表和词频特征,确定每个兴趣点组合的得分,识别文本的每个词都具有对应的词频特征;According to the preset word list and word frequency features, the score of each interest point combination is determined, and each word of the recognition text has a corresponding word frequency feature;
将兴趣点标签修正为对应具有最高得分的一个兴趣点组合,根据导航词标签和修正后的兴趣点标签生成第二标签文本。The interest point label is modified to correspond to an interest point combination with the highest score, and a second label text is generated according to the navigation word label and the modified interest point label.
本申请的语音识别方法可以通过本申请的服务器10来实现。具体地,请结合图2,处理器12用于:根据文本部分和兴趣点文本,得到至少两个兴趣点组合;根据预设词表和词频特征,确定每个兴趣点组合的得分,识别文本的每个词都具有对应的词频特征;将兴趣点标签修正为对应具有最高得分的一个兴趣点组合,根据导航词标签和修正后的兴趣点标签生成第二标签文本。The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, please refer to FIG2, the processor 12 is used to: obtain at least two interest point combinations according to the text part and the interest point text; determine the score of each interest point combination according to the preset word list and word frequency characteristics, and each word of the recognition text has a corresponding word frequency feature; correct the interest point label to correspond to an interest point combination with the highest score, and generate a second label text according to the navigation word label and the corrected interest point label.
如此,可实现对第一标签文本进行修正的具体方案。In this way, a specific solution for correcting the first label text can be implemented.
具体地，在一个应用场景中，语音请求为“i want to go Too Good To Go Norge”。在确定第一标签文本不满足预设条件的情况下，根据“Too Good To Go”、“Norge”按照最长优先匹配计算可得到多个兴趣点组合，如“Too Good To Go Norge”、“Good To Go Norge”、“To Go Norge”、“Go Norge”、“Norge”。Specifically, in one application scenario, the voice request is "i want to go Too Good To Go Norge". When the first label text is determined not to meet the preset condition, multiple point-of-interest combinations can be derived from "Too Good To Go" and "Norge" by longest-match-first computation, such as "Too Good To Go Norge", "Good To Go Norge", "To Go Norge", "Go Norge", and "Norge".
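The longest-match-first candidate generation above can be sketched as follows. The tokenization and suffix-enumeration strategy is an assumption inferred from the listed candidates, not the patent's actual implementation; `poi_combinations` is a hypothetical helper name.

```python
def poi_combinations(ambiguous_part, poi_text):
    # Assumption: candidates are formed by starting from the full token span
    # ("Too Good To Go Norge") and repeatedly dropping the leading token,
    # keeping the recognized POI tail ("Norge") fixed — longest match first.
    tokens = ambiguous_part.split() + poi_text.split()
    return [" ".join(tokens[i:]) for i in range(len(ambiguous_part.split()) + 1)]

print(poi_combinations("Too Good To Go", "Norge"))
# ['Too Good To Go Norge', 'Good To Go Norge', 'To Go Norge', 'Go Norge', 'Norge']
```

The last candidate is always the POI text alone, matching the example in the scenario above.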
在确定对应的一个兴趣点组合后，可以根据预设词表来查找兴趣点组合中所有词共现的次数。预设词表可以为多国家POI词表。预设词表可以通过开源数据和相应的合作方提供来获取。所有词共现的次数可以对应二元组与三元组共现词频，从而可确定兴趣点组合中所有词同时出现在同一个POI中的次数（加权共现词频特征）。在一个应用场景中，兴趣点组合“Too Good”同时出现在同一个POI中的次数为21，从而可得到结果 (Too,Good,21)，兴趣点组合“Too Good To”同时出现在同一个POI中的次数为17，从而可得到结果 (Too,Good,To,17)。After a corresponding point-of-interest combination is determined, the number of times all the words in the combination co-occur can be looked up in the preset word list. The preset word list may be a multi-country POI vocabulary, obtained from open-source data and from cooperating partners. The co-occurrence counts correspond to bigram and trigram co-occurrence frequencies, from which the number of times all words in the combination appear together in the same POI (the weighted co-occurrence frequency feature) can be determined. In one application scenario, the combination "Too Good" appears in the same POI 21 times, giving the result (Too, Good, 21); the combination "Too Good To" appears in the same POI 17 times, giving the result (Too, Good, To, 17).
兴趣点组合中的每个词（实体词）都具有对应的词频特征。词频特征可以通过词频（tf，term frequency）和逆文本频率（idf，inverse document frequency）来计算得到。在计算得到词频特征后，可以将计算结果进行存储，在需要调用时可以直接获取。Each word (entity word) in a point-of-interest combination has a corresponding word frequency feature, which can be computed from the term frequency (tf) and the inverse document frequency (idf). Once computed, the word frequency features can be stored and retrieved directly when needed.
对于每个兴趣点组合而言,可通过如下的公式来计算确定相应的得分:For each combination of interest points, the corresponding score can be calculated by the following formula:
S=sum(tf*idf)*Fp*FwS=sum(tf*idf)*Fp*Fw
其中,tf*idf表示兴趣点组合中特定的一个词的词频特征,sum(tf*idf)表示兴趣点组合中所有词的词频特征的总和,Fp表示惩罚因子,Fw表示加权共现词频特征。惩罚因子可以根据具体的兴趣点组合的具体形式、当前POI识别业务来确定。不同的兴趣点组合可以具有不同的惩罚因子。Among them, tf*idf represents the frequency feature of a specific word in the interest point combination, sum(tf*idf) represents the sum of the frequency features of all words in the interest point combination, Fp represents the penalty factor, and Fw represents the weighted co-occurrence frequency feature. The penalty factor can be determined according to the specific form of the specific interest point combination and the current POI recognition business. Different interest point combinations can have different penalty factors.
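The scoring formula S = sum(tf*idf) * Fp * Fw translates directly into code. This is a minimal sketch: the function name and the numeric values in the usage line are illustrative, not from the patent.

```python
def combination_score(tfidf_per_word, penalty_factor, cooccurrence_weight):
    # S = sum(tf*idf) * Fp * Fw — sum of per-word tf-idf features,
    # scaled by the penalty factor Fp and the weighted co-occurrence
    # frequency feature Fw of the candidate combination.
    return sum(tfidf_per_word) * penalty_factor * cooccurrence_weight

# Illustrative values: three words with tf-idf features summing to 1.0,
# penalty factor 0.9, and a co-occurrence count of 21 as in the example.
s = combination_score([0.5, 0.3, 0.2], 0.9, 21)
print(s)  # 18.9
```

The candidate combination with the highest S is then kept as the corrected point-of-interest label.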
在得到所有兴趣点组合的得分后,则可根据得分最高的一个兴趣点组合来确定对应的兴趣点标签。在一个应用场景中,在确定得分最高的一个兴趣点组合为“Too Good To Go Norge”的情况下,则会将“Too Good To Go Norge”填充为兴趣点标签,导航词标签仍为“i want to go”,从而根据修正后的结果得到第二标签文本。After obtaining the scores of all interest point combinations, the corresponding interest point label can be determined according to the interest point combination with the highest score. In an application scenario, when it is determined that the interest point combination with the highest score is "Too Good To Go Norge", "Too Good To Go Norge" will be filled as the interest point label, and the navigation word label is still "i want to go", so as to obtain the second label text according to the corrected result.
语音识别方法包括:Speech recognition methods include:
确定识别文本中的至少一个兴趣点实体;Determining at least one point of interest entity in the recognized text;
将识别文本进行文本拆分得到多个文本片段,每个兴趣点实体位于一个文本片段内;The recognized text is split into multiple text segments, and each point of interest entity is located in one text segment;
根据预设的标签树,获取每个文本片段所对应的标签,标签包括导航词标签和兴趣点标签;According to the preset tag tree, obtain the tag corresponding to each text segment, which includes navigation word tags and point of interest tags;
对多个文本片段填充对应的标签,其中,在兴趣点实体的数量为至少两个的情况下,将其中一个兴趣点实体填充为兴趣点类型标签,将至少一个兴趣点实体填充为兴趣点限名标签,兴趣点标签包括兴趣点类型标签和兴趣点限名标签,兴趣点类型标签和兴趣点限名标签在标签树中具有对应的依附关系。Corresponding labels are filled in multiple text fragments, wherein, when the number of interest point entities is at least two, one of the interest point entities is filled with an interest point type label, and at least one interest point entity is filled with an interest point limited name label, the interest point labels include interest point type labels and interest point limited name labels, and the interest point type labels and interest point limited name labels have corresponding dependency relationships in the label tree.
本申请的语音识别方法可以通过本申请的服务器10来实现。具体地,请结合图2,处理器12用于:确定识别文本中的至少一个兴趣点实体;将识别文本进行文本拆分得到多个文本片段,每个兴趣点实体位于一个文本片段内;根据预设的标签树,获取每个文本片段所对应的标签,标签包括导航词标签和兴趣点标签;对多个文本片段填充对应的标签,其中,在兴趣点实体的数量为至少两个的情况下,将其中一个兴趣点实体填充为兴趣点类型标签,将至少一个兴趣点实体填充为兴趣点限名标签,兴趣点标签包括兴趣点类型标签和兴趣点限名标签,兴趣点类型标签和兴趣点限名标签在标签树中具有对应的依附关系。The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, please refer to Figure 2, the processor 12 is used to: determine at least one point of interest entity in the recognized text; split the recognized text into multiple text segments, each point of interest entity is located in a text segment; according to the preset tag tree, obtain the label corresponding to each text segment, the label includes a navigation word label and a point of interest label; fill the corresponding labels for the multiple text segments, wherein, when the number of point of interest entities is at least two, one of the point of interest entities is filled with a point of interest type label, and at least one point of interest entity is filled with a point of interest limited name label, the point of interest label includes a point of interest type label and a point of interest limited name label, and the point of interest type label and the point of interest limited name label have a corresponding dependency relationship in the tag tree.
如此,有利于提高对复杂句式的语义理解能力。This will help improve the ability to understand the semantics of complex sentences.
在一个应用场景中,识别文本为“Please go to Sykehus on my way Det juridiske fakultet go highways”。通过POI识别可确定识别文本中的兴趣点包括“Sykehus”和“Det juridiske fakultet”。然后对识别文本进行拆分得到多个文本片段:“Please go to”、“Sykehus”、“on my way”、“Det juridiske fakultet”、“go highways”。兴趣点“Sykehus”位于第二个文本片段内。兴趣点“Det juridiske fakultet”位于第四个文本片段内。In an application scenario, the recognized text is "Please go to Sykehus on my way Det juridiske fakultet go highways". Through POI recognition, it can be determined that the points of interest in the recognized text include "Sykehus" and "Det juridiske fakultet". Then the recognized text is split into multiple text segments: "Please go to", "Sykehus", "on my way", "Det juridiske fakultet", "go highways". The point of interest "Sykehus" is located in the second text segment. The point of interest "Det juridiske fakultet" is located in the fourth text segment.
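The splitting step above can be sketched as follows, assuming the POI entities occur left-to-right in the text and splitting happens around the first occurrence of each POI string. The helper `split_by_pois` is hypothetical.

```python
def split_by_pois(text, pois):
    # Split the recognized text so that each POI entity ends up in its
    # own segment, with the surrounding free text as separate segments.
    segments, rest = [], text
    for poi in pois:
        before, _, rest = rest.partition(poi)
        if before.strip():
            segments.append(before.strip())
        segments.append(poi)
    if rest.strip():
        segments.append(rest.strip())
    return segments

text = "Please go to Sykehus on my way Det juridiske fakultet go highways"
print(split_by_pois(text, ["Sykehus", "Det juridiske fakultet"]))
# ['Please go to', 'Sykehus', 'on my way', 'Det juridiske fakultet', 'go highways']
```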
在得到所有的文本片段后,则会根据导航对应的实体词表进行正则识别,使得文本片段能够尽可能靠近标签树中的标签主要映射的文本。具体地,“Please go to”在实体词表中具有相近词义的文本可包括“navigate to”,从而可将“Please go to”识别为“navigate to”。“on my way”、“go highways”能够在实体词表中查找得到,从而会分别识别为“on my way”、“go highways”。After all the text fragments are obtained, regular expression recognition will be performed according to the entity vocabulary corresponding to navigation, so that the text fragments can be as close as possible to the text mainly mapped by the tag in the tag tree. Specifically, the text with similar meanings of "Please go to" in the entity vocabulary may include "navigate to", so "Please go to" can be recognized as "navigate to". "On my way" and "go highways" can be found in the entity vocabulary, and thus they will be recognized as "on my way" and "go highways" respectively.
在完成正则识别后,则根据实体词表和标签树中的标签之间的映射关系,将识别到的文本映射为对应的标签。具体地,根据上述的映射关系,“go to”所对应的标签为“kw_navigate(知识:导航)”,“on my way”所对应的标签为“on_my_way”,“go highways”所对应的标签为“route_preference”。After regular expression recognition is completed, the recognized text is mapped to the corresponding label according to the mapping relationship between the entity vocabulary and the label tree. Specifically, according to the above mapping relationship, the label corresponding to "go to" is "kw_navigate (knowledge: navigation)", the label corresponding to "on my way" is "on_my_way", and the label corresponding to "go highways" is "route_preference".
在兴趣点实体的数量为至少两个的情况下，可以理解，若将所有的兴趣点实体均填充为兴趣点标签，则可能会将所有的兴趣点实体所对应的实际地点都作为导航目的地。在前述内容的基础上，根据标签树中存在的兴趣点类型标签和兴趣点限名标签之间的依附关系，则可以明确识别文本中多个兴趣点实体之间的逻辑关系。具体地，兴趣点“Sykehus”为实际的需要前往的导航目的地，兴趣点“Det juridiske fakultet”则表征与导航目的地之间的位置关系，可用于确定导航目的地所在的位置。也就是说，兴趣点“Sykehus”所对应的标签应该为兴趣点类型标签，兴趣点“Det juridiske fakultet”所对应的标签应该为兴趣点限名标签，从而在进行标签填充的过程中，将兴趣点“Sykehus”的标签填充为兴趣点类型标签（POI_type），以及将兴趣点“Det juridiske fakultet”的标签填充为兴趣点限名标签（limit_name），从而实现对识别文本的语义标签转写的效果，在实际的语义识别场景中，则有利于提高对复杂句式的语义理解能力，减少由于无法区分识别出的多个兴趣点而导致识别错误的情况。When there are at least two point-of-interest entities, it can be understood that if all of them were filled with point-of-interest labels, the actual locations corresponding to all of them might be treated as navigation destinations. On the basis of the foregoing, the dependency relationship between the point-of-interest type label and the point-of-interest limited-name label in the tag tree makes it possible to clearly identify the logical relationship between the multiple point-of-interest entities in the text. Specifically, the point of interest "Sykehus" is the actual navigation destination, while "Det juridiske fakultet" characterizes a positional relationship with the destination and can be used to determine where the destination is. That is, the label for "Sykehus" should be the point-of-interest type label and the label for "Det juridiske fakultet" should be the point-of-interest limited-name label; during label filling, "Sykehus" is therefore filled with the type label (POI_type) and "Det juridiske fakultet" with the limited-name label (limit_name), achieving semantic label transcription of the recognized text. In actual semantic recognition scenarios, this helps improve the semantic understanding of complex sentences and reduces recognition errors caused by the inability to distinguish between multiple recognized points of interest.
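The label-filling decision above can be sketched as follows. The helper `fill_poi_labels` and the rule of keying on a known destination are hypothetical simplifications of the tag-tree dependency logic; only the label names POI_type and limit_name come from the text.

```python
def fill_poi_labels(pois, destination):
    # When at least two POI entities are found, the actual destination gets
    # the POI type label and every other entity — which only constrains the
    # destination's location — gets the limited-name label.
    labels = {}
    for poi in pois:
        labels[poi] = "POI_type" if poi == destination else "limit_name"
    return labels

print(fill_poi_labels(["Sykehus", "Det juridiske fakultet"], destination="Sykehus"))
# {'Sykehus': 'POI_type', 'Det juridiske fakultet': 'limit_name'}
```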
语音识别方法包括:Speech recognition methods include:
根据意图词表和标签映射表,生成核心文本和标签的映射关系;Generate the mapping relationship between core text and tags based on the intent vocabulary and tag mapping table;
对多个标签进行语句组合生成多个标签句式,在标签句式中,不同的标签具有依附关系;Combining multiple labels into multiple label sentences, in which different labels have dependency relationships;
根据多个标签句式构建标签树。Construct a tag tree based on multiple tag sentences.
本申请的语音识别方法可以通过本申请的服务器10来实现。具体地,请结合图2,处理器12用于:根据意图词表和标签映射表,生成核心文本和标签的映射关系;对多个标签进行语句组合生成多个标签句式,在标签句式中,不同的标签具有依附关系;根据多个标签句式构建标签树。The speech recognition method of the present application can be implemented by the server 10 of the present application. Specifically, referring to FIG. 2 , the processor 12 is used to: generate a mapping relationship between core text and tags according to the intention vocabulary and the tag mapping table; perform sentence combination on multiple tags to generate multiple tag sentences, in which different tags have a dependency relationship; and construct a tag tree according to multiple tag sentences.
如此,可实现对标签树的构建。In this way, the tag tree can be constructed.
请结合图4，图4所示为可实现的一个标签树。具体地，根据核心意图词表和标签映射表，可以将位于核心意图词表中的核心文本映射至标签映射表中。在对识别文本进行识别的过程中，若识别出相应的文本片段为核心文本，或其近似词义的文本为核心文本，则可以将文本片段映射到标签映射表中对应的标签。在一个应用场景中，核心文本“navigate to”可以映射至标签“K:navigate”（可对应图4所示的“知识：导航”）。Please refer to FIG. 4, which shows one possible tag tree. Specifically, according to the core intent vocabulary and the tag mapping table, core texts in the core intent vocabulary can be mapped into the tag mapping table. During recognition of the recognized text, if a text segment is identified as a core text, or as text with a meaning similar to a core text, the segment can be mapped to the corresponding tag in the tag mapping table. In one application scenario, the core text "navigate to" maps to the tag "K:navigate" (corresponding to "Knowledge: Navigation" shown in FIG. 4).
在确定上述的映射关系的情况下,根据核心文本之间的语义关系,可以将多个标签进行语句组合来得到对应的标签句式。在一个应用场景中,识别文本为“go to KFC”,其中,“go to”与“navigate to”的词义相近,“KFC”则被识别为兴趣点,从而可得到对应的标签句式为“K:navigate POI_type”,其中,“POI_type”可对应图4中的“兴趣点_类型”,标签句式中的标签“K:navigate”和标签“POI_type”则形成依附关系。其中,标签“POI_type”对应兴趣点类型标签。In the case of determining the above mapping relationship, according to the semantic relationship between the core texts, multiple tags can be combined into sentences to obtain the corresponding label sentence. In an application scenario, the recognized text is "go to KFC", where "go to" and "navigate to" have similar meanings, and "KFC" is recognized as a point of interest, so that the corresponding label sentence can be obtained as "K: navigate POI_type", where "POI_type" can correspond to "point of interest_type" in Figure 4, and the label "K: navigate" and the label "POI_type" in the label sentence form a dependent relationship. Among them, the label "POI_type" corresponds to the point of interest type label.
在上述基础上,在根据具体的语句组合得到多个标签句式的情况下,则可将多个标签句式进行整合,最终得到标签树。在一个应用场景中,标签树的原始文件格式为:"template":"[D:POI_NAME@poi_name][K:nearby][D:POI_ADDRESS|DISTRICT@limit_address]"On the basis of the above, when multiple label sentences are obtained according to specific sentence combinations, multiple label sentences can be integrated to finally obtain a label tree. In an application scenario, the original file format of the label tree is: "template":"[D:POI_NAME@poi_name][K:nearby][D:POI_ADDRESS|DISTRICT@limit_address]"
另外,在图4中,标签“知识:附近”(K:nearby)所映射的核心文本可以包括“close to”、“near”。标签“限定_名称”(limit_name)则对应兴趣点限名标签。In addition, in FIG4 , the core text mapped by the tag “K: nearby” may include “close to” and “near.” The tag “limit_name” corresponds to the interest point limit name tag.
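The core-text-to-label mapping described above can be illustrated with a small lookup table. The `LABEL_MAP` dict below is a hypothetical stand-in for the intent vocabulary and tag mapping table; its entries are drawn from the examples in the text.

```python
# Hypothetical mapping from normalized core texts to tag-tree labels,
# populated with the example pairs given in the description.
LABEL_MAP = {
    "navigate to": "K:navigate",
    "close to": "K:nearby",
    "near": "K:nearby",
    "on my way": "on_my_way",
    "go highways": "route_preference",
}

def map_segment(segment):
    # A text segment (after regex-based normalization, e.g. "Please go to"
    # -> "navigate to") maps to its label, or None if it is not a core text.
    return LABEL_MAP.get(segment)

print(map_segment("navigate to"))  # K:navigate
```

Combining the mapped labels of consecutive segments then yields a label sentence such as "K:navigate POI_type", from which the tag tree is built.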
综上所述,本申请的语音识别方法,可实现如下效果:In summary, the speech recognition method of the present application can achieve the following effects:
1、提出了混合语言的导航语义理解方案,可扩展在一个国家多种语言场景;1. A mixed-language navigation semantic understanding solution is proposed, which can be extended to multiple language scenarios in a country;
2、通过基于char+mBERT的POI提取算法，可有利于提取混合语言中的兴趣点；2. The char+mBERT-based POI extraction algorithm facilitates extracting points of interest from mixed-language input;
3、通过修正处理,可减少混合语言中的POI受英文表达影响的程度;3. Through correction processing, the degree to which POIs in mixed languages are affected by English expressions can be reduced;
4、支持可选择路线偏好的语义理解。4. Support semantic understanding of selectable route preferences.
请参考图5,本申请的一种语音识别系统30,包括服务器10和车辆20。服务器10用于:接收语音请求;获取识别文本,识别文本为对语音请求进行语音识别得到;根据预设模型对识别文本进行识别,得到第一标签文本;在确定第一标签文本不满足预设条件的情况下,对第一标签文本进行修正处理并生成第二标签文本;和根据识别文本和第二标签文本生成导航结果。车辆20用于:发送语音请求;和接收导航结果。Please refer to FIG5 , a speech recognition system 30 of the present application includes a server 10 and a vehicle 20. The server 10 is used to: receive a speech request; obtain a recognition text, the recognition text is obtained by performing speech recognition on the speech request; recognize the recognition text according to a preset model to obtain a first label text; if it is determined that the first label text does not meet the preset conditions, correct the first label text and generate a second label text; and generate a navigation result according to the recognition text and the second label text. The vehicle 20 is used to: send a speech request; and receive a navigation result.
上述语音识别系统30，第一标签文本不满足预设条件，表示通过预设模型识别到的结果未能够表征正确的导航意图，从而在第一标签文本的基础上进行修正处理并生成第二标签文本，以使得第二标签文本能够表征正确的导航意图，并通过识别文本和第二标签文本生成导航结果，有利于提高对混合语言的识别效果。In the above speech recognition system 30, the first label text failing to meet the preset condition indicates that the result recognized by the preset model does not represent the correct navigation intention; correction is therefore performed on the basis of the first label text to generate a second label text that does represent the correct navigation intention, and the navigation result is generated from the recognized text and the second label text, which helps improve recognition of mixed-language input.
具体地,请结合图2和图5,在图5中,车辆20可包括车载终端21。车辆20可通过车载终端21来获取用户发出的语音请求,并将获取到的语音请求发送给服务器10。在图2中,服务器10可接收车载终端21发送的语音请求。语音请求被传输给处理器12,使得处理器12根据语音请求来最终生成导航结果。服务器10可将导航结果传输给车辆20,车辆20则可通过车载终端21来接收导航结果,并可将导航结果反馈给用户(如通过显示的方式向用户展示,或通过语音播报的方式来告知用户)。Specifically, please refer to FIG. 2 and FIG. 5. In FIG. 5, the vehicle 20 may include an on-board terminal 21. The vehicle 20 may obtain a voice request issued by a user through the on-board terminal 21, and send the obtained voice request to the server 10. In FIG. 2, the server 10 may receive a voice request sent by the on-board terminal 21. The voice request is transmitted to the processor 12, so that the processor 12 finally generates a navigation result according to the voice request. The server 10 may transmit the navigation result to the vehicle 20, and the vehicle 20 may receive the navigation result through the on-board terminal 21, and may feed back the navigation result to the user (such as displaying it to the user, or informing the user through voice broadcast).
一种计算机可读存储介质,其上存储有计算机程序。计算机程序在被处理器执行时,实现上述任一项的语音识别方法的步骤。A computer-readable storage medium stores a computer program, which, when executed by a processor, implements the steps of any of the above-mentioned speech recognition methods.
例如,在计算机程序被执行的情况下,可以实现以下步骤:For example, when the computer program is executed, the following steps may be implemented:
01:获取识别文本,识别文本为对语音请求进行语音识别得到;01: Get the recognized text, which is obtained by performing voice recognition on the voice request;
02:根据预设模型对识别文本进行识别,得到第一标签文本;02: Recognize the recognition text according to the preset model to obtain the first label text;
03:在确定第一标签文本不满足预设条件的情况下,对第一标签文本进行修正处理并生成第二标签文本;03: When it is determined that the first label text does not meet the preset condition, the first label text is corrected and a second label text is generated;
04:根据识别文本和第二标签文本生成导航结果。04: Generate navigation results based on the recognized text and the second label text.
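Steps 01 to 04 can be sketched as a pipeline. Every helper below is a hypothetical stub standing in for the real components (ASR output, preset labeling model, the preset-condition check, and the correction of step 03).

```python
def meets_preset_condition(label_text):
    # Stub: the real check inspects the text between the navigation-word
    # text and the POI text; a flag on the dict stands in for it here.
    return label_text.get("valid", False)

def correct_labels(label_text):
    # Stub for step 03: POI-combination scoring and relabeling.
    return {**label_text, "valid": True}

def generate_navigation(recognized_text, label_text):
    # Step 04: combine the recognized text with the (possibly corrected) labels.
    return {"query": recognized_text, "labels": label_text}

def run(recognized_text, first_label):
    # Steps 02-04: use the first label text directly if it passes the
    # preset condition, otherwise correct it into a second label text.
    label = first_label if meets_preset_condition(first_label) else correct_labels(first_label)
    return generate_navigation(recognized_text, label)
```

With `{"valid": False}` as the first label text, `run` routes through the correction path and returns a result built from the second label text.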
计算机可读存储介质可设置在服务器10,也可设置在其他终端,服务器10能够与其他终端进行通信来获取到相应的程序。The computer-readable storage medium may be provided in the server 10 or in other terminals. The server 10 may communicate with other terminals to obtain the corresponding program.
可以理解,计算机可读存储介质可以包括:能够携带计算机程序的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、以及软件分发介质等。计算机程序包括计算机程序代码。计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。计算机可读存储介质可以包括:能够携带计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、以及软件分发介质。It can be understood that computer-readable storage media may include: any entity or device capable of carrying a computer program, recording media, USB flash drives, mobile hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), and software distribution media, etc. A computer program includes computer program code. The computer program code may be in source code form, object code form, executable file, or some intermediate form, etc. A computer-readable storage medium may include: any entity or device capable of carrying a computer program code, recording media, USB flash drives, mobile hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), and software distribution media.
流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分,并且本申请的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能,这应被本申请的实施例所属技术领域的技术人员所理解。Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the present application includes additional implementations in which the functions may not be performed in the order shown or discussed, including performing the functions in a substantially simultaneous manner or in the reverse order depending on the functions involved, which should be understood by technicians in the technical field to which the embodiments of the present application belong.
在流程图中表示或在此以其他方式描述的逻辑和/或步骤,例如,可以被认为是用于实现逻辑功能的可执行指令的定序列表,可以具体实现在任何计算机可读介质中,以供指令执行系统、装置或设备(如基于计算机的系统、包括处理模块的系统或其他可以从指令执行系统、装置或设备取指令并执行指令的系统)使用,或结合这些指令执行系统、装置或设备而使用。The logic and/or steps represented in the flowchart or otherwise described herein, for example, can be considered as an ordered list of executable instructions for implementing logical functions, and can be specifically implemented in any computer-readable medium for use by an instruction execution system, device or apparatus (such as a computer-based system, a system including a processing module, or other system that can fetch instructions from an instruction execution system, device or apparatus and execute instructions), or used in combination with these instruction execution systems, devices or apparatuses.
此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个所述特征。在本申请的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, the features defined as "first" and "second" may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of "plurality" is two or more, unless otherwise clearly and specifically defined.
尽管已经示出和描述了本申请,本领域的普通技术人员可以理解:在不脱离本申请的原理和宗旨的情况下可以对本申请进行多种变化、修改、替换和变型,本申请的范围由权利要求及其等同物限定。Although the present application has been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and variations may be made to the present application without departing from the principles and spirit of the present application, and the scope of the present application is defined by the claims and their equivalents.

Claims (10)

  1. 一种语音识别方法,其中,所述语音识别方法包括:A speech recognition method, wherein the speech recognition method comprises:
    获取识别文本,所述识别文本为对语音请求进行语音识别得到;Acquire a recognition text, where the recognition text is obtained by performing voice recognition on the voice request;
    根据预设模型对所述识别文本进行识别,得到第一标签文本;Recognize the recognition text according to a preset model to obtain a first label text;
    在确定所述第一标签文本不满足预设条件的情况下,对所述第一标签文本进行修正处理并生成第二标签文本;When it is determined that the first label text does not meet the preset condition, the first label text is corrected and a second label text is generated;
    根据所述识别文本和所述第二标签文本生成导航结果。A navigation result is generated according to the recognized text and the second label text.
  2. 根据权利要求1所述的语音识别方法,其中,根据预设模型对所述识别文本进行识别,得到第一标签文本,包括:The speech recognition method according to claim 1, wherein the recognition text is recognized according to a preset model to obtain the first label text, comprising:
    根据所述预设模型,对所述识别文本进行嵌入处理得到嵌入文本;According to the preset model, embedding processing is performed on the recognized text to obtain embedded text;
    根据所述预设模型,对所述识别文本进行编码处理得到编码文本;According to the preset model, encoding the recognized text to obtain an encoded text;
    根据所述嵌入文本和所述编码文本生成所述第一标签文本。The first label text is generated according to the embedded text and the encoded text.
  3. 根据权利要求1所述的语音识别方法,其中,所述语音识别方法包括:The speech recognition method according to claim 1, wherein the speech recognition method comprises:
    确定所述第一标签文本中的导航词标签和兴趣点标签,所述导航词标签对应所述识别文本中的导航词文本,所述兴趣点标签对应所述识别文本中的兴趣点文本;Determine a navigation word label and an interest point label in the first label text, wherein the navigation word label corresponds to the navigation word text in the recognition text, and the interest point label corresponds to the interest point text in the recognition text;
    在所述识别文本中位于所述导航词文本和所述兴趣点文本之间的文本部分未对应所述导航词标签或所述兴趣点标签的任意一个的情况下,确定所述第一标签文本不满足预设条件。When a text portion between the navigation word text and the point of interest text in the recognized text does not correspond to any one of the navigation word label or the point of interest label, it is determined that the first label text does not meet the preset condition.
  4. 根据权利要求3所述的语音识别方法,其中,所述语音识别方法包括:The speech recognition method according to claim 3, wherein the speech recognition method comprises:
    在所述识别文本中位于所述导航词文本和所述兴趣点文本之间的文本部分对应所述导航词标签或所述兴趣点标签的情况下,根据所述识别文本和所述第一标签文本生成所述导航结果。In the case where a text portion between the navigation word text and the interest point text in the recognition text corresponds to the navigation word label or the interest point label, the navigation result is generated according to the recognition text and the first label text.
  5. 根据权利要求3所述的语音识别方法,其中,对所述第一标签文本进行修正处理并生成第二标签文本,包括:The speech recognition method according to claim 3, wherein the step of performing correction processing on the first label text and generating the second label text comprises:
    根据所述文本部分和所述兴趣点文本,得到至少两个兴趣点组合;Obtain at least two interest point combinations according to the text portion and the interest point text;
    根据预设词表和词频特征,确定每个兴趣点组合的得分,所述识别文本的每个词都具有对应的所述词频特征;Determine the score of each interest point combination according to a preset vocabulary and word frequency features, each word of the recognition text having a corresponding word frequency feature;
    将兴趣点标签修正为对应具有最高得分的一个兴趣点组合,根据所述导航词标签和修正后的兴趣点标签生成所述第二标签文本。The interest point label is modified to correspond to an interest point combination with the highest score, and the second label text is generated according to the navigation word label and the modified interest point label.
  6. 根据权利要求1所述的语音识别方法,其中,所述语音识别方法包括:The speech recognition method according to claim 1, wherein the speech recognition method comprises:
    确定所述识别文本中的至少一个兴趣点实体;determining at least one point of interest entity in the recognized text;
    将所述识别文本进行文本拆分得到多个文本片段,每个所述兴趣点实体位于一个所述文本片段内;Splitting the recognized text into a plurality of text segments, each of the interest point entities being located in one of the text segments;
    根据预设的标签树,获取每个所述文本片段所对应的标签,所述标签包括导航词标签和兴趣点标签;According to a preset tag tree, a tag corresponding to each of the text segments is obtained, wherein the tag includes a navigation word tag and a point of interest tag;
    对所述多个文本片段填充对应的标签，其中，在所述兴趣点实体的数量为至少两个的情况下，将其中一个兴趣点实体填充为兴趣点类型标签，将至少一个兴趣点实体填充为兴趣点限名标签，所述兴趣点标签包括所述兴趣点类型标签和兴趣点限名标签，所述兴趣点类型标签和兴趣点限名标签在所述标签树中具有对应的依附关系。Fill the multiple text segments with corresponding labels, wherein, when the number of the point of interest entities is at least two, one of the point of interest entities is filled with a point of interest type label and at least one point of interest entity is filled with a point of interest limited name label; the point of interest labels include the point of interest type label and the point of interest limited name label, and the point of interest type label and the point of interest limited name label have a corresponding dependency relationship in the tag tree.
  7. 根据权利要求1所述的语音识别方法,其中,所述语音识别方法包括:The speech recognition method according to claim 1, wherein the speech recognition method comprises:
    根据意图词表和标签映射表,生成核心文本和标签的映射关系;Generate the mapping relationship between core text and tags based on the intent vocabulary and tag mapping table;
    对多个所述标签进行语句组合生成多个标签句式,在所述标签句式中,不同的标签具有依附关系;Combining the plurality of labels to generate a plurality of label sentences, wherein different labels have a dependency relationship in the label sentences;
    根据所述多个标签句式构建标签树。A tag tree is constructed according to the multiple tag sentence patterns.
  8. 一种服务器,其中,所述服务器包括存储器和处理器,存储器存储有计算机程序,处理器执行所述计算机程序时,实现权利要求1至7中任一项所述的语音识别方法的步骤。A server, wherein the server comprises a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of the speech recognition method described in any one of claims 1 to 7 are implemented.
  9. 一种语音识别系统,其中,所述语音识别系统包括服务器和车辆,所述服务器用于:A speech recognition system, wherein the speech recognition system comprises a server and a vehicle, wherein the server is used for:
    接收语音请求;receiving a voice request;
    获取识别文本,所述识别文本为对所述语音请求进行语音识别得到;Acquire a recognition text, where the recognition text is obtained by performing voice recognition on the voice request;
    根据预设模型对所述识别文本进行识别,得到第一标签文本;Recognize the recognition text according to a preset model to obtain a first label text;
    在确定所述第一标签文本不满足预设条件的情况下,对所述第一标签文本进行修正处理并生成第二标签文本;和If it is determined that the first label text does not meet the preset condition, modifying the first label text and generating a second label text; and
    根据所述识别文本和所述第二标签文本生成导航结果;generating a navigation result according to the recognized text and the second label text;
    所述车辆用于:The vehicle is used for:
    发送所述语音请求;和sending the voice request; and
    接收所述导航结果。The navigation result is received.
  10. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序在被处理器执行时,实现权利要求1至7中任一项所述的语音识别方法的步骤。A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
PCT/CN2023/121063 2022-09-26 2023-09-25 Speech recognition method, and server, speech recognition system and readable storage medium WO2024067471A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211170954.8 2022-09-26
CN202211170954.8A CN115294964B (en) 2022-09-26 2022-09-26 Speech recognition method, server, speech recognition system, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2024067471A1 true WO2024067471A1 (en) 2024-04-04

Family

ID=83833671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121063 WO2024067471A1 (en) 2022-09-26 2023-09-25 Speech recognition method, and server, speech recognition system and readable storage medium

Country Status (2)

Country Link
CN (1) CN115294964B (en)
WO (1) WO2024067471A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294964B (en) * 2022-09-26 2023-02-10 广州小鹏汽车科技有限公司 Speech recognition method, server, speech recognition system, and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160093763A (en) * 2015-01-29 2016-08-09 주식회사 마이티웍스 Tagging system and method for sound data
CN110674259A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Intention understanding method and device
CN112528671A (en) * 2020-12-02 2021-03-19 北京小米松果电子有限公司 Semantic analysis method, semantic analysis device and storage medium
CN113919344A (en) * 2021-09-26 2022-01-11 腾讯科技(深圳)有限公司 Text processing method and device
CN216249242U (en) * 2021-09-26 2022-04-08 绍兴市新闻传媒中心 Intelligent matching system for place name and label of media asset
CN114548200A (en) * 2020-11-10 2022-05-27 国际商业机器公司 Multi-language intent recognition
CN114913856A (en) * 2022-07-11 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN115294964A (en) * 2022-09-26 2022-11-04 广州小鹏汽车科技有限公司 Speech recognition method, server, speech recognition system, and readable storage medium

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
JP2008064885A (en) * 2006-09-05 2008-03-21 Honda Motor Co Ltd Voice recognition device, voice recognition method and voice recognition program
CN105095186A (en) * 2015-07-28 2015-11-25 百度在线网络技术(北京)有限公司 Semantic analysis method and device
US11410646B1 (en) * 2016-09-29 2022-08-09 Amazon Technologies, Inc. Processing complex utterances for natural language understanding
CN107967250B (en) * 2016-10-19 2020-12-29 中兴通讯股份有限公司 Information processing method and device
CN111709240A (en) * 2020-05-14 2020-09-25 腾讯科技(武汉)有限公司 Entity relationship extraction method, device, equipment and storage medium thereof
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN112232074B (en) * 2020-11-13 2022-01-04 完美世界控股集团有限公司 Entity relationship extraction method and device, electronic equipment and storage medium
CN114333795A (en) * 2021-12-23 2022-04-12 科大讯飞股份有限公司 Speech recognition method and apparatus, computer readable storage medium
CN114722825A (en) * 2022-04-06 2022-07-08 平安科技(深圳)有限公司 Label generation method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN115294964A (en) 2022-11-04
CN115294964B (en) 2023-02-10

Similar Documents

Publication Publication Date Title
US9818401B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
KR101259558B1 (en) apparatus and method for detecting sentence boundaries
US9275633B2 (en) Crowd-sourcing pronunciation corrections in text-to-speech engines
KR102390940B1 (en) Context biasing for speech recognition
US20140172411A1 (en) Apparatus and method for verifying context
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
US20120310642A1 (en) Automatically creating a mapping between text data and audio data
US9196251B2 (en) Contextual conversion platform for generating prioritized replacement text for spoken content output
WO2024067471A1 (en) Speech recognition method, and server, speech recognition system and readable storage medium
US11328708B2 (en) Speech error-correction method, device and storage medium
AU2022263497A1 (en) Systems and methods for adaptive proper name entity recognition and understanding
WO2012016505A1 (en) File processing method and file processing device
US20220067292A1 (en) Guided text generation for task-oriented dialogue
JP5323652B2 (en) Similar word determination method and system
CN112183073A (en) Text error correction and completion method suitable for legal hot-line speech recognition
CN111353295A (en) Sequence labeling method and device, storage medium and computer equipment
CN111310473A (en) Text error correction method and model training method and device thereof
US8977538B2 (en) Constructing and analyzing a word graph
CN115099222A (en) Punctuation mark misuse detection and correction method, device, equipment and storage medium
WO2022078348A1 (en) Mail content extraction method and apparatus, and electronic device and storage medium
CN115545013A (en) Sound-like error correction method and device for conversation scene
CN114492396A (en) Text error correction method for automobile proper nouns and readable storage medium
CN114398876B (en) Text error correction method and device based on finite state converter
JP7483085B1 (en) Information processing system, information processing device, information processing method, and program
CN116110397B (en) Voice interaction method, server and computer readable storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23870702

Country of ref document: EP

Kind code of ref document: A1