WO2023045186A1 - Intent recognition method and apparatus, electronic device and storage medium - Google Patents

Intent recognition method and apparatus, electronic device and storage medium

Info

Publication number
WO2023045186A1
Authority
WO
WIPO (PCT)
Prior art keywords
pinyin
text
audio
metatext
features
Prior art date
Application number
PCT/CN2022/071216
Other languages
English (en)
Chinese (zh)
Inventor
孙金辉
李俊杰
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023045186A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to an intent recognition method and apparatus, an electronic device, and a storage medium.
  • intelligent voice customer service systems have been widely used in various industries, such as insurance, banking, telecommunications, and e-commerce.
  • intelligent voice customer service communicates with users through voice, using automatic speech recognition (ASR), natural language understanding (NLU), and a number of other intelligent human-computer interaction technologies; it can recognize the questions users raise in the form of voice, understand user intentions through semantic analysis, communicate with users in an anthropomorphic way, and provide users with information consultation and other related services.
  • the core of an intelligent voice customer service session is to identify the user's intention and, once that intention is clear, to give a targeted answer.
  • the main way for the intelligent voice customer service system to identify user intentions is to first convert the user's voice into text through the ASR module, and then input the transcribed text into the NLU module to identify the user's intention.
  • the common practice of the NLU module is to fine-tune a pre-trained language model with business annotation data.
  • however, the business annotation data and the pre-trained language model data are both clean text, while the online data is the ASR transcription, which contains recognition errors; because of this mismatch, the accuracy of intent recognition is low.
  • embodiments of the present application provide an intent recognition method, device, electronic device, and storage medium, which can improve the accuracy of intent recognition while ensuring recognition efficiency.
  • the embodiment of the present application provides an intention recognition method, including:
  • obtaining a pinyin vector table, wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme in all the phonemes corresponds to a pinyin vector;
  • an intent recognition device, including:
  • a conversion module, used to obtain literal text and pinyin text according to the speech to be recognized;
  • a feature extraction module, used to input the literal text into the first neural network model to obtain semantic features, to obtain a pinyin vector table (wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme corresponds to a pinyin vector), and to perform matching in the pinyin vector table according to the pinyin text to obtain phonetic features;
  • a fusion module, used to fuse the semantic features and the phonetic features to obtain fusion features; and
  • a recognition module, used to input the fusion features into the intent recognition model to obtain the intent recognition result of the speech to be recognized.
  • an embodiment of the present application provides an electronic device, including a processor connected to a memory, where the memory is used to store a computer program and the processor is used to execute the computer program stored in the memory, so that the electronic device performs the method described in the first aspect, the method comprising:
  • obtaining a pinyin vector table, wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme in all the phonemes corresponds to a pinyin vector;
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program causes the computer to execute the method according to the first aspect, the method comprising:
  • obtaining a pinyin vector table, wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme in all the phonemes corresponds to a pinyin vector;
  • the present application adds phonetic features representing pronunciation characteristics, so that intent recognition no longer depends solely on text, and improves the accuracy of intent recognition.
  • the phonetic feature vectors corresponding to each pinyin can be obtained through pre-training methods to form a specific phonetic vector table. Therefore, in actual use, the speech features can be obtained by querying the speech vector table, which will not cause additional calculations and will not affect the timeliness of the model, thus ensuring the efficiency of intent recognition.
  • FIG. 1 is a schematic diagram of a hardware structure of an intention recognition device provided in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of an intent recognition method provided in an embodiment of the present application
  • FIG. 3 is a schematic flow diagram of a speech recognition method for different dialects provided in an embodiment of the present application.
  • FIG. 4 is a schematic flow diagram of a method for obtaining literal text and pinyin text according to a standard voice provided by an embodiment of the present application;
  • FIG. 5 is a schematic flow diagram of a method for extracting features of standard speech and obtaining audio features provided in an embodiment of the present application
  • Fig. 6 is a schematic flow chart of a method for matching in a pinyin vector table to obtain phonetic features according to a pinyin text provided in an embodiment of the present application;
  • FIG. 7 is a block diagram of functional modules of an intention recognition device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • reference to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • the occurrences of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are independent or alternative embodiments mutually exclusive of other embodiments. It is understood explicitly and implicitly by those skilled in the art that the embodiments described herein can be combined with other embodiments.
  • FIG. 1 is a schematic diagram of a hardware structure of an intention recognition device provided in an embodiment of the present application.
  • the intention identification device 100 includes at least one processor 101 , a communication line 102 , a memory 103 and at least one communication interface 104 .
  • the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
  • communication line 102 may include a path that transmits information between the aforementioned components.
  • the communication interface 104 may be any device, such as a transceiver (for example, an antenna), for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • memory 103 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 103 may exist independently and be connected to the processor 101 through the communication line 102 .
  • the memory 103 can also be integrated with the processor 101 .
  • the memory 103 provided in this embodiment of the present application may generally be non-volatile.
  • the memory 103 is used to store computer-executed instructions for implementing the solutions of the present application, and the execution is controlled by the processor 101 .
  • the processor 101 is configured to execute computer-executed instructions stored in the memory 103, so as to implement the methods provided in the following embodiments of the present application.
  • computer-executed instructions may also be referred to as application code, which is not specifically limited in the present application.
  • the processor 101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 1 .
  • the intent recognition apparatus 100 may include multiple processors, for example, the processor 101 and the processor 107 in FIG. 1 .
  • Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the intent recognition device 100 may be a server, for example, an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • the intent recognition apparatus 100 may further include an output device 105 and an input device 106 .
  • Output device 105 is in communication with processor 101 and may display information in a variety of ways.
  • the output device 105 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • the input device 106 communicates with the processor 101 and can receive user input in various ways.
  • the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
  • the above-mentioned intention recognition apparatus 100 may be a general-purpose device or a special-purpose device.
  • the embodiment of the present application does not limit the type of the intention identification device 100.
  • artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • FIG. 2 is a schematic flowchart of an intent recognition method provided in an embodiment of the present application.
  • the intent recognition method includes the following steps:
  • the voice to be recognized may be voice information input by the user, and specifically may be expressed as a sentence spoken by the user.
  • in the course of language development, various dialects have been formed.
  • although a dialect is spoken only in a certain area, it still has a complete system: every dialect has its own phonetic, lexical, and grammatical structure systems, and can meet the communication needs of society in its region.
  • the various local dialects of one nation are branches of that nation's common language, and generally show the linguistic characteristic of "differences within sameness, similarities within differences"; under normal circumstances, the national common language develops on the basis of one particular dialect.
  • dialects can be divided into regional dialects and social dialects.
  • Regional dialects are variants of languages formed due to regional differences.
  • Social dialects are different social variants formed by members of society in the same region due to social differences in occupation, class, age, gender, and cultural upbringing.
  • dialects are also widely used and of many types.
  • therefore, this application proposes a speech recognition method for different dialects, which can accurately convert various dialects into literal text and pinyin text, widening the scope of application of intent recognition.
  • the method includes:
  • an acoustic model can be pre-trained, for example: a multi-layer long-short-term memory network, a multi-layer convolutional neural network, and the like.
  • the acoustic feature may include a feature sequence of the speech to be recognized, a posterior probability distribution of phonemes in the speech to be recognized, and an acoustic vector of the speech to be recognized.
  • the output of the low-level network in the acoustic model can be used as the feature sequence of the speech to be recognized, and the output of the high-level network can be used as the acoustic vector of the speech to be recognized.
  • the posterior probability distribution of the phonemes in the speech to be recognized refers to the probability that each phoneme in the speech to be recognized is recognized as a different phoneme.
  • after the acoustic features of the speech to be recognized are acquired, they can be compared with the acoustic features stored in the dialect feature database, for example by calculating similarity, so as to determine the dialect category of the speech to be recognized, as sketched below.
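  • as an illustration of this comparison step, the following minimal Python sketch classifies an utterance by cosine similarity against stored dialect reference features; the database contents, the 128-dimensional feature size, and the choice of cosine similarity are assumptions for illustration, not details fixed by the text above.

```python
import numpy as np

def classify_dialect(acoustic_vec, dialect_db):
    """Return the dialect whose stored reference feature is most similar
    to the utterance's acoustic vector (cosine similarity, one plausible
    choice of similarity measure)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(dialect_db, key=lambda name: cosine(acoustic_vec, dialect_db[name]))

# Toy database: dialect name -> reference acoustic feature vector (hypothetical).
rng = np.random.default_rng(0)
dialect_db = {"cantonese": rng.random(128), "sichuanese": rng.random(128)}
print(classify_dialect(rng.random(128), dialect_db))
```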
  • the audio transposition formula describes the conversion characteristics between a given dialect pronunciation and Mandarin pronunciation. Specifically, through the transposition formula for a dialect, dialect speech can be converted into the corresponding Mandarin speech, that is, the standard speech referred to in this application.
  • the differences and regularities of different dialects relative to Mandarin can be determined through training, such as differences and regularities in pronunciation, differences and regularities in tone, and correspondences of dialect-specific vocabulary; together these form an audio transposition formula for converting each dialect into Mandarin.
  • a method for obtaining literal text and pinyin text according to standard speech is proposed, as shown in FIG. 4 , the method includes:
  • audio feature extraction may include spectral conversion, nonlinear spectral conversion, and feature coefficient conversion.
  • the audio feature may be a mapping of the standard speech onto an auditory critical-band scale, for example, a mapping of the standard speech in the Bark domain or in the equivalent rectangular bandwidth (ERB) domain.
  • the present application proposes a method for extracting features of standard speech and obtaining audio features, as shown in Figure 5, the method includes:
  • a pinyin syllable refers to a phonetic unit pronounced as a combination of phonemes (including initials and finals); in a pinyin-based language, a syllable is pronounced by combining consonants and vowels.
  • the syllables included in the standard speech may be determined by means of table lookup, and the standard speech may be split and processed according to the syllables to obtain at least one sub-audio.
  • the at least one audio spectrum corresponds to at least one sub-audio in a one-to-one manner.
  • specifically, a fast Fourier transform (FFT) may be performed on each sub-audio to obtain its audio spectrum.
  • the at least one nonlinear spectrum corresponds one-to-one to the at least one audio spectrum.
  • the audio frequency spectrum expressed in a linear manner may be converted into a nonlinear frequency spectrum, so as to further highlight the sound features in the standard speech.
  • in some embodiments, the frequency conversion formula can be expressed by Formula 1, where F_hz is the frequency value of each audio spectrum;
  • alternatively, the frequency conversion formula can be expressed by Formula 2;
  • or the frequency conversion formula can be expressed by Formula 3.
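  • the bodies of Formulas 1-3 are not reproduced above; for reference, the standard textbook conversions from linear frequency to the Bark, ERB, and mel scales, consistent with the auditory critical-band domains mentioned earlier, take the following forms (conventional formulas, not necessarily the patent's exact Formulas 1-3):

```latex
F_{\mathrm{bark}} = 13\arctan\left(0.00076\,F_{hz}\right) + 3.5\arctan\left(\left(\frac{F_{hz}}{7500}\right)^{2}\right)
F_{\mathrm{erb}} = 21.4\,\log_{10}\left(1 + 0.00437\,F_{hz}\right)
F_{\mathrm{mel}} = 2595\,\log_{10}\left(1 + \frac{F_{hz}}{700}\right)
```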
  • 504 Perform discrete cosine transformation on each nonlinear frequency spectrum in the at least one nonlinear frequency spectrum to obtain at least one characteristic coefficient.
  • the at least one characteristic coefficient is in one-to-one correspondence with at least one nonlinear frequency spectrum.
  • specifically, discrete cosine transformation can be performed on each nonlinear spectrum, and the second to sixteenth coefficients of each transformed spectrum can be combined to obtain the characteristic coefficient corresponding to that spectrum.
  • in this way, each nonlinear spectrum yields a 15-dimensional characteristic coefficient.
  • 505 According to the correspondence between at least one feature coefficient and at least one sub-audio, arrange at least one feature coefficient according to the positional relationship of the at least one sub-audio in the standard speech to obtain audio features.
  • for example, suppose four sub-audios are obtained: sub-audio 1, sub-audio 2, sub-audio 3, and sub-audio 4; that is, the order of the sub-audios in the standard speech is sub-audio 1, sub-audio 2, sub-audio 3, sub-audio 4.
  • suppose further that the feature coefficient corresponding to sub-audio 1 is feature coefficient A, that of sub-audio 2 is feature coefficient B, that of sub-audio 3 is feature coefficient C, and that of sub-audio 4 is feature coefficient D.
  • each feature coefficient is then arranged according to the order of its corresponding sub-audio in the standard speech, that is, in the order A, B, C, D, to obtain the audio feature, as the sketch below also illustrates.
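  • putting the steps of Figure 5 together (split, FFT, nonlinear conversion, discrete cosine transformation, arrangement), the Python sketch below computes such audio features end to end. The uniform 64-point Bark grid, the log compression before the DCT, and the use of the Bark warp (rather than ERB or mel) are illustrative assumptions; only the structure of FFT-ing each sub-audio, keeping the 2nd-16th DCT coefficients, and concatenating in sub-audio order follows the text above.

```python
import numpy as np
from scipy.fftpack import dct

def bark(f_hz):
    # Standard Bark-scale warp (assumed; see the formulas quoted earlier).
    return 13 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def audio_features(sub_audios, sample_rate=16000):
    """FFT each syllable-level sub-audio, map its spectrum onto a nonlinear
    (Bark) frequency axis, DCT it, keep the 2nd-16th coefficients (15 dims),
    and concatenate the coefficients in utterance order."""
    feats = []
    for audio in sub_audios:                           # already in speech order
        spectrum = np.abs(np.fft.rfft(audio))          # audio spectrum via FFT
        freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
        bark_grid = np.linspace(bark(freqs[1]), bark(freqs[-1]), 64)
        nonlinear = np.interp(bark_grid, bark(freqs), spectrum)  # nonlinear spectrum
        coeffs = dct(np.log(nonlinear + 1e-9), norm="ortho")
        feats.append(coeffs[1:16])                     # 2nd-16th DCT coefficients
    return np.concatenate(feats)                       # arranged in sub-audio order

subs = np.array_split(np.random.randn(16000), 4)       # stand-in sub-audios
print(audio_features(subs).shape)                      # (60,) = 4 x 15
```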
  • the pinyin text is composed of at least one first pinyin metatext, where a first pinyin metatext refers to any initial or final.
  • the second neural network is a neural network composed of at least one second pinyin metatext, and each second pinyin metatext in the at least one second pinyin metatext corresponds to at least one character and one standard audio feature.
  • 403 Perform matching in the second neural network according to each of the first pinyin metatexts in the at least one first pinyin metatext in the pinyin text, to obtain at least one first character.
  • the second neural network is a neural network composed of at least one second pinyin metatext, and each second pinyin metatext in the at least one second pinyin metatext corresponds to at least one character and a standard audio feature.
  • the third pinyin metatext can be determined in the second neural network according to each first pinyin metatext.
  • the third pinyin metatext is any one of the at least one second pinyin metatext, and the third pinyin metatext is identical to the corresponding first pinyin metatext.
  • at least one character corresponding to the third pinyin metatext is obtained, the speech scene of the standard speech is determined, and a commonly used dictionary corresponding to that speech scene is acquired.
  • according to the commonly used dictionary, the first character is then determined from among the at least one character.
  • for example, the second pinyin metatext "dai" may correspond to five characters in total, among them "贷" (dài) and "待" (dài).
  • a common dictionary for this scene is acquired.
  • the commonly used characters corresponding to the second pinyin metatext "dai" in this scene are "贷" and "待". Based on this, among the five characters corresponding to the second pinyin metatext "dai", the characters "贷" (loan) and "待" (to wait) have the highest degree of fit with the bank-business processing scenario and can be used as candidate characters, as the sketch below illustrates.
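  • a minimal sketch of this scene-aware character selection follows; the candidate list and the scene dictionary contents are invented for illustration.

```python
# Hypothetical data: one pinyin syllable mapping to several candidate
# characters, and a per-scene dictionary of commonly used characters.
pinyin_to_chars = {"dai": ["贷", "带", "待", "戴", "袋"]}
scene_dictionary = {"bank_business": {"贷", "待"}}

def pick_characters(pinyin, scene):
    candidates = pinyin_to_chars[pinyin]
    common = scene_dictionary[scene]
    # Prefer characters that are common in the current speech scene;
    # fall back to all candidates if none match.
    return [c for c in candidates if c in common] or candidates

print(pick_characters("dai", "bank_business"))  # ['贷', '待']
```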
  • the first neural network may be a BERT neural network.
  • each character in the literal text can be semantically encoded by the BERT network to obtain the semantic vector R_z corresponding to each character; combining the semantic vectors R_z according to each character's position in the literal text then yields the semantic features of the text.
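  • for instance, with the Hugging Face transformers library, per-character semantic vectors R_z can be obtained as below; the bert-base-chinese checkpoint and the use of the last hidden states are assumptions, since the text only specifies "a BERT neural network".

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "我想办理贷款"  # hypothetical bank-business utterance
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One semantic vector per token, kept in text order; the sequence of these
# vectors constitutes the semantic features of the literal text.
semantic_features = outputs.last_hidden_state.squeeze(0)
print(semantic_features.shape)  # (seq_len, 768)
```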
  • the pinyin vector table includes all phonemes in standard pinyin, and each phoneme in all phonemes corresponds to a pinyin vector.
  • the general data set can be converted into a pinyin data set, wherein the pinyin data set includes all phonemes in the standard pinyin.
  • the pinyin data set can then be input into the audio prediction network Tacotron 2, and the encoder of Tacotron 2 can be used to obtain the pinyin vector corresponding to each phoneme, which is saved as the pre-trained pinyin vector table P.
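  • a sketch of building table P offline follows; encode_phoneme is a placeholder for the trained Tacotron 2 encoder, whose exact interface depends on the Tacotron 2 implementation used and is not specified here.

```python
import numpy as np

def build_pinyin_table(all_phonemes, encode_phoneme):
    """Build the pre-trained pinyin vector table P once, offline:
    one encoder pass per phoneme, cached as a plain dictionary."""
    return {ph: np.asarray(encode_phoneme(ph)) for ph in all_phonemes}

# Offline: table = build_pinyin_table(standard_pinyin_phonemes, encoder)
#          np.save("pinyin_table.npy", table, allow_pickle=True)
# Online:  phonetic features come from dictionary lookups only, so there
#          is no extra model inference cost at recognition time.
```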
  • the method includes:
  • 601 Divide the pinyin text into at least one first phoneme.
  • 602 Perform matching in the pinyin vector table according to each first phoneme in the at least one first phoneme, to obtain at least one first pinyin vector.
  • that is, each first phoneme is looked up in the saved pre-trained pinyin vector table P to obtain the phonetic feature vector R_p corresponding to that phoneme.
  • 603 Concatenate at least one first pinyin vector according to the position of each first phoneme in the at least one first phoneme in the pinyin text to obtain phonetic features.
  • the speech features can be obtained by querying the speech vector table, which will not cause additional computation and will not affect the timeliness of the model, thus ensuring the efficiency of intent recognition.
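  • concretely, steps 601-603 amount to a split, a table lookup, and a positional concatenation, as in the sketch below (the whitespace tokenizer is a simplification; the real initial/final segmentation is not detailed above).

```python
import numpy as np

def phonetic_features(pinyin_text, pinyin_table):
    phonemes = pinyin_text.split()                 # 601: first phonemes
    vectors = [pinyin_table[p] for p in phonemes]  # 602: pinyin vectors R_p
    return np.concatenate(vectors)                 # 603: positional concatenation

table = {"d": np.full(4, 0.1), "ai": np.full(4, 0.9)}  # toy table P
print(phonetic_features("d ai", table))
```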
  • in the fusion formula, W_1 and W_2 are trainable parameter matrices.
  • the fusion feature vector R can then be passed through a fully connected layer and a softmax to obtain the final intent recognition result.
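  • a plausible PyTorch sketch of this fusion head follows, assuming the fusion takes the form R = W1*Rz + W2*Rp before the fully connected layer and softmax; the exact fusion formula is not reproduced in this text, so that combination and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, sem_dim, pho_dim, fuse_dim, num_intents):
        super().__init__()
        self.w1 = nn.Linear(sem_dim, fuse_dim, bias=False)  # trainable W1
        self.w2 = nn.Linear(pho_dim, fuse_dim, bias=False)  # trainable W2
        self.fc = nn.Linear(fuse_dim, num_intents)          # fully connected layer

    def forward(self, r_z, r_p):
        r = self.w1(r_z) + self.w2(r_p)            # fused feature vector R
        return torch.softmax(self.fc(r), dim=-1)   # intent distribution

head = FusionHead(sem_dim=768, pho_dim=64, fuse_dim=256, num_intents=10)
probs = head(torch.randn(1, 768), torch.randn(1, 64))
print(probs.shape)  # (1, 10)
```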
  • in this way, the literal text and pinyin text corresponding to the speech are obtained from the speech to be recognized; the semantic features are then extracted from the literal text, the phonetic features of the speech are obtained by querying the pinyin vector table, and both are used as input for intent recognition. Therefore, compared with existing intent recognition methods, the present application adds phonetic features representing pronunciation characteristics on top of semantic features, so that intent recognition no longer depends only on text, which improves the accuracy of intent recognition.
  • the phonetic feature vectors corresponding to each pinyin can be obtained through pre-training methods to form a specific phonetic vector table. Therefore, in actual use, the speech features can be obtained by querying the speech vector table, which will not cause additional calculations and will not affect the timeliness of the model, thus ensuring the efficiency of intent recognition.
  • FIG. 7 is a block diagram of functional modules of an intention recognition device provided in an embodiment of the present application.
  • the intention identification device 700 includes:
  • a transformation module 701, used to obtain literal text and pinyin text according to the speech to be recognized;
  • a feature extraction module 702, used to input the literal text into the first neural network model to obtain semantic features, to obtain a pinyin vector table (wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme corresponds to a pinyin vector), and to perform matching in the pinyin vector table according to the pinyin text to obtain phonetic features;
  • a fusion module 703, used to fuse the semantic features and the phonetic features to obtain fusion features; and
  • a recognition module 704, configured to input the fusion features into the intent recognition model to obtain the intent recognition result of the speech to be recognized.
  • the conversion module 701 is specifically used for:
  • acquire the audio transposition formula corresponding to the dialect category, and convert the speech to be recognized into standard speech through the audio transposition formula, wherein the audio transposition formula describes the conversion characteristics between the corresponding dialect pronunciation and Mandarin pronunciation;
  • the conversion module 701 is specifically used for:
  • perform matching in the preset second neural network according to the audio features to obtain a pinyin text matching the audio features, wherein the pinyin text is composed of at least one first pinyin metatext, and a first pinyin metatext refers to any initial or final;
  • perform matching in the second neural network according to each first pinyin metatext in the at least one first pinyin metatext in the pinyin text to obtain at least one first character, wherein the at least one first character corresponds one-to-one to the at least one first pinyin metatext; and
  • arrange the at least one first character according to the order of the at least one first pinyin metatext in the pinyin text to obtain the literal text.
  • in terms of performing feature extraction on the standard speech to obtain the audio features, the conversion module 701 is specifically used for:
  • Discrete cosine transformation is performed on each nonlinear spectrum in the at least one nonlinear spectrum respectively to obtain at least one characteristic coefficient, wherein at least one characteristic coefficient corresponds to at least one nonlinear spectrum in a one-to-one manner;
  • the at least one feature coefficient is arranged according to the positional relationship of the at least one sub-audio in the standard speech to obtain audio features.
  • the second neural network is a neural network composed of at least one second pinyin metatext, and each second pinyin metatext in the at least one second pinyin metatext corresponds to at least one character and a standard audio feature.
  • the conversion module 701 is specifically used for:
  • determine, according to each first pinyin metatext, a third pinyin metatext in the second neural network, wherein the third pinyin metatext is any one of the at least one second pinyin metatext and is identical to that first pinyin metatext; and
  • determine a first character according to a commonly used dictionary.
  • the feature extraction module 702 is specifically used for:
  • the feature extraction module 702 is specifically used for:
  • perform matching in the pinyin vector table according to each first phoneme in the at least one first phoneme to obtain at least one first pinyin vector, wherein the at least one first pinyin vector corresponds one-to-one to the at least one first phoneme;
  • FIG. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, and the electronic device 800 is set in a user terminal.
  • the electronic device 800 includes a transceiver 801, a processor 802, and a memory 803, which are connected through a bus 804.
  • the memory 803 is used to store computer programs and data, and can transmit the data stored in the memory 803 to the processor 802 .
  • the processor 802 is used to read the computer program in the memory 803 to perform the following operations:
  • obtaining a pinyin vector table, wherein the pinyin vector table includes all phonemes in standard pinyin, and each phoneme in all the phonemes corresponds to a pinyin vector;
  • the processor 802 is specifically configured to perform the following operations:
  • acquire the audio transposition formula corresponding to the dialect category, and convert the speech to be recognized into standard speech through the audio transposition formula, wherein the audio transposition formula describes the conversion characteristics between the corresponding dialect pronunciation and Mandarin pronunciation;
  • the processor 802 is specifically configured to perform the following operations:
  • perform matching in the preset second neural network according to the audio features to obtain a pinyin text matching the audio features, wherein the pinyin text is composed of at least one first pinyin metatext, and a first pinyin metatext refers to any initial or final;
  • perform matching in the second neural network according to each first pinyin metatext in the at least one first pinyin metatext in the pinyin text to obtain at least one first character, wherein the at least one first character corresponds one-to-one to the at least one first pinyin metatext; and
  • arrange the at least one first character according to the order of the at least one first pinyin metatext in the pinyin text to obtain the literal text.
  • the processor 802 is specifically configured to perform the following operations:
  • Discrete cosine transformation is performed on each nonlinear spectrum in the at least one nonlinear spectrum respectively to obtain at least one characteristic coefficient, wherein at least one characteristic coefficient corresponds to at least one nonlinear spectrum in a one-to-one manner;
  • the at least one feature coefficient is arranged according to the positional relationship of the at least one sub-audio in the standard speech to obtain audio features.
  • the second neural network is a neural network composed of at least one second pinyin metatext, and each second pinyin metatext in the at least one second pinyin metatext corresponds to at least one character and a standard audio feature.
  • the processor 802 is specifically configured to perform the following operations:
  • determine, according to each first pinyin metatext, a third pinyin metatext in the second neural network, wherein the third pinyin metatext is any one of the at least one second pinyin metatext and is identical to that first pinyin metatext; and
  • determine a first character according to a commonly used dictionary.
  • the processor 802 in terms of obtaining the pinyin vector table, is specifically configured to perform the following operations:
  • the processor 802 is specifically configured to perform the following operations:
  • perform matching in the pinyin vector table according to each first phoneme in the at least one first phoneme to obtain at least one first pinyin vector, wherein the at least one first pinyin vector corresponds one-to-one to the at least one first phoneme;
  • the intent recognition device in this application may include smart phones (such as Android phones, iOS phones, or Windows Phone phones), tablet computers, palmtop computers, notebook computers, mobile Internet devices (MID), robots, wearable devices, and the like.
  • the above intent recognition devices are only examples and are not exhaustive; in practical applications, the intent recognition device may also include an intelligent vehicle-mounted terminal, a computer device, and the like.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement some or all of the steps of any intent recognition method described in the above method embodiments.
  • the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • the embodiments of the present application also provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any intent recognition method described in the above method embodiments.
  • the disclosed device may be implemented in other ways.
  • the device implementation described above is only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other ways of dividing them. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented not only in the form of hardware, but also in the form of software program modules.
  • the integrated units may be stored in a computer-readable memory if implemented in the form of a software program module and sold or used as an independent product.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a memory.
  • the software product includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the technical field of artificial intelligence. Specifically, the present invention relates to an intent recognition method and apparatus, an electronic device, and a storage medium. The intent recognition method comprises: acquiring a literal text and a pinyin text according to a speech to be recognized; inputting the literal text into a first neural network model to obtain a semantic feature; acquiring a pinyin vector table, the pinyin vector table comprising all phonemes of standard pinyin, and each phoneme among all the phonemes corresponding to a pinyin vector; performing matching in the pinyin vector table according to the pinyin text to obtain a phonetic feature; fusing the semantic feature and the phonetic feature to obtain a fused feature; and inputting the fused feature into an intent recognition model to obtain an intent recognition result of the speech to be recognized.
PCT/CN2022/071216 2021-09-23 2022-01-11 Intent recognition method and apparatus, electronic device and storage medium WO2023045186A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111119458.5 2021-09-23
CN202111119458.5A CN113836945B (zh) 2021-09-23 2021-09-23 Intent recognition method, apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023045186A1 (fr) 2023-03-30

Family

ID=78969671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071216 WO2023045186A1 (fr) 2022-01-11 Intent recognition method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113836945B (fr)
WO (1) WO2023045186A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508479B (zh) * 2020-04-16 2022-11-22 重庆农村商业银行股份有限公司 Speech recognition method, apparatus, device, and storage medium
CN113836945B (zh) * 2021-09-23 2024-04-16 平安科技(深圳)有限公司 Intent recognition method, apparatus, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150112675A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Speech recognition method and electronic apparatus
CN107248409A (zh) * 2017-05-23 2017-10-13 四川欣意迈科技有限公司 Multi-language translation method for dialect contexts
CN111986653A (zh) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 Speech intent recognition method, apparatus, and device
CN113192497A (zh) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, apparatus, device, and medium based on natural language processing
CN113836945A (zh) * 2021-09-23 2021-12-24 平安科技(深圳)有限公司 Intent recognition method, apparatus, electronic device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259625B (zh) * 2020-01-16 2023-06-27 平安科技(深圳)有限公司 Intent recognition method, apparatus, device, and computer-readable storage medium
CN111554297B (zh) * 2020-05-15 2023-08-22 阿波罗智联(北京)科技有限公司 Speech recognition method, apparatus, device, and readable storage medium
CN113284499A (zh) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice command recognition method and electronic device

Also Published As

Publication number Publication date
CN113836945A (zh) 2021-12-24
CN113836945B (zh) 2024-04-16

Similar Documents

Publication Publication Date Title
US11481562B2 (en) Method and apparatus for evaluating translation quality
US11942082B2 (en) Facilitating communications with automated assistants in multiple languages
US10176804B2 (en) Analyzing textual data
US11862143B2 (en) Systems and methods for processing speech dialogues
CN113205817B (zh) Speech semantic recognition method, system, device, and medium
US9805718B2 (en) Clarifying natural language input using targeted questions
US20140350934A1 (en) Systems and Methods for Voice Identification
WO2016048350A1 (fr) Improving automatic speech recognition of multilingual named entities
WO2023045186A1 (fr) Intent recognition method and apparatus, electronic device and storage medium
JP7335300B2 (ja) Training method, apparatus, and electronic device for a knowledge pre-training model
CN110852075B (zh) Speech transcription method and apparatus for automatically adding punctuation, and readable storage medium
US11907665B2 (en) Method and system for processing user inputs using natural language processing
US20150178274A1 (en) Speech translation apparatus and speech translation method
CN112185361B (zh) Speech recognition model training method and apparatus, electronic device, and storage medium
CN113254613A (zh) Dialogue question-answering method, apparatus, device, and storage medium
CN115394321A (zh) Audio emotion recognition method, apparatus, device, storage medium, and product
Rajendran et al. A robust syllable centric pronunciation model for Tamil text to speech synthesizer
Coto‐Solano Computational sociophonetics using automatic speech recognition
JP7349523B2 (ja) Speech recognition method, speech recognition apparatus, electronic device, storage medium, computer program product, and computer program
CN114528851A (zh) Reply sentence determination method, apparatus, electronic device, and storage medium
CN115019787A (zh) Interactive homophone disambiguation method, system, electronic device, and storage medium
Mittal et al. Speaker-independent automatic speech recognition system for mobile phone applications in Punjabi
CN111489742B (zh) Acoustic model training method, speech recognition method, apparatus, and electronic device
Celikkaya et al. A mobile assistant for Turkish
CN112988965B (zh) Text data processing method, apparatus, storage medium, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871254

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE