WO2020233363A1 - Speech recognition method, apparatus, electronic device, and storage medium - Google Patents

Speech recognition method, apparatus, electronic device, and storage medium

Info

Publication number
WO2020233363A1
WO2020233363A1 · PCT/CN2020/087471 · CN2020087471W
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
text
machine learning
user terminal
Prior art date
Application number
PCT/CN2020/087471
Other languages
English (en)
French (fr)
Inventor
Cao Xuwen (曹绪文)
Original Assignee
Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Publication of WO2020233363A1 publication Critical patent/WO2020233363A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • This application relates to the field of artificial intelligence biometrics, and in particular to a speech recognition method, apparatus, electronic device, and storage medium.
  • A commonly used speech recognition approach is to extract features from the user's speech to be recognized and then recognize that speech according to a recognition algorithm.
  • The inventor realized that in some scenes (such as a roadside), the speech captured by the voice recognition function contains not only a person's voice but also noise such as car horns. The person's voice is the valid speech to be recognized, while sounds such as car horns are noise; that noise is also recognized during speech recognition, which leads to insufficient recognition accuracy.
  • The purpose of the embodiments of the present application is to provide a speech recognition method, apparatus, computer-readable medium, and electronic device that can, at least to some extent, overcome the problem of low speech recognition accuracy in the prior art.
  • According to one aspect of the embodiments of the present application, a voice recognition method is provided, including: obtaining location information of a user terminal; determining the scene information in which the user terminal is located based on the location information; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model to obtain the text information, optimized according to the scene information, output by the first machine learning model.
  • According to another aspect of the embodiments of the present application, a speech recognition device is provided, including: a first acquisition module, configured to acquire location information of a user terminal; a determination module, configured to determine the scene information in which the user terminal is located based on the location information; a recognition module, configured to recognize the user's voice information as text information if it is detected; and a second acquisition module, configured to input the text information and the scene information into a first machine learning model and obtain the text information, optimized according to the scene information, output by the first machine learning model.
  • According to another aspect of the embodiments of the present application, an electronic device for speech recognition is provided, including: a memory configured to store executable instructions; and a processor configured to execute the executable instructions stored in the memory to perform the voice recognition method:
  • the voice recognition method includes:
  • the text information and the scene information are input into a first machine learning model, and the text information output by the first machine learning model optimized according to the scene information is obtained.
  • According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer program instructions; when the computer program instructions are executed by a computer, the computer is caused to execute the voice recognition method:
  • the voice recognition method includes:
  • the text information and the scene information are input into a first machine learning model, and the text information output by the first machine learning model optimized according to the scene information is obtained.
  • In the technical solutions provided by some embodiments of the present application, the location of the user terminal is determined based on the location information, and that location is matched against the location-to-scene correspondence table pre-stored in the database to determine the scene information in which the user terminal is located. When the user terminal detects that voice information is input, the voice information is recognized as text information, and the text information together with the scene information is input into a preset first machine learning model, which filters out the text corresponding to the noise of that scene from the text information and obtains optimized text information. The embodiments of the present application can thus quickly and accurately filter out the scene-noise text contained in the text corresponding to the speech to be recognized, thereby improving the accuracy of speech recognition.
  • Fig. 1 shows a system architecture diagram of a use environment of a voice recognition method according to an exemplary embodiment of the present application.
  • Fig. 2 shows a flowchart of a voice recognition method according to an exemplary embodiment of the present disclosure.
  • Fig. 3 shows a detailed flowchart of determining the scene information in which the user terminal is located based on the location information according to an exemplary embodiment of the present disclosure.
  • Fig. 4 shows a flowchart before recognizing the voice information of the user as text information if the voice information of the user is detected according to an exemplary embodiment of the present disclosure.
  • FIG. 5 shows a detailed flowchart of recognizing the voice information as text information if the voice information of the user is detected according to an exemplary embodiment of the present disclosure.
  • Fig. 6 shows a flowchart before inputting the text information and the scene information into a first machine learning model to obtain optimized text information output by the first machine learning model according to an exemplary embodiment of the present disclosure.
  • Fig. 7 shows a flowchart after obtaining text information optimized according to the scene information output by the first machine learning model according to an exemplary embodiment of the present disclosure.
  • Fig. 8 shows a flowchart before inputting the optimized text information and the scene information into a preset intention recognition model according to an exemplary embodiment of the present disclosure.
  • Figure 9 shows a structural block diagram of a speech recognition device according to an exemplary embodiment of the present disclosure.
  • Fig. 10 shows a diagram of an electronic device for voice recognition according to an exemplary embodiment of the present disclosure.
  • Fig. 11 shows a computer-readable storage medium diagram for speech recognition according to an exemplary embodiment of the present disclosure.
  • FIG. 1 shows a framework diagram of a use environment of a speech recognition method according to an exemplary embodiment of the present disclosure:
  • the use environment includes a user terminal 110, a server 120, and a database 130.
  • the numbers of user terminals, servers, and databases in FIG. 1 are merely illustrative. According to implementation needs, there can be any number of user terminals, servers and databases.
  • the server 120 may be a server cluster composed of multiple servers.
  • the server 120 obtains the location information of the user terminal 110 through the global positioning system (GPS) built into the user terminal 110.
  • The server 120 determines the location of the user terminal 110 based on its location information, and determines the scene information corresponding to the user terminal 110 based on that location and the correspondence between locations and scene information pre-stored in the database 130.
  • When the user terminal 110 detects a user voice input, it obtains the user's voice information.
  • the server 120 recognizes the voice information sent by the user terminal 110 as text information.
  • Based on the scene information corresponding to the user terminal and the text information corresponding to the voice information sent by the user terminal 110, the server 120 filters out the text corresponding to the noise of the scene in which the user terminal 110 is located from that text information, and outputs the optimized text information, thereby improving the accuracy of speech recognition.
  • the data processing method provided by the embodiment of the present application is generally executed by the server 120, and correspondingly, the data processing device is generally set in the server 120.
  • the terminal and the server may also have similar functions, so as to execute the data processing solution provided by the embodiments of the present application.
  • Fig. 2 shows a flowchart of a speech recognition method according to an exemplary embodiment of the present disclosure, which may include the following steps:
  • Step S200: Obtain the location information of the user terminal;
  • Step S210: Determine the scene information in which the user terminal is located based on the location information;
  • Step S220: If the user's voice information is detected, recognize the voice information as text information;
  • Step S230: Input the text information and the scene information into the first machine learning model, and obtain the text information, optimized according to the scene information, output by the first machine learning model.
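  • As a rough illustration, the S200–S230 flow can be sketched as follows. The scene table, the stub recognizer, and the "first model" below are hypothetical stand-ins invented for this sketch; none of these names or entries come from the patent itself:

```python
# Hypothetical sketch of the S200-S230 flow described above.

def speech_to_text(voice_info: str) -> str:
    # Stand-in for step S220 (a real system would run a speech
    # recognition model here; we assume the audio is pre-transcribed).
    return voice_info

def first_model(text: str, scene: str, noise_words: dict) -> str:
    # Stand-in for step S230: drop the text corresponding to the
    # noise associated with the detected scene.
    for word in noise_words.get(scene, []):
        text = text.replace(word, "")
    return " ".join(text.split())

def recognize_with_scene(location: str, voice_info: str) -> str:
    scene_table = {"Xueyuan Rd / Chuangxin Rd": "roadside"}  # S210 lookup
    noise_words = {"roadside": ["[horn]"]}                   # scene noise text
    scene = scene_table.get(location, "unknown")             # S200-S210
    text = speech_to_text(voice_info)                        # S220
    return first_model(text, scene, noise_words)             # S230

print(recognize_with_scene("Xueyuan Rd / Chuangxin Rd", "turn left [horn] ahead"))
# prints: turn left ahead
```

When the location is not in the table, the scene falls back to "unknown" and no noise text is filtered, so the recognized text passes through unchanged.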
  • In step S200, the location information of the user terminal is obtained.
  • the location information refers to information indicating a place or address.
  • The location of the user terminal is determined from the acquired location information, and from it the scene information corresponding to the user terminal and the noise associated with that scene are determined, so that the subsequent first machine learning model can filter out the text corresponding to the noise from the text information, improving the accuracy of speech recognition.
  • the location information may be positioning information obtained through a GPS module built into the user terminal device, or may be text information indicating the location input by the owner of the user terminal device when using the user terminal device.
  • In step S210, the scene information in which the user terminal is located is determined based on the location information.
  • the scene information refers to information indicating the scene in which the user is located.
  • The position of the user terminal is determined from the user's position information, and from it the scene in which the user terminal is located; based on that scene, the noise likely to appear in the voice information obtained through the user terminal device, and the text corresponding to that noise, can be determined.
  • step S210 may include:
  • Step S2101: Determine the location of the user terminal based on the location information;
  • Step S2102: Determine the scene information in which the user terminal is located based on the location of the user terminal and the correspondence between locations and scenes pre-stored in the database.
  • For example, the location information of the user terminal is acquired through its built-in GPS module. Based on the GPS location information, the location of the user terminal is determined to be "the intersection of Xueyuan Road and Chuangxin Road". Based on this location, the pre-stored map in the database indicating scene information is queried, and the scene corresponding to "the intersection of Xueyuan Road and Chuangxin Road" is determined to be "roadside".
  • the location information of the user terminal may also be input by the owner of the user terminal device when installing and placing the user terminal device.
  • In step S220, if the user's voice information is detected, the voice information is recognized as text information.
  • First, the decibel value of the user's voice information is determined, and based on that judgment it is decided whether to recognize the voice. This prevents, when the user terminal is in a noisy scene, a non-target user's voice from being recognized as text, which would further reduce the accuracy of recognizing the target user's voice. At the same time, the user terminal remains in standby when no voice within the preset decibel range is detected, reducing its power consumption and saving resources.
  • the second machine learning model can be used to recognize the speech information as text information.
  • The second machine learning model needs to be trained in advance.
  • the specific training process is shown in Figure 4 and can include the following steps:
  • Step S410: Receive a user-defined voice segment and the text content corresponding to the voice segment;
  • Step S420: Recognize the morpheme features of the voice segment, and generate training samples of the second machine learning model from the morpheme features and the corresponding text content;
  • Step S430: Train the second machine learning model with the training samples to generate a voice recognition model, so that the voice information is recognized as text information based on that model.
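  • The sample-construction half of this process (S410–S420) can be sketched as pairing a segment's extracted features with its user-supplied text. The feature extractor below is a trivial placeholder invented for the sketch, not a real morpheme or acoustic front end:

```python
def extract_morpheme_features(segment: bytes) -> list:
    # Placeholder: a real front end would compute acoustic/morpheme
    # features (e.g. MFCC frames) rather than raw byte values.
    return [b for b in segment[:8]]

def make_training_sample(segment: bytes, text: str) -> tuple:
    # S420: one (features, text) pair per user-defined voice segment.
    return (extract_morpheme_features(segment), text)
```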
  • In this way, the speech information input by the user can be recognized with improved accuracy, and the recognition can meet the user's individual needs.
  • For example, the acquired user-defined voice segment is "you", and its corresponding text information is "none". Such custom voice segments and their corresponding text are used as training samples to train the second machine learning model and generate a voice recognition model. Thereafter, when the user's voice "you" is acquired, the voice recognition model recognizes it as the corresponding text "none", meeting the user's personalized needs.
  • As another example, the acquired user-defined voice segments are strong local-dialect utterances such as "nongshalei", and the corresponding text information is the Mandarin text for the dialect, such as "What are you doing?". The segments and their corresponding text are used as training samples to train the second machine learning model and generate a speech recognition model. Thereafter, when the acquired user voice is a local-dialect utterance such as "nongshalei", it is recognized as the corresponding Mandarin text "What are you doing?", improving the accuracy of recognizing different dialects.
  • the voice information may be recognized as text information through the following process:
  • Step S2201: Obtain the customized speech recognition set selected by the user terminal;
  • Step S2202: Compare the voice information with the voice segments included in that customized recognition set;
  • Step S2203: If the voice information matches a target voice segment included in the customized recognition set, use the text information corresponding to that target segment as the recognized text information.
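  • Steps S2201–S2203 amount to a lookup in the user-selected set. The sketch below uses exact string keys for simplicity (a real system would match on acoustic similarity); the entries echo the examples in this section and are otherwise invented:

```python
# Hypothetical customized speech recognition set (S2201).
CUSTOM_SET = {
    "nongshalei": "What are you doing?",  # dialect segment -> Mandarin text
    "woverydiao": "I am strong",          # mixed-language segment
}

def recognize_custom(voice_key: str):
    # S2202/S2203: return the matched target segment's text, or None.
    return CUSTOM_SET.get(voice_key)
```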
  • In this way, the text corresponding to the acquired user voice can be determined, reducing the recognition difficulty caused by the complexity of language and the diversity of dialects within the same language, and improving the accuracy of speech recognition. It can also improve recognition accuracy for users who like to mix different languages or local dialects.
  • the acquired user voice is "woverydiao"
  • the acquired user voice "woverydiao” is a mixture of Chinese, English and local dialects. It is difficult for the current common voice recognition model to recognize the acquired user voice.
  • the target voice segment that is the same as the acquired user voice, and the target voice segment corresponds to If the text information is "I am strong”, then the text information "I am strong” corresponding to the target speech segment is recognized as the text information of the user voice "wovreydiao" that should be obtained, so as to meet the personalized needs of different users and Improve the accuracy of the user's voice recognition.
  • In step S230, the text information and the scene information are input into the first machine learning model, and the text information, optimized according to the scene information, output by the first machine learning model is obtained.
  • Optimizing the text information with the machine learning model to filter out the noise it contains can improve the accuracy of speech recognition, and a large amount of speech data can be processed quickly in a short time.
  • The optimized text information can also be obtained by determining, based on the scene information corresponding to the user terminal, the noise text corresponding to that scene, and filtering that noise text out of the text corresponding to the acquired voice information, thereby improving the accuracy of speech recognition.
  • the optimized text information may be obtained by the first machine learning model based on the text information and the scene information.
  • The first machine learning model needs to be trained in advance.
  • The specific training process, shown in Figure 6, includes the following steps:
  • Step S610: Acquire the text information and scene information corresponding to each voice information sample in a preset voice information sample set;
  • Step S620: Determine the optimized text information corresponding to each voice information sample in the set;
  • Step S630: Input the text information and scene information of each voice information sample into the first machine learning model, obtain the optimized text information it outputs, and compare that output with the optimized text information determined for that sample; if they are inconsistent, adjust the parameters of the first machine learning model until its output is consistent with the determined optimized text information.
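  • The S610–S630 loop (compare the model's output with the determined optimized text and adjust until they agree) can be pictured schematically. The `TableModel` below is a toy stand-in that simply memorizes corrections; the patent does not specify the actual learner or its parameter-update rule:

```python
class TableModel:
    # Toy stand-in for the first machine learning model: it memorizes
    # corrections instead of learning real parameters.
    def __init__(self):
        self.table = {}

    def predict(self, text, scene):
        return self.table.get((text, scene), text)

    def update(self, text, scene, target):
        self.table[(text, scene)] = target

def train_first_model(model, samples, max_rounds=100):
    # S630: keep adjusting until every output matches its target.
    for _ in range(max_rounds):
        consistent = True
        for (text, scene), target in samples:
            if model.predict(text, scene) != target:
                model.update(text, scene, target)  # "adjust the parameters"
                consistent = False
        if consistent:
            break
    return model
```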
  • the voice recognition method provided in the embodiment of the present application may further include the following steps.
  • Step S240: Input the optimized text information and the scene information into a preset intention recognition model, and obtain the intention information, contained in the voice information, output by the intention recognition model.
  • the intention information refers to the needs and purpose of voice expression.
  • The technical solution of the embodiment shown in FIG. 7 can determine, from the scene information in which the user terminal is located and the text corresponding to the voice acquired by the user terminal, the intent information corresponding to that voice, and then act according to the intent information corresponding to the voice information acquired by the user terminal.
  • the intent information can be obtained by means of an intent recognition model based on the optimized text information and the scene information.
  • the intent recognition model needs to be trained in advance.
  • The specific training process, shown in Figure 8, can include the following steps:
  • Step S810: Obtain the optimized text information and scene information corresponding to each voice information sample in the preset voice information sample set;
  • Step S820: Determine the intent information corresponding to each voice information sample;
  • Step S830: Input the optimized text information and scene information of each voice information sample into the intent recognition model, obtain the intent information it outputs, and compare that output with the intent information determined for that sample; if they are inconsistent, adjust the parameters of the intent recognition model until its output is consistent with the determined intent information.
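  • The intent-recognition step can be pictured with a toy rule table standing in for the trained model. The keywords, scenes, and intent labels below are invented for illustration; the patent leaves the model's internals unspecified:

```python
# Invented keyword rules standing in for the trained intent model.
INTENT_RULES = {
    ("navigate", "roadside"): "navigation_request",
    ("call", "roadside"): "phone_call",
}

def infer_intent(text: str, scene: str) -> str:
    # S240: derive intent from optimized text plus scene information.
    for (keyword, rule_scene), intent in INTENT_RULES.items():
        if keyword in text and rule_scene == scene:
            return intent
    return "unknown"
```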
  • the apparatus 900 for speech recognition includes: a first acquisition module 910, a determination module 920, an identification module 930, and a second acquisition module 940, wherein:
  • the first obtaining module 910 is configured to obtain location information where the user terminal is located;
  • the determining module 920 is configured to determine the scene information where the user terminal is located based on the location information;
  • the recognition module 930 is configured to recognize the voice information as text information if the voice information of the user is detected;
  • the second obtaining module 940 is configured to input the text information and the scene information into a first machine learning model, and obtain the text information output by the first machine learning model optimized according to the scene information.
  • the determining module 920 may also be configured to determine the location of the user terminal based on the location information, based on the location of the user terminal, and the correspondence between the location and the scene pre-stored in the database Relationship, determine the scene information where the user terminal is located.
  • In one embodiment, the speech recognition device further includes a second machine learning model training module, configured to receive a user-defined speech segment and the text content corresponding to it, recognize the morpheme features of the segment, generate training samples of the second machine learning model from the morpheme features and the corresponding text content, and train the second machine learning model with those samples to generate a speech recognition model, so that the voice information is recognized as text information based on that model.
  • In one embodiment, the recognition module 930 may also be configured to obtain the customized voice recognition set selected by the user terminal, compare the voice information with the voice segments it contains, and, if the voice information matches a target voice segment in that set, use the text information corresponding to the target segment as the recognized text information.
  • In one embodiment, the voice recognition device further includes a first machine learning model training module, configured to obtain the text information and scene information corresponding to each voice information sample in the preset voice information sample set, determine the optimized text information corresponding to each sample, input each sample's text and scene information into the first machine learning model, obtain the optimized text information it outputs, and compare that output with the determined optimized text information; if they are inconsistent, the parameters of the first machine learning model are adjusted until its output is consistent with the determined optimized text information.
  • In one embodiment, the speech recognition device further includes an intention recognition module, configured to input the optimized text information and the scene information corresponding to the voice information into the preset intention recognition model and obtain the intention information contained in the voice information.
  • In one embodiment, the speech recognition device further includes an intention recognition model training module, configured to obtain the optimized text information and scene information corresponding to each voice information sample in the preset sample set, determine the intent information corresponding to each sample, input each sample's optimized text and scene information into the intent recognition model, obtain the intent information it outputs, and compare that output with the determined intent information; if they are inconsistent, the parameters of the intent recognition model are adjusted until its output is consistent with the determined intent information.
  • Although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • Through the description of the above embodiments, those skilled in the art can easily understand that the exemplary embodiments described herein can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes several instructions to make a computing device (such as a personal computer, server, mobile terminal, or network device) execute the method according to the embodiments of the present disclosure.
  • the electronic device 1000 according to this embodiment of the present application will be described below with reference to FIG. 10.
  • The electronic device 1000 shown in FIG. 10 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present application.
  • the electronic device 1000 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 1000 may include but are not limited to: the aforementioned at least one processing unit 1010, the aforementioned at least one storage unit 1020, and a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010).
  • the storage unit stores program code, and the program code can be executed by the processing unit 1010, so that the processing unit 1010 executes the various exemplary methods described in the “exemplary method” section of this specification.
  • the processing unit 1010 may perform step S200 as shown in FIG. 2: obtain the location information of the user terminal; step S210: determine the scene information where the user terminal is located based on the location information; step S220: if If the user’s voice information is detected, the voice information is recognized as text information; step S230: the text information and the scene information are input into the first machine learning model, and the output of the first machine learning model is obtained based on the Text information after scene information is optimized;
  • the storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 10201 and/or a cache storage unit 10202, and may further include a read-only storage unit (ROM) 10203.
  • The storage unit 1020 may also include a program/utility 10204 having a set of (at least one) program modules 10205.
  • Such program modules 10205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • The bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 1000 may also communicate with one or more external devices 500 (such as keyboards, pointing devices, Bluetooth devices, etc.), and may also communicate with one or more devices that enable a user to interact with the electronic device 1000, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 1050. Moreover, the electronic device 1000 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 1060.
  • the network adapter 1060 communicates with the other modules of the electronic device 1000 through the bus 1030. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 1000, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, etc.
  • the exemplary embodiments described herein can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, portable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which may be a personal computer, server, terminal device, or network device, etc.) execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided.
  • the computer-readable storage medium may be a volatile storage medium or a non-volatile storage medium.
  • a program product capable of implementing the above method is stored thereon. In some possible implementations, various aspects of the present application can also be implemented in the form of a program product, which includes program code.
  • when the program product runs on a terminal device, the program code is used to enable the terminal device to execute the steps according to various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
  • a program product 1100 for implementing the above method according to an embodiment of the present application is described. It can adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, for example, a personal computer.
  • the program product of this application is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of this application can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech recognition method, apparatus (900), electronic device (1000) and storage medium, relating to the field of biometrics. The method comprises: obtaining location information of a user terminal (110) (S200); determining, based on the location information, scene information of the user terminal (110) (S210); if voice information of a user is detected, recognizing the voice information as text information (S220); and inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information (S230). The method improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, electronic device and storage medium — Technical Field
This application claims priority to the Chinese patent application No. 201910430228.7, filed with the China National Intellectual Property Administration on May 22, 2019 and entitled "Speech recognition method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of biometrics in artificial intelligence, and in particular to a speech recognition method, apparatus, electronic device and storage medium.
Background
With the development of intelligent technology, performing speech recognition and exercising control according to the recognized speech has become an important part of intelligent applications, and speech recognition technology is applied in various smart products to achieve intelligent control. As smart products multiply and the required accuracy of speech recognition keeps rising, new speech recognition techniques emerge one after another.
A commonly used speech recognition approach is to extract features of the to-be-recognized voice information uttered by a user, and then recognize that voice information according to a recognition algorithm.
Technical Problem
However, the inventor realized that in some scenes (such as on a road), the to-be-recognized speech captured when using the speech recognition function contains, besides human speech, noises such as car horns. The human speech is the valid speech to be recognized, whereas sounds such as car horns are noise; during speech recognition, the car horns and other noises are recognized together with the speech, which leads to insufficient recognition accuracy.
Technical Solution
The purpose of the embodiments of this application is to provide a speech recognition method, apparatus, computer-readable medium and electronic device, so as to overcome, at least to a certain extent, the problem of low speech recognition accuracy in the prior art.
According to a first aspect of this application, a speech recognition method is provided, comprising: obtaining location information of a user terminal; determining, based on the location information, scene information of the user terminal; if voice information of a user is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
According to a second aspect of this application, a speech recognition apparatus is provided, comprising: a first obtaining module, configured to obtain location information of a user terminal; a determining module, configured to determine, based on the location information, scene information of the user terminal; a recognition module, configured to recognize, if voice information of a user is detected, the voice information as text information; and a second obtaining module, configured to input the text information and the scene information into a first machine learning model and obtain text information output by the first machine learning model and optimized according to the scene information.
According to a third aspect of this application, an electronic device for speech recognition is provided, comprising: a memory configured to store executable instructions; and a processor configured to execute the executable instructions stored in the memory so as to perform a speech recognition method:
wherein the speech recognition method comprises:
obtaining location information of a user terminal;
determining, based on the location information, scene information of the user terminal;
if voice information of a user is detected, recognizing the voice information as text information;
inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
According to a fourth aspect of this application, a computer-readable storage medium is provided, storing computer program instructions which, when executed by a computer, cause the computer to perform a speech recognition method:
wherein the speech recognition method comprises:
obtaining location information of a user terminal;
determining, based on the location information, scene information of the user terminal;
if voice information of a user is detected, recognizing the voice information as text information;
inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
Beneficial Effects
In some embodiments of this application, the location information of the user terminal is obtained; the position of the user terminal is determined based on the location information; the scene information of the user terminal device is determined from the position of the user terminal and a position-to-scene correspondence table pre-stored in a database; when the user terminal detects voice information being input, the voice information is recognized as text information; the text information and the scene information are then fed together into a preset first machine learning model, which filters out, from the text information, the text corresponding to the noise associated with the scene information, yielding optimized text information. It can be seen that the embodiments of this application can quickly and accurately filter out, from the text corresponding to the speech to be recognized, the text corresponding to the noise of the scene, thereby improving the accuracy of speech recognition.
Brief Description of the Drawings
FIG. 1 shows a system architecture diagram of a usage environment of a speech recognition method according to an example embodiment of this application.
FIG. 2 shows a flowchart of a speech recognition method according to an example embodiment of the present disclosure.
FIG. 3 shows a detailed flowchart of determining, based on the location information, the scene information of the user terminal, according to an example embodiment of the present disclosure.
FIG. 4 shows a flowchart of steps performed before recognizing the voice information as text information when voice information of a user is detected, according to an example embodiment of the present disclosure.
FIG. 5 shows a detailed flowchart of recognizing the voice information as text information when voice information of a user is detected, according to an example embodiment of the present disclosure.
FIG. 6 shows a flowchart of steps performed before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, according to an example embodiment of the present disclosure.
FIG. 7 shows a flowchart of steps performed after obtaining the text information output by the first machine learning model and optimized according to the scene information, according to an example embodiment of the present disclosure.
FIG. 8 shows a flowchart of steps performed before inputting the optimized text information and the scene information into a preset intent recognition model, according to an example embodiment of the present disclosure.
FIG. 9 shows a structural block diagram of a speech recognition apparatus according to an example embodiment of the present disclosure.
FIG. 10 shows a diagram of an electronic device for speech recognition according to an example embodiment of the present disclosure.
FIG. 11 shows a diagram of a computer-readable storage medium for speech recognition according to an example embodiment of the present disclosure.
Embodiments of the Invention
This application relates to the technical field of speech processing in artificial intelligence. Specifically, FIG. 1 shows an architecture diagram of a usage environment of a speech recognition method according to an example embodiment of the present disclosure:
The usage environment includes a user terminal 110, a server 120 and a database 130.
It should be understood that the numbers of user terminals, servers and databases in FIG. 1 are merely illustrative. Any number of user terminals, servers and databases may be provided according to implementation needs; for example, the server 120 may be a server cluster composed of multiple servers.
In one embodiment, the server 120 obtains the location information of the user terminal 110 through a Global Positioning System (GPS) built into the user terminal 110, determines the position of the user terminal 110 based on that location information, and determines the scene information corresponding to the user terminal 110 based on the position of the user terminal 110 and the position-to-scene correspondence pre-stored in the database 130. When the user terminal 110 detects a user's voice input, it sends the captured voice information to the server 120; the server 120 recognizes the voice information sent by the user terminal 110 as text information and, based on the scene information corresponding to the user terminal and the text corresponding to the voice information, filters out from that text the text corresponding to the noise of the scene in which the user terminal 110 is located, and outputs the optimized text information, thereby improving the accuracy of speech recognition.
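The server-side flow just described (locate, map the position to a scene, recognize speech, then filter scene noise) can be sketched as a small pipeline. All function names, table contents and the "/"-separated transcript format below are illustrative assumptions, not an API defined by the application:

```python
# Hypothetical end-to-end pipeline mirroring the described server flow.
SCENE_DB = {"Xueyuan Rd & Chuangxin Rd": "roadside"}          # position -> scene
SCENE_NOISE = {"roadside": ["car horn", "engine noise"]}      # scene -> noise phrases

def locate(gps_fix):
    # Step S200: resolve a GPS fix to a named position (stubbed).
    return gps_fix["position"]

def scene_of(position):
    # Step S210: look up the scene for the position in the database.
    return SCENE_DB.get(position, "unknown")

def recognize(voice):
    # Step S220: speech-to-text (stubbed as a pass-through transcript).
    return voice["transcript"]

def optimize(text, scene):
    # Step S230: filter scene-specific noise phrases from the transcript.
    phrases = [p for p in text.split(" / ") if p not in SCENE_NOISE.get(scene, [])]
    return " / ".join(phrases)

def pipeline(gps_fix, voice):
    scene = scene_of(locate(gps_fix))
    return optimize(recognize(voice), scene)
```

For example, `pipeline({"position": "Xueyuan Rd & Chuangxin Rd"}, {"transcript": "turn on navigation / car horn"})` would drop the "car horn" phrase and keep only the valid command.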
It should be noted that the data processing method provided by the embodiments of this application is generally executed by the server 120, and accordingly the data processing apparatus is generally disposed in the server 120. However, in other embodiments of this application, the terminal may also have functions similar to those of the server and thus execute the data processing solution provided by the embodiments of this application.
FIG. 2 shows a flowchart of a speech recognition method according to an example embodiment of the present disclosure, which may include the following steps:
Step S200: obtaining location information of a user terminal;
Step S210: determining, based on the location information, scene information of the user terminal;
Step S220: if voice information of a user is detected, recognizing the voice information as text information;
Step S230: inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
In the following, each of the above steps of the speech recognition method in this example embodiment will be explained and described in detail with reference to the drawings.
In step S200: obtaining the location information of the user terminal.
In one embodiment, the location information is information indicating the place or address where the terminal is located. By obtaining the location information of the user terminal, the position of the user terminal is determined, and in turn the scene information corresponding to the user terminal and the noise corresponding to that scene information are determined, so that the subsequent first machine learning model can filter out the text corresponding to the noise from the text information, improving the accuracy of speech recognition.
In one embodiment, the location information may be positioning information obtained through a GPS module built into the user terminal device, or text information indicating a position entered by the owner of the user terminal device when using it.
Continuing with FIG. 2, in step S210: determining, based on the location information, the scene information of the user terminal.
In one embodiment, the scene information is information indicating the situation in which the terminal is located. The position of the user terminal is determined from the location information, the scene of the user terminal is determined from that position, and the possible noise in the voice information captured by the user terminal device, together with the text corresponding to that noise, is then determined based on the scene.
In one embodiment, as shown in FIG. 3, step S210 may include:
Step S2101: determining, based on the location information, the position of the user terminal;
Step S2102: determining the scene information of the user terminal based on the position of the user terminal and a position-to-scene correspondence pre-stored in a database.
In one embodiment, the location information of the user terminal is positioning information obtained through the GPS module built into the user terminal. Based on the GPS positioning information, the position of the user terminal is determined to be "the intersection of Xueyuan Road and Chuangxin Road". Based on that position, a map pre-stored in the database and annotated with scene information is queried, and the scene corresponding to the position "the intersection of Xueyuan Road and Chuangxin Road" is determined to be "roadside".
In one embodiment, the location information of the user terminal may also be entered by the owner of the user terminal device when installing and placing it.
Continuing with FIG. 2, in step S220: if voice information of a user is detected, recognizing the voice information as text information.
In one embodiment, when the user's voice information is detected, the decibel value of that voice information is determined, and whether to recognize the voice is decided based on the result of evaluating that decibel value. This prevents the speech of non-target users from being recognized as text when the user terminal is in a noisy scene, which would further reduce the recognition accuracy for the target user; it also keeps the user terminal in standby when no speech within the preset decibel range is detected, reducing the terminal's power consumption and saving resources.
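The decibel check described above can be sketched as a simple gate in front of the recognizer. The application only says "preset decibel range", so the numeric bounds here are invented for illustration:

```python
def within_preset_db(sample_db: float, low: float = 45.0, high: float = 90.0) -> bool:
    # Gate recognition on loudness: ignore audio whose level falls outside
    # the preset decibel range, so distant non-target speakers do not
    # trigger recognition and the terminal can stay in standby.
    return low <= sample_db <= high

def maybe_recognize(sample_db: float, recognize):
    # Only invoke the (expensive) recognizer when the gate passes;
    # return None to signal that the terminal remains in standby.
    return recognize() if within_preset_db(sample_db) else None
```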
In one embodiment, a second machine learning model may be used to recognize the voice information as text information. In this case, the machine learning model needs to be trained in advance. The specific training process, shown in FIG. 4, may include the following steps:
Step S410: receiving a user-defined voice segment and the text content corresponding to the voice segment;
Step S420: recognizing morpheme features of the voice segment, and generating training samples for the second machine learning model according to the morpheme features and the text content corresponding to the voice segment;
Step S430: training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on that speech recognition model.
By generating a machine learning model using user-defined voice segments and their corresponding text as training samples, the voice information input by the user is recognized, which both improves the accuracy of recognizing the user's input speech and satisfies the user's personalized needs.
In one embodiment, the obtained user-defined voice segment is "you" with the corresponding text "无" ("none"), among other user-defined segments and their corresponding texts. The obtained segments and texts are used as samples to train the second machine learning model into a speech recognition model; when the captured user speech is "you", the speech recognition model recognizes it as the corresponding text "无", satisfying the user's personalized needs.
In another embodiment, the obtained user-defined voice segment is speech with a heavy regional dialect, such as "nongshalei", and the corresponding text is the Mandarin text for that dialect expression, such as "干什么呢" ("what are you doing"). The obtained segments and texts are used as training samples to train the second machine learning model into a speech recognition model; when the captured user speech is a regional dialect expression such as "nongshalei", it is recognized as the corresponding Mandarin text such as "干什么呢", improving recognition accuracy across dialects.
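Steps S410–S430 amount to turning each (voice segment, text) pair into a (features, text) training sample. The application does not specify how morpheme features are extracted, so the feature function below is a deliberately trivial placeholder over romanized clips:

```python
def morpheme_features(clip: str) -> tuple:
    # Placeholder "morpheme feature": split the romanized clip into
    # fixed-size chunks. A real system would extract acoustic features.
    return tuple(clip[i:i + 4] for i in range(0, len(clip), 4))

def build_training_samples(pairs):
    # pairs: iterable of (user-defined voice clip, corresponding text content),
    # as received in step S410. Each pair becomes one training sample (S420).
    return [(morpheme_features(clip), text) for clip, text in pairs]
```

For instance, the dialect example above would become `build_training_samples([("nongshalei", "what are you doing")])`, and the resulting samples would then feed the training of the second machine learning model (S430).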
In one embodiment, as shown in FIG. 5, step S220 may also recognize the voice information as text information through the following flow:
Step S2201: obtaining a custom speech recognition set selected by the user terminal;
Step S2202: comparing the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal;
Step S2203: if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, using the text information corresponding to that target voice segment as the recognized text information.
By matching the captured user speech against the voice segments in the custom speech recognition set selected by the user terminal to determine the corresponding text, the difficulty of speech recognition caused by the multiplicity of languages and the variation of regional dialects within a language can be reduced, thereby improving recognition accuracy; it also improves recognition accuracy for users who like to mix different languages or regional dialects.
In one embodiment, the captured user speech is "woverydiao", a piece of voice information mixing Chinese, English and regional dialect that is difficult for common speech recognition models to recognize. In this application, the captured speech is matched against the voice segments contained in the custom speech recognition set selected by the user terminal; a target voice segment identical to the captured speech is found, whose corresponding text is "我很强" ("I am very strong"), so that text is recognized as the text of the captured user speech "woverydiao". This both satisfies different users' personalized needs and improves the accuracy of recognizing that user's speech.
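Steps S2201–S2203 reduce to a lookup against the user-selected set. Exact-match comparison on romanized clips is an assumption made for this sketch; a real system would compare acoustic similarity between segments:

```python
# Hypothetical custom speech recognition set selected by the user terminal:
# user-defined voice clip -> corresponding text (mixed-language or dialect).
CUSTOM_SET = {
    "woverydiao": "I am very strong",
    "nongshalei": "what are you doing",
}

def recognize_with_custom_set(voice_clip: str, custom_set=CUSTOM_SET):
    # S2202/S2203: compare the incoming clip against the segments in the
    # selected set; on a match, return the stored text as the recognition
    # result. None signals "no match, fall back to the generic recognizer".
    return custom_set.get(voice_clip)
```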
Continuing with FIG. 2, in step S230: inputting the text information and the scene information into the first machine learning model, and obtaining the text information output by the first machine learning model and optimized according to the scene information.
In the above embodiment, the text information is optimized by a machine learning model that filters out the noise contained in the text information, which both improves the accuracy of speech recognition and allows a large amount of data corresponding to voice information to be processed quickly in a short time.
In one embodiment, the optimized text information may also be obtained by determining, based on the scene information corresponding to the user terminal, the noise text corresponding to that scene information, and filtering that noise text out of the text corresponding to the captured voice information, thereby improving the accuracy of speech recognition.
In one embodiment, the optimized text information may be obtained from the text information and the scene information through the first machine learning model. In this case, the machine learning model needs to be trained in advance. The specific training process, shown in FIG. 6, includes the following steps:
Step S610: obtaining the text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
Step S620: determining the optimized text information corresponding to each voice information sample in the set of voice information samples;
Step S630: inputting the text information and scene information corresponding to the obtained voice information samples into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing it with the determined optimized text information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information samples.
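The adjust-until-consistent loop of step S630 can be sketched as follows. The "model" here is a deliberately trivial stand-in (a learned per-scene set of noise words) so the compare-and-adjust loop is visible; it is not the application's actual learner:

```python
def filter_noise(text_words, scene, noise_words):
    # Stand-in for the first machine learning model: drop words the model
    # currently believes are noise for this scene.
    return [w for w in text_words if w not in noise_words.get(scene, set())]

def train_noise_filter(samples):
    # samples: list of (text_words, scene, expected_optimized_words).
    noise_words = {}          # the "parameters" being adjusted
    changed = True
    while changed:            # iterate until outputs match the expected texts
        changed = False
        for words, scene, expected in samples:
            if filter_noise(words, scene, noise_words) != expected:
                # Adjust parameters: mark the extra words as scene noise.
                extra = set(words) - set(expected)
                noise_words.setdefault(scene, set()).update(extra)
                changed = True
    return noise_words

params = train_noise_filter([
    (["turn", "on", "navigation", "horn"], "roadside",
     ["turn", "on", "navigation"]),
])
```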
In one embodiment, as shown in FIG. 7, after the steps shown in FIG. 2, the speech recognition method provided by the embodiments of this application may further include the following step.
Step S240: inputting the optimized text information and the scene information into a preset intent recognition model, and obtaining the intent information contained in the voice information and output by the intent recognition model.
In one embodiment, intent information refers to the need and purpose expressed by the speech. The technical solution of the embodiment shown in FIG. 7 can determine the intent information corresponding to the voice information captured by the user terminal from the scene information of the user terminal and the text corresponding to the captured speech, and then execute the corresponding instruction according to that intent information. For example, when user A comes home at night and says "turn on the light" in room B, the smart home management system in the house would usually turn on all the lights in the house; with the technical solution of this application, the user terminal that received user A's "turn on the light" speech determines that user A's scene is "room B", determines that the intent information of the speech is "turn on the light in room B", and sends that intent information to the smart home management system so that only the light in room B is turned on, thereby improving the accuracy of interpreting the user's voice commands and improving the user experience.
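The "turn on the light" example can be sketched as an intent resolver that combines the optimized text with the scene. The rule table below is invented for illustration and stands in for the trained intent recognition model:

```python
def resolve_intent(text: str, scene: str) -> str:
    # Combine the recognized command with the scene (e.g. the room the
    # terminal is in) to produce a scene-specific intent.
    if text == "turn on the light":
        # Scope the generic command to the room the terminal is in.
        return f"turn on the light in {scene}"
    # No rule matched: report the command as unresolved.
    return f"unresolved: {text}"
```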
In one embodiment, the intent information may be obtained from the optimized text information and the scene information through an intent recognition model. In this case, the intent recognition model needs to be trained in advance. The specific training process, shown in FIG. 8, may include the following steps:
Step S810: obtaining the optimized text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
Step S820: determining the intent information corresponding to the voice information samples;
Step S830: inputting the optimized text information and scene information corresponding to the obtained voice information samples into the intent recognition model, obtaining the intent information output by the intent recognition model, and comparing it with the determined intent information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice samples.
This application also provides a speech recognition apparatus. Referring to FIG. 9, the speech recognition apparatus 900 includes: a first obtaining module 910, a determining module 920, a recognition module 930 and a second obtaining module 940, wherein:
the first obtaining module 910 is configured to obtain the location information of a user terminal;
the determining module 920 is configured to determine, based on the location information, the scene information of the user terminal;
the recognition module 930 is configured to recognize, if voice information of a user is detected, the voice information as text information;
the second obtaining module 940 is configured to input the text information and the scene information into a first machine learning model and obtain the text information output by the first machine learning model and optimized according to the scene information.
In one embodiment, the determining module 920 may further be configured to: determine the position of the user terminal based on the location information, and determine the scene information of the user terminal based on the position of the user terminal and the position-to-scene correspondence pre-stored in a database.
In one embodiment, the speech recognition apparatus further includes a second machine learning model training module, configured to receive a user-defined voice segment and the text content corresponding to the voice segment, recognize morpheme features of the voice segment, generate training samples for the second machine learning model according to the morpheme features and the text content corresponding to the voice segment, and train the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on that speech recognition model.
In one embodiment, the recognition module 930 may further be configured to: obtain a custom speech recognition set selected by the user terminal, compare the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal, and, if the voice information matches a target voice segment contained in that set, use the text information corresponding to the target voice segment as the recognized text information.
In one embodiment, the speech recognition apparatus further includes a first machine learning model training module, configured to obtain the text information and scene information corresponding to each voice information sample in a preset set of voice information samples, determine the optimized text information corresponding to each voice information sample in the set, input the text information and scene information corresponding to the obtained voice information samples into the first machine learning model, obtain the optimized text information output by the first machine learning model, and compare it with the determined optimized text information corresponding to the voice information samples; if they are inconsistent, the parameters of the first machine learning model are adjusted until the two are consistent.
In one embodiment, the speech recognition apparatus further includes an intent recognition module, configured to obtain the optimized text information and scene information corresponding to each voice information sample in a preset set of voice information samples.
In one embodiment, the speech recognition apparatus further includes an intent recognition model training module, configured to obtain the optimized text information and scene information corresponding to each voice information sample in a preset set of voice information samples, determine the intent information corresponding to the voice information samples, input the optimized text information and scene information corresponding to the obtained voice information samples into the intent recognition model, obtain the intent information output by the intent recognition model, and compare it with the determined intent information corresponding to the voice information samples; if they are inconsistent, the parameters of the intent recognition model are adjusted until the two are consistent.
The specific details of each module in the above speech recognition apparatus have been described in detail in the corresponding method, and are therefore not repeated here.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by multiple modules or units.
In addition, although the steps of the method of the present disclosure are described in a specific order in the drawings, this does not require or imply that the steps must be executed in that specific order, or that all of the illustrated steps must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps, etc.
Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, portable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, server, mobile terminal, network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of this application, an electronic device capable of implementing the above method is also provided.
Those skilled in the art can understand that various aspects of this application can be implemented as a system, method or program product. Therefore, various aspects of this application can be specifically implemented in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may be collectively referred to herein as a "circuit", "module" or "system".
The electronic device 1000 according to this embodiment of this application is described below with reference to FIG. 10. The electronic device 1000 shown in FIG. 10 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in FIG. 10, the electronic device 1000 is presented in the form of a general-purpose computing device. The components of the electronic device 1000 may include, but are not limited to: the at least one processing unit 1010 described above, the at least one storage unit 1020 described above, and a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010).
The storage unit stores program code that can be executed by the processing unit 1010, so that the processing unit 1010 executes the steps according to various exemplary embodiments of this application described in the "Exemplary Method" section of this specification. For example, the processing unit 1010 may perform step S200 as shown in FIG. 2: obtaining the location information of a user terminal; step S210: determining, based on the location information, the scene information of the user terminal; step S220: if voice information of a user is detected, recognizing the voice information as text information; step S230: inputting the text information and the scene information into a first machine learning model, and obtaining the text information output by the first machine learning model and optimized according to the scene information.
The storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 10201 and/or a cache memory unit 10202, and may further include a read-only memory (ROM) unit 10203.
The storage unit 1020 may also include a program/utility 10204 having a set of (at least one) program modules 10205; such program modules 10205 include but are not limited to: an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment.
The bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 1000 may also communicate with one or more external devices 500 (such as a keyboard, a pointing device, a Bluetooth device, etc.), may also communicate with one or more devices that enable a user to interact with the electronic device 1000, and/or communicate with any device (such as a router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 1050. Moreover, the electronic device 1000 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 1060. As shown in the figure, the network adapter 1060 communicates with the other modules of the electronic device 1000 through the bus 1030. It should be understood that although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 1000, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems, etc.
Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, portable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, server, terminal apparatus, network device, etc.) to execute the method according to the embodiments of this application.
In an exemplary embodiment of this application, a computer-readable storage medium is also provided. The computer-readable storage medium is a volatile storage medium or a non-volatile storage medium, on which a program product capable of implementing the above method of this specification is stored. In some possible implementations, various aspects of this application can also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code is used to cause the terminal device to execute the steps according to various exemplary embodiments of this application described in the "Exemplary Method" section of this specification.
Referring to FIG. 11, a program product 1100 for implementing the above method according to an embodiment of this application is described; it may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of this application is not limited thereto. In this document, a readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus or device.
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, which can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device.
The program code contained on a readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
The program code for performing the operations of this application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the method according to exemplary embodiments of this application, and are not for limiting purposes. It is easy to understand that the processing shown in the above drawings does not indicate or limit the chronological order of these processes. It is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.

Claims (20)

  1. A speech recognition method, wherein the method comprises:
    obtaining location information of a user terminal;
    determining, based on the location information, scene information of the user terminal;
    if voice information of a user is detected, recognizing the voice information as text information;
    inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
  2. The method according to claim 1, wherein determining, based on the location information, the scene information of the user terminal comprises:
    determining, based on the location information, the position of the user terminal;
    determining the scene information of the user terminal based on the position of the user terminal and a position-to-scene correspondence pre-stored in a database.
  3. The method according to claim 1, wherein before recognizing the voice information as text information, the method further comprises:
    receiving a user-defined voice segment and the text content corresponding to the voice segment;
    recognizing morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment;
    training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
  4. The method according to claim 1, wherein recognizing the voice information as text information comprises:
    obtaining a custom speech recognition set selected by the user terminal;
    comparing the voice information with voice segments contained in the custom speech recognition set selected by the user terminal;
    if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, using the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
  5. The method according to claim 1, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises:
    obtaining text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
    determining optimized text information corresponding to each voice information sample in the set of voice information samples;
    inputting the text information and scene information corresponding to the obtained voice information samples into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information samples.
  6. The method according to any one of claims 1 to 5, wherein after obtaining the text information output by the first machine learning model and optimized according to the scene information, the method further comprises:
    inputting the optimized text information and the scene information into a preset intent recognition model, and obtaining intent information contained in the voice information and output by the intent recognition model.
  7. The method according to claim 6, wherein before inputting the optimized text information and the scene information into the preset intent recognition model, the method further comprises:
    obtaining optimized text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
    determining intent information corresponding to the voice information samples;
    inputting the optimized text information and scene information corresponding to the obtained voice information samples into the intent recognition model, obtaining the intent information output by the intent recognition model, and comparing the intent information output by the intent recognition model with the determined intent information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice samples.
  8. A speech recognition apparatus, comprising:
    a first obtaining module, configured to obtain location information of a user terminal;
    a determining module, configured to determine, based on the location information, scene information of the user terminal;
    a recognition module, configured to recognize, if voice information of a user is detected, the voice information as text information;
    a second obtaining module, configured to input the text information and the scene information into a first machine learning model and obtain text information output by the first machine learning model and optimized according to the scene information.
  9. An electronic device for speech recognition, comprising:
    a memory configured to store executable instructions;
    a processor configured to execute the executable instructions stored in the memory so as to implement a speech recognition method:
    wherein the speech recognition method comprises:
    obtaining location information of a user terminal;
    determining, based on the location information, scene information of the user terminal;
    if voice information of a user is detected, recognizing the voice information as text information;
    inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
  10. The electronic device according to claim 9, wherein determining, based on the location information, the scene information of the user terminal comprises:
    determining, based on the location information, the position of the user terminal;
    determining the scene information of the user terminal based on the position of the user terminal and a position-to-scene correspondence pre-stored in a database.
  11. The electronic device according to claim 9, wherein before recognizing the voice information as text information, the method further comprises:
    receiving a user-defined voice segment and the text content corresponding to the voice segment;
    recognizing morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment;
    training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
  12. The electronic device according to claim 9, wherein recognizing the voice information as text information comprises:
    obtaining a custom speech recognition set selected by the user terminal;
    comparing the voice information with voice segments contained in the custom speech recognition set selected by the user terminal;
    if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, using the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
  13. The electronic device according to claim 9, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises:
    obtaining text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
    determining optimized text information corresponding to each voice information sample in the set of voice information samples;
    inputting the text information and scene information corresponding to the obtained voice information samples into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information samples.
  14. The electronic device according to any one of claims 9 to 13, wherein after obtaining the text information output by the first machine learning model and optimized according to the scene information, the method further comprises:
    inputting the optimized text information and the scene information into a preset intent recognition model, and obtaining intent information contained in the voice information and output by the intent recognition model.
  15. The electronic device according to claim 14, wherein before inputting the optimized text information and the scene information into the preset intent recognition model, the method further comprises:
    obtaining optimized text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
    determining intent information corresponding to the voice information samples;
    inputting the optimized text information and scene information corresponding to the obtained voice information samples into the intent recognition model, obtaining the intent information output by the intent recognition model, and comparing the intent information output by the intent recognition model with the determined intent information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice samples.
  16. A computer-readable storage medium storing computer program instructions which, when executed by a computer, cause the computer to perform a speech recognition method:
    wherein the speech recognition method comprises:
    obtaining location information of a user terminal;
    determining, based on the location information, scene information of the user terminal;
    if voice information of a user is detected, recognizing the voice information as text information;
    inputting the text information and the scene information into a first machine learning model, and obtaining text information output by the first machine learning model and optimized according to the scene information.
  17. The computer-readable storage medium according to claim 16, wherein determining, based on the location information, the scene information of the user terminal comprises:
    determining, based on the location information, the position of the user terminal;
    determining the scene information of the user terminal based on the position of the user terminal and a position-to-scene correspondence pre-stored in a database.
  18. The computer-readable storage medium according to claim 16, wherein before recognizing the voice information as text information, the method further comprises:
    receiving a user-defined voice segment and the text content corresponding to the voice segment;
    recognizing morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment;
    training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
  19. The computer-readable storage medium according to claim 16, wherein recognizing the voice information as text information comprises:
    obtaining a custom speech recognition set selected by the user terminal;
    comparing the voice information with voice segments contained in the custom speech recognition set selected by the user terminal;
    if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, using the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
  20. The computer-readable storage medium according to claim 16, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises:
    obtaining text information and scene information corresponding to each voice information sample in a preset set of voice information samples;
    determining optimized text information corresponding to each voice information sample in the set of voice information samples;
    inputting the text information and scene information corresponding to the obtained voice information samples into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information samples; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information samples.
PCT/CN2020/087471 2019-05-22 2020-04-28 Speech recognition method, apparatus, electronic device and storage medium WO2020233363A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910430228.7A CN110349575A (zh) 2019-05-22 2019-05-22 Speech recognition method, apparatus, electronic device and storage medium
CN201910430228.7 2019-05-22

Publications (1)

Publication Number Publication Date
WO2020233363A1 true WO2020233363A1 (zh) 2020-11-26

Family

ID=68173954

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087471 WO2020233363A1 (zh) 2019-05-22 2020-04-28 Speech recognition method, apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN110349575A (zh)
WO (1) WO2020233363A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457979A (zh) * 2022-09-22 2022-12-09 赵显阳 A video speech analysis, recognition and processing method and system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349575A (zh) 2019-05-22 2019-10-18 深圳壹账通智能科技有限公司 Speech recognition method, apparatus, electronic device and storage medium
CN110956955B (zh) * 2019-12-10 2022-08-05 思必驰科技股份有限公司 A voice interaction method and apparatus
CN112309387A (zh) * 2020-02-26 2021-02-02 北京字节跳动网络技术有限公司 Method and apparatus for processing information
CN112259083B (zh) * 2020-10-16 2024-02-13 北京猿力未来科技有限公司 Audio processing method and apparatus
CN112786055A (zh) * 2020-12-25 2021-05-11 北京百度网讯科技有限公司 Resource mounting method, apparatus, device, storage medium and computer program product
CN114694645A (zh) * 2020-12-31 2022-07-01 华为技术有限公司 A method and apparatus for determining user intent
CN118246910A (zh) * 2024-05-28 2024-06-25 国网山东省电力公司营销服务中心(计量中心) A conversational online payment method, system, medium, device and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240698A (zh) * 2014-09-24 2014-12-24 上海伯释信息科技有限公司 A speech recognition method
US20150120618A1 (en) * 2013-10-25 2015-04-30 Samsung Electronics Co., Ltd. Artificial intelligence audio apparatus and operation method thereof
CN105448292A (zh) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 A scene-based real-time speech recognition system and method
CN105719649A (zh) * 2016-01-19 2016-06-29 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN106683662A (zh) * 2015-11-10 2017-05-17 中国电信股份有限公司 A speech recognition method and apparatus
CN109509473A (zh) * 2019-01-28 2019-03-22 维沃移动通信有限公司 Voice control method and terminal device
CN110349575A (zh) 2019-05-22 2019-10-18 深圳壹账通智能科技有限公司 Speech recognition method, apparatus, electronic device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150120618A1 (en) * 2013-10-25 2015-04-30 Samsung Electronics Co., Ltd. Artificial intelligence audio apparatus and operation method thereof
CN105448292A (zh) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 A scene-based real-time speech recognition system and method
CN104240698A (zh) * 2014-09-24 2014-12-24 上海伯释信息科技有限公司 A speech recognition method
CN106683662A (zh) * 2015-11-10 2017-05-17 中国电信股份有限公司 A speech recognition method and apparatus
CN105719649A (zh) * 2016-01-19 2016-06-29 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN109509473A (zh) * 2019-01-28 2019-03-22 维沃移动通信有限公司 Voice control method and terminal device
CN110349575A (zh) 2019-05-22 2019-10-18 深圳壹账通智能科技有限公司 Speech recognition method, apparatus, electronic device and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457979A (zh) * 2022-09-22 2022-12-09 赵显阳 A video speech analysis, recognition and processing method and system

Also Published As

Publication number Publication date
CN110349575A (zh) 2019-10-18

Similar Documents

Publication Publication Date Title
WO2020233363A1 (zh) Speech recognition method, apparatus, electronic device and storage medium
US11887604B1 (en) Speech interface device with caching component
WO2021093449A1 (zh) Artificial intelligence-based wake-up word detection method, apparatus, device and medium
CN108520743B (zh) Voice control method for smart device, smart device and computer-readable medium
US11817094B2 (en) Automatic speech recognition with filler model processing
US11087739B1 (en) On-device learning in a hybrid speech processing system
CN108133707B (zh) Content sharing method and system
KR102201937B1 (ko) Follow-up voice query prediction
CN107134279A (zh) Voice wake-up method, apparatus, terminal and storage medium
US8532992B2 (en) System and method for standardized speech recognition infrastructure
CN114830228A (zh) Account associated with a device
CN110047481B (zh) Method and apparatus for speech recognition
US11830482B2 (en) Method and apparatus for speech interaction, and computer storage medium
WO2020024620A1 (zh) Voice information processing method, apparatus, device and storage medium
CN111916088B (zh) Speech corpus generation method, device and computer-readable storage medium
CN111462741B (zh) Voice data processing method, apparatus and storage medium
CN108055617B (zh) Microphone wake-up method, apparatus, terminal device and storage medium
JP2022037100A (ja) Voice processing method, apparatus, device and storage medium for in-vehicle equipment
CN106649253B (zh) Auxiliary control method and system based on post-verification
CN111833857B (zh) Voice processing method, apparatus and distributed system
JP7308335B2 (ja) Test method, apparatus, electronic device and storage medium for in-vehicle voice equipment
CN103514882A (zh) Speech recognition method and system
JP2022101663A (ja) Human-computer interaction method, apparatus, electronic device, storage medium and computer program
US20240013784A1 (en) Speaker recognition adaptation
CN111091819A (zh) Speech recognition apparatus and method, and voice interaction system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20810171

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20810171

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.03.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20810171

Country of ref document: EP

Kind code of ref document: A1