WO2020233363A1 - Speech recognition method, apparatus, electronic device and storage medium - Google Patents
Speech recognition method, apparatus, electronic device and storage medium
- Publication number
- WO2020233363A1 (PCT/CN2020/087471)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- voice
- text
- machine learning
- user terminal
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/26—Speech to text systems
Description
- This application relates to the field of biometrics in artificial intelligence, and in particular to a speech recognition method, apparatus, electronic device, and storage medium.
- A commonly used voice recognition approach is to extract features from the voice information to be recognized and then recognize it according to a recognition algorithm.
- The inventor realized that in some scenes (such as by the road), the voice to be recognized that is captured by the voice recognition function contains not only the person's voice but also noise such as car horns. The person's voice is the valid part of the voice to be recognized, while sounds such as car horns are noise; because this noise is also recognized during speech recognition, recognition accuracy suffers.
- The purpose of the embodiments of the present application is to provide a voice recognition method, apparatus, computer-readable medium, and electronic device, which can overcome, at least to a certain extent, the problem of low voice recognition accuracy in the prior art.
- A voice recognition method is provided, including: obtaining location information of a user terminal; determining, based on the location information, the scene information in which the user terminal is located; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model to obtain the text information output by the first machine learning model and optimized according to the scene information.
- A speech recognition apparatus is provided, including: a first acquisition module, used to acquire location information of a user terminal; a determination module, used to determine the scene information in which the user terminal is located based on the location information; a recognition module, used to recognize the user's voice information as text information if the voice information is detected; and a second acquisition module, used to input the text information and the scene information into the first machine learning model and obtain the text information output by the first machine learning model and optimized according to the scene information.
- An electronic device for speech recognition is provided, including: a memory configured to store executable instructions; and a processor configured to execute the executable instructions stored in the memory to perform a voice recognition method.
- The voice recognition method includes: obtaining location information of a user terminal; determining the scene information in which the user terminal is located based on the location information; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model to obtain the text information output by the first machine learning model and optimized according to the scene information.
- A computer-readable storage medium is provided, which stores computer program instructions; when the computer instructions are executed by a computer, the computer performs a voice recognition method.
- The voice recognition method includes: obtaining location information of a user terminal; determining the scene information in which the user terminal is located based on the location information; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model to obtain the text information output by the first machine learning model and optimized according to the scene information.
- In the technical solutions provided by some embodiments of the present application, the location of the user terminal is determined based on the location information, and the scene information in which the user terminal is located is determined by comparing that location against the location-scene correspondence table pre-stored in the database. When the user terminal detects that voice information is input, the voice information is recognized as text information, and the text information together with the scene information is input into the preset first machine learning model, which filters out the text corresponding to the noise of that scene contained in the text information to obtain optimized text information. It can be seen that the embodiments of the present application can quickly and accurately filter out the text corresponding to scene noise from the text of the speech to be recognized, thereby improving the accuracy of speech recognition.
- Fig. 1 shows a system architecture diagram of a use environment of a voice recognition method according to an exemplary embodiment of the present application.
- Fig. 2 shows a flowchart of a voice recognition method according to an exemplary embodiment of the present disclosure.
- Fig. 3 shows a detailed flowchart of determining the scene information in which the user terminal is located based on the location information according to an exemplary embodiment of the present disclosure.
- Fig. 4 shows a flowchart before recognizing the voice information of the user as text information if the voice information of the user is detected according to an exemplary embodiment of the present disclosure.
- FIG. 5 shows a detailed flowchart of recognizing the voice information as text information if the voice information of the user is detected according to an exemplary embodiment of the present disclosure.
- Fig. 6 shows a flowchart before inputting the text information and the scene information into a first machine learning model to obtain optimized text information output by the first machine learning model according to an exemplary embodiment of the present disclosure.
- Fig. 7 shows a flowchart after obtaining text information optimized according to the scene information output by the first machine learning model according to an exemplary embodiment of the present disclosure.
- Fig. 8 shows a flowchart before inputting the optimized text information and the scene information into a preset intention recognition model according to an exemplary embodiment of the present disclosure.
- Figure 9 shows a structural block diagram of a speech recognition device according to an exemplary embodiment of the present disclosure.
- Fig. 10 shows a diagram of an electronic device for voice recognition according to an exemplary embodiment of the present disclosure.
- Fig. 11 shows a computer-readable storage medium diagram for speech recognition according to an exemplary embodiment of the present disclosure.
- FIG. 1 shows a framework diagram of a use environment of a speech recognition method according to an exemplary embodiment of the present disclosure. The use environment includes a user terminal 110, a server 120, and a database 130.
- The numbers of user terminals, servers, and databases in FIG. 1 are merely illustrative; according to implementation needs, there can be any number of user terminals, servers, and databases.
- The server 120 may be a server cluster composed of multiple servers.
- The server 120 obtains the location information of the user terminal 110 through the global positioning system (GPS) module built into the user terminal 110, determines the location of the user terminal 110 based on that location information, and determines the scene information corresponding to the user terminal 110 based on its location and the correspondence between locations and scene information pre-stored in the database 130.
- When the user terminal 110 detects a user voice input, it obtains the user's voice information, and the server 120 recognizes the voice information sent by the user terminal 110 as text information. Based on the scene information corresponding to the user terminal and the text information corresponding to the voice information sent by the user terminal 110, the server 120 filters out the text corresponding to the noise of the scene in which the user terminal 110 is located and outputs the optimized text information, thereby improving the accuracy of speech recognition.
- The data processing method provided by the embodiments of the present application is generally executed by the server 120, and correspondingly the data processing device is generally set in the server 120. However, the terminal and the server may also have similar functions, so as to execute the data processing solution provided by the embodiments of the present application.
- Fig. 2 shows a flowchart of a speech recognition method according to an exemplary embodiment of the present disclosure, which may include the following steps:
- Step S200: Obtain location information of the user terminal;
- Step S210: Determine the scene information in which the user terminal is located based on the location information;
- Step S220: If the user's voice information is detected, recognize the voice information as text information;
- Step S230: Input the text information and the scene information into the first machine learning model, and obtain the text information output by the first machine learning model and optimized according to the scene information.
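- Read end to end, steps S200-S230 form a short pipeline. The Python sketch below illustrates only that flow; the scene table, the noise vocabulary, and the `speech_to_text` and `first_model` stand-ins are invented placeholders, not components defined by this application.

```python
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    scene: str
    raw_text: str
    optimized_text: str

# Invented stand-ins for the database table and the scene noise vocabulary.
SCENE_TABLE = {"Xueyuan Rd / Chuangxin Rd": "roadside"}   # location -> scene (S210)
NOISE_TEXT = {"roadside": {"honk", "beep"}}               # scene -> noise tokens

def speech_to_text(_audio: bytes) -> str:
    """Placeholder ASR step (S220); a real system would decode the audio."""
    return "honk please navigate home honk"

def first_model(text: str, scene: str) -> str:
    """Rule-based stand-in for the trained first machine learning model (S230):
    drop tokens that belong to the noise vocabulary of the current scene."""
    noise = NOISE_TEXT.get(scene, set())
    return " ".join(tok for tok in text.split() if tok not in noise)

def recognize(location: str, audio: bytes) -> RecognitionResult:
    scene = SCENE_TABLE.get(location, "unknown")          # S200/S210
    raw = speech_to_text(audio)                           # S220
    return RecognitionResult(scene, raw, first_model(raw, scene))  # S230

print(recognize("Xueyuan Rd / Chuangxin Rd", b""))
# -> optimized_text='please navigate home'
```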
- In step S200, location information of the user terminal is obtained.
- The location information refers to information indicating a place or address. By obtaining the location information of the user terminal, the location of the user terminal is determined, and from it the scene information corresponding to the user terminal and the noise corresponding to that scene, so that the subsequent first machine learning model can filter out the text corresponding to the noise from the text information and improve the accuracy of speech recognition.
- The location information may be positioning information obtained through a GPS module built into the user terminal device, or text information indicating the location entered by the owner when using the user terminal device.
- In step S210, the scene information in which the user terminal is located is determined based on the location information.
- The scene information refers to information indicating the scene in which the user is located. The position of the user terminal is determined from the user's location information, and from it the scene in which the user terminal is located; based on that scene, the noise likely to be present in voice information captured by the user terminal device, and the text corresponding to that noise, can be determined.
- In some embodiments, step S210 may include:
- Step S2101: Determine the location of the user terminal based on the location information;
- Step S2102: Determine the scene information in which the user terminal is located based on the location of the user terminal and the correspondence between locations and scenes pre-stored in the database.
- For example, the location information of the user terminal is acquired through its built-in GPS module. Based on the GPS location information, the location of the user terminal is determined to be "the intersection of Xueyuan Road and Chuangxin Road"; based on that location, the pre-stored map in the database indicating scene information is queried, and the scene corresponding to "the intersection of Xueyuan Road and Chuangxin Road" is determined to be "by the road".
- The location information of the user terminal may also be entered by the owner when installing and placing the user terminal device.
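- A minimal sketch of this two-stage lookup (steps S2101-S2102), assuming GPS coordinates are first snapped to the nearest named location and the location-scene correspondence table is a plain dictionary; all coordinates and table entries are invented for illustration.

```python
import math

# Hypothetical location -> scene correspondence table, as pre-stored in the database.
LOCATION_SCENE = {
    "Xueyuan Rd / Chuangxin Rd intersection": "by the road",
    "Central Library": "indoor quiet",
}

# Invented reference coordinates (lat, lon) for each named location.
LOCATION_COORDS = {
    "Xueyuan Rd / Chuangxin Rd intersection": (39.9910, 116.3280),
    "Central Library": (39.9930, 116.3330),
}

def nearest_location(lat: float, lon: float) -> str:
    """Step S2101: resolve a GPS fix to the nearest named location."""
    return min(LOCATION_COORDS,
               key=lambda name: math.dist(LOCATION_COORDS[name], (lat, lon)))

def scene_for(lat: float, lon: float) -> str:
    """Step S2102: look the location up in the location-scene table."""
    return LOCATION_SCENE.get(nearest_location(lat, lon), "unknown")

print(scene_for(39.9912, 116.3282))  # -> "by the road"
```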
- In step S220, if the user's voice information is detected, the voice information is recognized as text information.
- In some embodiments, the decibel value corresponding to the user's voice information is determined, and based on that decibel value it is decided whether to recognize the voice. This prevents a non-target user's voice from being recognized as text when the user terminal is in a noisy scene, which would further reduce the accuracy of recognizing the target user's voice. At the same time, when the user terminal does not detect voice within the preset decibel range, it remains in the standby state, reducing power consumption to save resources.
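- The decibel check could be gated as in the sketch below, which estimates a level from raw audio samples and forwards audio to recognition only inside a preset band; the thresholds and calibration offset are illustrative assumptions, not values given by the application.

```python
import math

DB_MIN, DB_MAX = 45.0, 90.0  # assumed "preset decibel range" for a target speaker

def estimate_db(samples: list[float]) -> float:
    """Rough level estimate from normalized samples in [-1, 1]; the +90 offset
    is an illustrative calibration, not a value given by the application."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9)) + 90

def should_recognize(samples: list[float]) -> bool:
    """Only pass audio on to recognition when it falls in the preset band;
    otherwise the terminal stays in standby."""
    return DB_MIN <= estimate_db(samples) <= DB_MAX

print(should_recognize([0.1, -0.08, 0.09] * 100))  # speech-like level -> True
print(should_recognize([0.0005] * 300))            # near silence -> False
```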
- In some embodiments, a second machine learning model can be used to recognize the speech information as text information. This machine learning model needs to be trained in advance; the specific training process, shown in Figure 4, can include the following steps:
- Step S410: Receive a user-defined voice segment and the text content corresponding to the voice segment;
- Step S420: Recognize the morpheme features of the voice segment, and generate training samples for the second machine learning model according to the morpheme features and the text content corresponding to the voice segment;
- Step S430: Train the second machine learning model with the training samples to generate a voice recognition model, so as to recognize the voice information as text information based on the voice recognition model.
- In this way, the speech information input by the user can be recognized, which improves the accuracy of speech recognition and meets the individual needs of the user.
- For example, the acquired user-defined voice segment is "you" and the corresponding text information is "none". The custom voice segment and its corresponding text information are used as samples to train the second machine learning model and generate a voice recognition model; the voice recognition model will then recognize "you" as the corresponding text information "none", meeting the user's personalized needs.
- As another example, the user-defined voice segments are voices in a strong local dialect, such as "nongshalei", and the corresponding text information is the Mandarin text corresponding to the dialect, such as "What are you doing?". The segments and their corresponding text are used as training samples to train the second machine learning model and generate a speech recognition model. When the acquired user voice is the local dialect "nongshalei", it will be recognized as the corresponding Mandarin text "What are you doing?", improving recognition accuracy across different dialects.
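- Steps S410-S430 amount to building (feature, text) pairs from user-provided segments and fitting a model to them. The sketch below fakes the morpheme-feature step with a trivial featurizer and uses an exact-lookup table as the "second machine learning model"; both are stand-ins for whatever feature extractor and model an implementation would actually use.

```python
def morpheme_features(voice_segment: str) -> tuple:
    """Stand-in for real acoustic/morpheme feature extraction (S420)."""
    return tuple(sorted(voice_segment))

class SecondModel:
    """Toy 'second machine learning model': exact-feature lookup (S430)."""
    def __init__(self) -> None:
        self.samples: dict[tuple, str] = {}

    def train(self, voice_segment: str, text_content: str) -> None:
        # One (feature, text) training sample per user-defined segment (S410/S420).
        self.samples[morpheme_features(voice_segment)] = text_content

    def transcribe(self, voice_segment: str) -> str:
        return self.samples.get(morpheme_features(voice_segment), "<unrecognized>")

model = SecondModel()
model.train("nongshalei", "What are you doing?")  # user-defined segment (S410)
print(model.transcribe("nongshalei"))             # -> "What are you doing?"
```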
- In other embodiments, the voice information may be recognized as text information through the following process (shown in FIG. 5):
- Step S2201: Obtain the custom speech recognition set selected by the user terminal;
- Step S2202: Compare the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal;
- Step S2203: If the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, take the text information corresponding to that target voice segment as the recognized text information.
- In this way, the text information corresponding to the acquired user voice can be determined. This reduces the difficulty that the diversity of dialects within the same language adds to speech recognition and thus improves its accuracy, and it also improves recognition accuracy for users who like to mix different languages or local dialects.
- For example, the acquired user voice is "woverydiao", a mixture of Chinese, English, and local dialect that is difficult for current common voice recognition models to recognize. If the custom speech recognition set contains a target voice segment identical to the acquired user voice, and the text information corresponding to that target segment is "I am strong", then "I am strong" is taken as the recognized text for the user voice "woverydiao", meeting the personalized needs of different users and improving the accuracy of the user's voice recognition.
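- Steps S2201-S2203 describe a match-then-substitute loop over the user's selected recognition set. A sketch follows, with a string-similarity function standing in for real acoustic matching; the set contents and the threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical custom speech recognition set selected by the user terminal,
# mapping a voice fragment to its text information (S2201).
CUSTOM_SET = {"woverydiao": "I am strong", "nongshalei": "What are you doing?"}

MATCH_THRESHOLD = 0.8  # assumed similarity cut-off, not specified by the application

def match_custom(voice: str) -> Optional[str]:
    """Steps S2202/S2203: compare the voice against each fragment in the set
    and return the matching target fragment's text information, if any."""
    best_text, best_score = None, 0.0
    for fragment, text in CUSTOM_SET.items():
        score = SequenceMatcher(None, voice, fragment).ratio()
        if score > best_score:
            best_text, best_score = text, score
    return best_text if best_score >= MATCH_THRESHOLD else None

print(match_custom("woverydiao"))  # -> "I am strong"
```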
- In step S230, the text information and the scene information are input into the first machine learning model, and the text information output by the first machine learning model and optimized according to the scene information is obtained.
- Optimizing the text information with the machine learning model to filter out the noise contained in it improves the accuracy of speech recognition and makes it possible to process a large amount of speech data in a short time. The optimized text information can also be obtained by determining the noise text corresponding to the scene information of the user terminal and filtering that noise text out of the text corresponding to the acquired voice information, thereby improving the accuracy of speech recognition.
- In some embodiments, the optimized text information may be obtained by the first machine learning model based on the text information and the scene information. This machine learning model needs to be trained in advance; the specific training process, shown in Figure 6, includes the following steps:
- Step S610: Acquire the text information and scene information corresponding to each voice information sample in a preset voice information sample set;
- Step S620: Determine the optimized text information corresponding to each voice information sample in the voice information sample set;
- Step S630: Input the text information and scene information corresponding to the obtained voice information sample into the first machine learning model, obtain the optimized text information output by the first machine learning model, and compare it with the determined optimized text information corresponding to the voice information sample. If they are inconsistent, adjust the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information sample.
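- The compare-and-adjust procedure of steps S610-S630 is an ordinary supervised training loop. The sketch below phrases it against a generic model exposing `predict` and `update`; those method names, the memorizing toy model, and the convergence test are assumptions, since the application does not fix a model family.

```python
from typing import Protocol

class Trainable(Protocol):
    def predict(self, text: str, scene: str) -> str: ...
    def update(self, text: str, scene: str, target: str) -> None: ...

def train_first_model(model: Trainable,
                      samples: list[tuple[str, str, str]],  # (text, scene, optimized text)
                      max_rounds: int = 100) -> None:
    """S610-S630: compare outputs with the reference optimized text and
    adjust the model until every output is consistent."""
    for _ in range(max_rounds):
        consistent = True
        for text, scene, target in samples:
            if model.predict(text, scene) != target:   # compare (S630)
                model.update(text, scene, target)      # adjust parameters
                consistent = False
        if consistent:
            break

class MemorizingModel:
    """Degenerate 'model' that memorizes sample pairs; enough to run the loop."""
    def __init__(self) -> None:
        self.table: dict[tuple[str, str], str] = {}
    def predict(self, text: str, scene: str) -> str:
        return self.table.get((text, scene), text)
    def update(self, text: str, scene: str, target: str) -> None:
        self.table[(text, scene)] = target

m = MemorizingModel()
train_first_model(m, [("honk turn left honk", "by the road", "turn left")])
print(m.predict("honk turn left honk", "by the road"))  # -> "turn left"
```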
- In some embodiments, as shown in FIG. 7, the voice recognition method provided in the embodiments of the present application may further include the following step:
- Step S240: Input the optimized text information and the scene information into a preset intention recognition model, and obtain the intention information contained in the voice information output by the intention recognition model.
- The intention information refers to the needs and purpose expressed by the voice.
- Through the technical solution of the embodiment shown in FIG. 7, the intent information corresponding to the voice information acquired by the user terminal can be determined from the scene information in which the user terminal is located and the text information corresponding to that voice, and corresponding operations can then be executed according to the intent information.
- In some embodiments, the intent information can be obtained by an intent recognition model based on the optimized text information and the scene information. The intent recognition model needs to be trained in advance; the specific training process, shown in Figure 8, can include the following steps:
- Step S810: Obtain the optimized text information and scene information corresponding to each voice information sample in a preset voice information sample set;
- Step S820: Determine the intent information corresponding to the voice information sample;
- Step S830: Input the optimized text information and scene information corresponding to the obtained voice information sample into the intent recognition model, obtain the intent information output by the intent recognition model, and compare it with the determined intent information corresponding to the voice information sample. If they are inconsistent, adjust the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice sample.
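- The intent recognition model consumes the optimized text together with the scene and returns intent information, and its training in steps S810-S830 follows the same compare-and-adjust pattern as the first machine learning model above. Below is a keyword-rule stand-in for a trained intent model; the rules and intent labels are invented for illustration.

```python
# Hypothetical (keywords, scene, intent) rules standing in for the trained model.
INTENT_RULES = [
    ({"navigate", "home"}, "by the road", "start_navigation"),
    ({"call"}, "by the road", "place_call"),
]

def recognize_intent(optimized_text: str, scene: str) -> str:
    """Step S240: derive intent information from optimized text and scene."""
    tokens = set(optimized_text.lower().split())
    for keywords, rule_scene, intent in INTENT_RULES:
        if rule_scene == scene and keywords <= tokens:
            return intent
    return "unknown_intent"

print(recognize_intent("please navigate home", "by the road"))  # -> "start_navigation"
```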
- As shown in Figure 9, the apparatus 900 for speech recognition includes a first acquisition module 910, a determination module 920, a recognition module 930, and a second acquisition module 940, wherein:
- The first acquisition module 910 is configured to obtain location information of the user terminal;
- The determination module 920 is configured to determine the scene information in which the user terminal is located based on the location information;
- The recognition module 930 is configured to recognize the user's voice information as text information if the voice information is detected;
- The second acquisition module 940 is configured to input the text information and the scene information into a first machine learning model and obtain the text information output by the first machine learning model and optimized according to the scene information.
- In some embodiments, the determination module 920 may also be configured to determine the location of the user terminal based on the location information, and to determine the scene information in which the user terminal is located based on that location and the correspondence between locations and scenes pre-stored in the database.
- In some embodiments, the speech recognition apparatus further includes a second machine learning model training module, configured to receive a user-defined speech segment and the text content corresponding to the speech segment, recognize the morpheme features of the speech segment, generate training samples for a second machine learning model based on the morpheme features and the text content corresponding to the speech segment, and train the second machine learning model with the training samples to generate a speech recognition model, so that the speech recognition model recognizes the voice information as text information.
- In some embodiments, the recognition module 930 may also be configured to obtain the custom voice recognition set selected by the user terminal, compare the voice information with the voice segments contained in that set, and, if the voice information matches a target voice segment contained in the custom voice recognition set selected by the user terminal, take the text information corresponding to that target voice segment as the recognized text information.
- In some embodiments, the voice recognition apparatus further includes a first machine learning model training module, configured to: obtain the text information and scene information corresponding to each voice information sample in a preset voice information sample set; determine the optimized text information corresponding to each voice information sample in the sample set; input the text information and scene information corresponding to the obtained voice information sample into the first machine learning model and obtain the optimized text information output by it; and compare the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information sample, adjusting the parameters of the first machine learning model if they are inconsistent, until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information sample.
- In some embodiments, the speech recognition apparatus further includes an intention recognition module, configured to input the optimized text information and the scene information into a preset intention recognition model and obtain the intention information contained in the voice information output by the intention recognition model.
- In some embodiments, the speech recognition apparatus further includes an intention recognition model training module, configured to: obtain the optimized text information and scene information corresponding to each voice information sample in a preset voice information sample set; determine the intent information corresponding to the voice information sample; input the optimized text information and scene information corresponding to the obtained voice information sample into the intent recognition model and obtain the intent information output by it; and compare the intent information output by the intent recognition model with the determined intent information corresponding to the voice information sample, adjusting the parameters of the intent recognition model if they are inconsistent, until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice sample.
- Although modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. The features and functions of two or more modules or units described above may be embodied in one module or unit, and conversely the features and functions of one module or unit described above may be further divided into multiple modules or units.
- The exemplary embodiments described herein can be implemented by software, or by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and includes several instructions to make a computing device (a personal computer, server, mobile terminal, or network device, etc.) execute the method according to the embodiments of the present disclosure.
- The electronic device 1000 according to this embodiment of the present application will be described below with reference to FIG. 10. The electronic device 1000 shown in FIG. 10 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
- The electronic device 1000 is represented in the form of a general-purpose computing device. Its components may include, but are not limited to: the aforementioned at least one processing unit 1010, the aforementioned at least one storage unit 1020, and a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010).
- The storage unit stores program code, and the program code can be executed by the processing unit 1010 so that the processing unit 1010 executes the various exemplary methods described in the "Exemplary Method" section of this specification. For example, the processing unit 1010 may perform step S200 shown in FIG. 2: obtain location information of the user terminal; step S210: determine the scene information in which the user terminal is located based on the location information; step S220: if the user's voice information is detected, recognize the voice information as text information; and step S230: input the text information and the scene information into the first machine learning model and obtain the text information output by the first machine learning model and optimized according to the scene information.
- The storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 10201 and/or a cache storage unit 10202, and may further include a read-only storage unit (ROM) 10203.
- The storage unit 1020 may also include a program/utility tool 10204 having a set of (at least one) program modules 10205. Such program modules 10205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
- The bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
- The electronic device 1000 may also communicate with one or more external devices 500 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 1000, and/or with any device (such as a router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 1050. Moreover, the electronic device 1000 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 1060.
- The network adapter 1060 communicates with other modules of the electronic device 1000 through the bus 1030. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 1000, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
- The exemplary embodiments described herein can be implemented by software, or by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and includes several instructions to make a computing device (a personal computer, server, terminal device, or network device, etc.) execute the method according to the embodiments of the present application.
- A computer-readable storage medium is also provided; it may be a volatile storage medium or a non-volatile storage medium.
- In some possible implementations, the various aspects of the present application can also be implemented in the form of a program product, which includes program code. When the program product runs on a terminal device, the program code is used to enable the terminal device to execute the steps according to the various exemplary embodiments of the present application described in the "Exemplary Method" section of this specification.
- A program product 1100 for implementing the above method according to an embodiment of the present application may adopt a portable compact disk read-only memory (CD-ROM), includes program code, and can run on a terminal device, for example a personal computer.
- However, the program product of this application is not limited thereto. In this document, a readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by, or in combination with, an instruction execution system, apparatus, or device.
- The program product can use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and may send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
- The program code used to perform the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. The remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Telephonic Communication Services (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (20)
- 1. A speech recognition method, wherein the method comprises: obtaining location information of a user terminal; determining, based on the location information, the scene information in which the user terminal is located; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model, and obtaining the text information output by the first machine learning model and optimized according to the scene information.
- 2. The method according to claim 1, wherein determining the scene information in which the user terminal is located based on the location information comprises: determining the location of the user terminal based on the location information; and determining the scene information in which the user terminal is located based on the location of the user terminal and the correspondence between locations and scenes pre-stored in a database.
- 3. The method according to claim 1, wherein before recognizing the voice information as text information, the method further comprises: receiving a user-defined voice segment and the text content corresponding to the voice segment; recognizing the morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment; and training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
- 4. The method according to claim 1, wherein recognizing the voice information as text information comprises: obtaining the custom speech recognition set selected by the user terminal; comparing the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal; and if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, taking the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
- 5. The method according to claim 1, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises: obtaining the text information and scene information corresponding to each voice information sample in a preset voice information sample set; determining the optimized text information corresponding to each voice information sample in the voice information sample set; and inputting the obtained text information and scene information corresponding to the voice information sample into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information sample; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information sample.
- 6. The method according to any one of claims 1 to 5, wherein after obtaining the text information output by the first machine learning model and optimized according to the scene information, the method further comprises: inputting the optimized text information and the scene information into a preset intent recognition model, and obtaining the intent information contained in the voice information output by the intent recognition model.
- 7. The method according to claim 6, wherein before inputting the optimized text information and the scene information into the preset intent recognition model, the method further comprises: obtaining the optimized text information and scene information corresponding to each voice information sample in a preset voice information sample set; determining the intent information corresponding to the voice information sample; and inputting the obtained optimized text information and scene information corresponding to the voice information sample into the intent recognition model, obtaining the intent information output by the intent recognition model, and comparing the intent information output by the intent recognition model with the determined intent information corresponding to the voice information sample; if they are inconsistent, adjusting the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice sample.
- 8. A speech recognition apparatus, comprising: a first acquisition module, used to obtain location information of a user terminal; a determination module, used to determine the scene information in which the user terminal is located based on the location information; a recognition module, used to recognize the user's voice information as text information if the voice information is detected; and a second acquisition module, used to input the text information and the scene information into a first machine learning model and obtain the text information output by the first machine learning model and optimized according to the scene information.
- 9. An electronic device for speech recognition, comprising: a memory configured to store executable instructions; and a processor configured to execute the executable instructions stored in the memory to implement a speech recognition method, wherein the speech recognition method comprises: obtaining location information of a user terminal; determining, based on the location information, the scene information in which the user terminal is located; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model, and obtaining the text information output by the first machine learning model and optimized according to the scene information.
- 10. The electronic device according to claim 9, wherein determining the scene information in which the user terminal is located based on the location information comprises: determining the location of the user terminal based on the location information; and determining the scene information in which the user terminal is located based on the location of the user terminal and the correspondence between locations and scenes pre-stored in a database.
- 11. The electronic device according to claim 9, wherein before recognizing the voice information as text information, the method further comprises: receiving a user-defined voice segment and the text content corresponding to the voice segment; recognizing the morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment; and training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
- 12. The electronic device according to claim 9, wherein recognizing the voice information as text information comprises: obtaining the custom speech recognition set selected by the user terminal; comparing the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal; and if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, taking the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
- 13. The electronic device according to claim 9, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises: obtaining the text information and scene information corresponding to each voice information sample in a preset voice information sample set; determining the optimized text information corresponding to each voice information sample in the voice information sample set; and inputting the obtained text information and scene information corresponding to the voice information sample into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information sample; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information sample.
- 14. The electronic device according to any one of claims 9 to 13, wherein after obtaining the text information output by the first machine learning model and optimized according to the scene information, the method further comprises: inputting the optimized text information and the scene information into a preset intent recognition model, and obtaining the intent information contained in the voice information output by the intent recognition model.
- 15. The electronic device according to claim 14, wherein before inputting the optimized text information and the scene information into the preset intent recognition model, the method further comprises: obtaining the optimized text information and scene information corresponding to each voice information sample in a preset voice information sample set; determining the intent information corresponding to the voice information sample; and inputting the obtained optimized text information and scene information corresponding to the voice information sample into the intent recognition model, obtaining the intent information output by the intent recognition model, and comparing the intent information output by the intent recognition model with the determined intent information corresponding to the voice information sample; if they are inconsistent, adjusting the parameters of the intent recognition model until the intent information output by the intent recognition model is consistent with the determined intent information corresponding to the voice sample.
- 16. A computer-readable storage medium storing computer program instructions, wherein, when the computer instructions are executed by a computer, the computer performs a speech recognition method, and the speech recognition method comprises: obtaining location information of a user terminal; determining, based on the location information, the scene information in which the user terminal is located; if the user's voice information is detected, recognizing the voice information as text information; and inputting the text information and the scene information into a first machine learning model, and obtaining the text information output by the first machine learning model and optimized according to the scene information.
- 17. The computer-readable storage medium according to claim 16, wherein determining the scene information in which the user terminal is located based on the location information comprises: determining the location of the user terminal based on the location information; and determining the scene information in which the user terminal is located based on the location of the user terminal and the correspondence between locations and scenes pre-stored in a database.
- 18. The computer-readable storage medium according to claim 16, wherein before recognizing the voice information as text information, the method further comprises: receiving a user-defined voice segment and the text content corresponding to the voice segment; recognizing the morpheme features of the voice segment, and generating training samples for a second machine learning model according to the morpheme features and the text content corresponding to the voice segment; and training the second machine learning model with the training samples to generate a speech recognition model, so as to recognize the voice information as text information based on the speech recognition model.
- 19. The computer-readable storage medium according to claim 16, wherein recognizing the voice information as text information comprises: obtaining the custom speech recognition set selected by the user terminal; comparing the voice information with the voice segments contained in the custom speech recognition set selected by the user terminal; and if the voice information matches a target voice segment contained in the custom speech recognition set selected by the user terminal, taking the text information corresponding to the target voice segment contained in the custom speech recognition set selected by the user terminal as the recognized text information.
- 20. The computer-readable storage medium according to claim 16, wherein before inputting the text information and the scene information into the first machine learning model and obtaining the optimized text information output by the first machine learning model, the method further comprises: obtaining the text information and scene information corresponding to each voice information sample in a preset voice information sample set; determining the optimized text information corresponding to each voice information sample in the voice information sample set; and inputting the obtained text information and scene information corresponding to the voice information sample into the first machine learning model, obtaining the optimized text information output by the first machine learning model, and comparing the optimized text information output by the first machine learning model with the determined optimized text information corresponding to the voice information sample; if they are inconsistent, adjusting the parameters of the first machine learning model until the optimized text information output by the first machine learning model is consistent with the determined optimized text information corresponding to the voice information sample.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910430228.7A CN110349575A (zh) | 2019-05-22 | 2019-05-22 | Speech recognition method, apparatus, electronic device and storage medium |
CN201910430228.7 | 2019-05-22 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020233363A1 (zh) | 2020-11-26 |
Family
ID=68173954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/087471 WO2020233363A1 (zh) | 2019-05-22 | 2020-04-28 | Speech recognition method, apparatus, electronic device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110349575A (zh) |
WO (1) | WO2020233363A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115457979A (zh) * | 2022-09-22 | 2022-12-09 | 赵显阳 | Video voice analysis, recognition and processing method and system |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349575A (zh) * | 2019-05-22 | 2019-10-18 | 深圳壹账通智能科技有限公司 | Speech recognition method, apparatus, electronic device and storage medium |
CN110956955B (zh) * | 2019-12-10 | 2022-08-05 | 思必驰科技股份有限公司 | Voice interaction method and apparatus |
CN112309387A (zh) * | 2020-02-26 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing information |
CN112259083B (zh) * | 2020-10-16 | 2024-02-13 | 北京猿力未来科技有限公司 | Audio processing method and apparatus |
CN112786055A (zh) * | 2020-12-25 | 2021-05-11 | 北京百度网讯科技有限公司 | Resource mounting method, apparatus, device, storage medium and computer program product |
CN114694645A (zh) * | 2020-12-31 | 2022-07-01 | 华为技术有限公司 | Method and apparatus for determining user intent |
CN118246910A (zh) * | 2024-05-28 | 2024-06-25 | 国网山东省电力公司营销服务中心(计量中心) | Conversational online payment method, system, medium, device and program product |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104240698A (zh) * | 2014-09-24 | 2014-12-24 | 上海伯释信息科技有限公司 | Speech recognition method |
US20150120618A1 (en) * | 2013-10-25 | 2015-04-30 | Samsung Electronics Co., Ltd. | Artificial intelligence audio apparatus and operation method thereof |
CN105448292A (zh) * | 2014-08-19 | 2016-03-30 | 北京羽扇智信息科技有限公司 | Scene-based real-time speech recognition system and method |
CN105719649A (zh) * | 2016-01-19 | 2016-06-29 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN106683662A (zh) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN109509473A (zh) * | 2019-01-28 | 2019-03-22 | 维沃移动通信有限公司 | Voice control method and terminal device |
CN110349575A (zh) * | 2019-05-22 | 2019-10-18 | 深圳壹账通智能科技有限公司 | Speech recognition method, apparatus, electronic device and storage medium |
- 2019-05-22: CN CN201910430228.7A patent/CN110349575A pending
- 2020-04-28: WO PCT/CN2020/087471 patent/WO2020233363A1 active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110349575A (zh) | 2019-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020233363A1 (zh) | Speech recognition method, apparatus, electronic device and storage medium | |
US11887604B1 (en) | Speech interface device with caching component | |
WO2021093449A1 (zh) | Artificial-intelligence-based wake-up word detection method, apparatus, device, and medium | |
CN108520743B (zh) | Voice control method for a smart device, smart device, and computer-readable medium | |
US11817094B2 (en) | Automatic speech recognition with filler model processing | |
US11087739B1 (en) | On-device learning in a hybrid speech processing system | |
CN108133707B (zh) | Content sharing method and system | |
KR102201937B1 (ko) | Predicting subsequent voice queries | |
CN107134279A (zh) | Voice wake-up method, apparatus, terminal, and storage medium | |
US8532992B2 (en) | System and method for standardized speech recognition infrastructure | |
CN114830228A (zh) | Accounts associated with a device | |
CN110047481B (zh) | Method and apparatus for speech recognition | |
US11830482B2 (en) | Method and apparatus for speech interaction, and computer storage medium | |
WO2020024620A1 (zh) | Voice information processing method, apparatus, device, and storage medium | |
CN111916088B (zh) | Speech corpus generation method, device, and computer-readable storage medium | |
CN111462741B (zh) | Voice data processing method, apparatus, and storage medium | |
CN108055617B (zh) | Microphone wake-up method, apparatus, terminal device, and storage medium | |
JP2022037100A (ja) | Voice processing method, apparatus, device, and storage medium for an in-vehicle device | |
CN106649253B (zh) | Post-verification-based auxiliary control method and system | |
CN111833857B (zh) | Voice processing method, apparatus, and distributed system | |
JP7308335B2 (ja) | Test method, apparatus, electronic device, and storage medium for in-vehicle voice equipment | |
CN103514882A (zh) | Speech recognition method and system | |
JP2022101663A (ja) | Human-computer interaction method, apparatus, electronic device, storage medium, and computer program | |
US20240013784A1 (en) | Speaker recognition adaptation | |
CN111091819A (zh) | Speech recognition apparatus and method, and voice interaction system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20810171; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 20810171; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.03.2022) |