CN116959431A - Speech recognition method, device, vehicle, electronic equipment and storage medium - Google Patents

Speech recognition method, device, vehicle, electronic equipment and storage medium

Info

Publication number
CN116959431A
CN116959431A
Authority
CN
China
Prior art keywords
invalid
voice
semantics
recognized
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210971581.8A
Other languages
Chinese (zh)
Inventor
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Co Wheels Technology Co Ltd
Original Assignee
Beijing Co Wheels Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Co Wheels Technology Co Ltd filed Critical Beijing Co Wheels Technology Co Ltd
Priority to CN202210971581.8A priority Critical patent/CN116959431A/en
Publication of CN116959431A publication Critical patent/CN116959431A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides a voice recognition method, a device, a vehicle, electronic equipment and a storage medium, and relates to the technical field of vehicles. The main technical scheme comprises the following steps: performing semantic recognition on received voice to be recognized, and determining whether the semantic recognition result contains invalid semantics; if invalid semantics are determined to be contained, determining an invalid voice duration corresponding to the invalid semantics; and if the invalid voice duration is less than a preset duration threshold, discarding the invalid semantics in the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics. After invalid semantics are recognized in the voice to be recognized, the voice duration of the invalid semantics is detected; if it is determined to be less than the preset duration threshold, the invalid semantics are discarded directly, preventing an invalid audio segment from triggering an abnormal-dialogue prompt in the user's voice dialogue and improving the user's voice dialogue experience.

Description

Speech recognition method, device, vehicle, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of vehicles, and in particular to a voice recognition method and device, a vehicle, an electronic device, and a storage medium.
Background
Voice interaction is an integrated technology that uses speech as the basic information carrier, giving a machine the human-like ability to listen, speak, interact naturally, and respond when asked.
The voice interaction process comprises four parts: speech acquisition, speech recognition (Automatic Speech Recognition, ASR), natural language understanding (Natural Language Understanding, NLU), and speech synthesis (Text To Speech, TTS). Speech acquisition records, samples, and encodes the audio; ASR converts the speech signal into text information that the machine can recognize; NLU performs the corresponding operation according to the text or command obtained from speech recognition; and TTS converts text information back into sound.
At present, once a voice interaction system is woken up, it accepts all of the user's audio for recognition, including short audio with no valid semantics, such as the user coughing. Because ASR cannot recognize such invalid short audio (a cough and the like), the abnormal-dialogue fallback flow is triggered: the system prompts that the voice dialogue is abnormal or asks the user to re-enter the voice, which greatly degrades the user's voice interaction experience.
Disclosure of Invention
The present disclosure provides a voice recognition method and device, an electronic device, and a storage medium, mainly aiming to solve the problem that short audio with invalid semantics triggers the abnormal fallback flow of a voice dialogue and thereby degrades the user's voice interaction experience.
According to a first aspect of the present disclosure, there is provided a method for recognizing speech, including:
carrying out semantic recognition on the received voice to be recognized, and determining whether the semantic recognition result contains invalid semantics;
if it is determined that invalid semantics are contained, determining an invalid voice duration corresponding to the invalid semantics;
if it is determined that the invalid voice duration is less than a preset duration threshold, discarding the invalid semantics in the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics.
Optionally, the method further comprises:
and if it is determined that the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, carrying out semantic recognition again on the voice to be recognized containing the invalid audio segment, wherein the invalid audio segment is the audio segment corresponding to the invalid semantics.
Optionally, the method further comprises:
if it is determined that the re-recognition result includes target invalid semantics, counting the number of occurrences of the target invalid semantics; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold;
if the number of occurrences of the target invalid semantics does not exceed a preset times threshold, outputting a voice prompt for re-entering the voice to be recognized;
and if the number of occurrences of the target invalid semantics exceeds the preset times threshold, exiting the voice interaction.
Optionally, after discarding the invalid semantics in the speech to be recognized, the method further includes:
and performing voice interaction according to the semantic recognition result which does not contain invalid semantics.
Optionally, the method further comprises:
and if the semantic recognition of the voice to be recognized is overtime, outputting voice prompt information of abnormal recognition.
According to a second aspect of the present disclosure, there is provided a voice recognition apparatus including:
the first determining unit is used for carrying out semantic recognition on the received voice to be recognized and determining whether the semantic recognition result contains invalid semantics or not;
the second determining unit is used for determining the invalid voice duration corresponding to the invalid semantics when determining that the invalid semantics are contained;
the discarding unit is used for discarding the invalid semantics in the voice to be recognized when the invalid voice duration is determined to be smaller than a preset duration threshold;
the acquisition unit is used for acquiring a semantic recognition result which does not contain invalid semantics.
Optionally, the apparatus further includes:
and the recognition unit is used for carrying out semantic recognition again on the voice to be recognized containing the invalid audio segment when the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, wherein the invalid audio segment is the audio segment corresponding to the invalid semantics.
Optionally, the apparatus further includes:
the statistics unit is used for counting the number of occurrences of the target invalid semantics when it is determined that the re-recognition result includes target invalid semantics; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold;
the first output unit is used for outputting a voice prompt for re-entering the voice to be recognized when the number of occurrences of the target invalid semantics does not exceed a preset times threshold;
and the exit unit is used for exiting the voice interaction when the number of occurrences of the target invalid semantics exceeds the preset times threshold.
Optionally, the apparatus further includes:
and the interaction unit is used for carrying out voice interaction according to the semantic recognition result which does not contain invalid semantics after the discarding unit.
Optionally, the apparatus further includes:
and the second output unit is used for outputting voice prompt information of abnormal recognition when the semantic recognition of the voice to be recognized is overtime.
In a third aspect of the present disclosure, there is provided a vehicle including the speech recognition device of the foregoing second aspect.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the preceding first aspect.
According to a sixth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
The voice recognition method, device, electronic equipment and storage medium provided by the present disclosure first perform semantic recognition on received voice to be recognized and determine whether the semantic recognition result contains invalid semantics; next, if invalid semantics are determined to be contained, the invalid voice duration corresponding to the invalid semantics is determined; finally, if the invalid voice duration is determined to be less than a preset duration threshold, the invalid semantics are discarded from the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics. Compared with the related art, after recognizing that invalid semantics exist in the voice to be recognized, the embodiments of the present application detect the voice duration of the invalid semantics and, if it is determined to be less than the preset duration threshold, directly discard the invalid semantics, preventing an invalid audio segment from triggering an abnormal-dialogue prompt in the user's voice dialogue and improving the user's voice dialogue experience.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a voice recognition method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another method for speech recognition according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the disclosure;
fig. 4 is a schematic structural diagram of another voice recognition device according to an embodiment of the disclosure;
fig. 5 is a schematic block diagram of an example electronic device 400 provided by an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following describes a method, an apparatus, an electronic device, and a storage medium for recognizing speech according to an embodiment of the present disclosure with reference to the accompanying drawings.
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the disclosure.
As shown in fig. 1, the method comprises the steps of:
step 101, carrying out semantic recognition on the received voice to be recognized, and determining whether the semantic recognition result contains invalid semantics.
After the voice interaction system is woken up, it automatically receives the voice to be recognized from the surrounding environment, performs semantic recognition on it, and carries out the corresponding operation once a valid instruction is recognized. In actual use, however, there may be noise, the user may speak too quietly, or the user may cough, which can lead to recognition results containing invalid semantics or to recognition taking too long.
Semantic recognition and the judgment of invalid semantics can be performed in either of two modes: in the first, after semantic recognition has been performed on all of the voice to be recognized, it is determined whether the semantic recognition result contains invalid semantics; in the second, while the voice to be recognized is being semantically recognized, whether each recognized semantic is invalid is judged in real time. The embodiments of the present application do not limit when the judgment of invalid semantics is made.
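Purely as an illustration of the two checking modes just described, the following Python sketch shows a batch check after full recognition and a per-segment check performed while recognition is running; the Segment structure and all function names are assumptions introduced here for illustration, not part of the disclosure.

from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass
class Segment:
    text: str                  # recognized text for one audio segment
    intent: Optional[str]      # parsed intent, or None when no valid meaning was found
    duration_ms: int           # length of the underlying audio segment in milliseconds

def is_invalid(segment: Segment) -> bool:
    # A segment is treated as having invalid semantics when NLU yields no usable intent.
    return segment.intent is None

def batch_check(segments: List[Segment]) -> List[Segment]:
    # Mode one: recognize the whole utterance first, then inspect the full result.
    return [s for s in segments if is_invalid(s)]

def streaming_check(segments: Iterable[Segment]) -> List[Segment]:
    # Mode two: judge each recognized segment in real time as it arrives.
    invalid = []
    for segment in segments:
        if is_invalid(segment):
            invalid.append(segment)
    return invalid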
Step 102, if the invalid semantics are determined to be included, determining an invalid voice duration corresponding to the invalid semantics.
When the voice to be recognized is received, its duration is recorded automatically, and the voice is segmented according to its voice continuity, sentence-meaning understanding, and the like. After it is determined that invalid semantics exist, the invalid voice segment corresponding to the invalid semantics is obtained directly and its duration is compared with a preset duration threshold; when the invalid voice duration is determined to be less than the preset duration threshold, it is determined that the voice to be recognized contains invalid short audio, and step 103 is executed.
The preset duration threshold is an empirically tested value and can be set according to the actual requirements of different application scenarios, for example 700 ms or 500 ms; the embodiments of the present application do not limit how the preset duration threshold is set.
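A minimal sketch of the step 102 comparison, assuming the 700 ms / 500 ms example values above; the constant and function names are illustrative only.

PRESET_DURATION_THRESHOLD_MS = 700  # tested value; e.g. 700 ms or 500 ms per scenario

def is_invalid_short_audio(invalid_voice_duration_ms: int,
                           threshold_ms: int = PRESET_DURATION_THRESHOLD_MS) -> bool:
    # True when the invalid voice segment is shorter than the preset duration threshold,
    # i.e. it is invalid short audio that step 103 will discard.
    return invalid_voice_duration_ms < threshold_ms

# A 300 ms cough counts as invalid short audio; a 900 ms invalid segment does not.
assert is_invalid_short_audio(300)
assert not is_invalid_short_audio(900)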
Step 103, if the invalid voice duration is determined to be smaller than the preset duration threshold, discarding the invalid semantics in the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics.
In practical applications there are two cases of invalid semantics. In the first, there is only noise and no speech from the user; in the second, some noise is mixed into the user's speech. Here the noise can be sounds without practical meaning, such as coughs or sighs.
When the first case occurs, the voice is recognized as invalid and its duration is determined to be less than the preset duration threshold, so all of the current voice to be recognized is discarded. When the second case occurs, because the voice to be recognized contains voice segments with valid semantics in addition to the invalid voice segment (the voice containing noise), after recognition shows that invalid semantics exist in the semantic recognition result, only the invalid semantics are discarded from the voice to be recognized, and the semantic recognition result which does not contain the invalid semantics is retained.
Illustratively, suppose the user issues the voice instruction 'open music' with a cough in the middle. When the voice to be recognized is semantically recognized, a valid instruction is not recognized at first because of the interference of the cough; since short audio (an invalid audio segment) exists in the voice to be recognized, only the short audio corresponding to the cough in the middle is discarded, and only the voice data corresponding to 'open music' is retained.
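A hedged sketch of step 103 under the same illustrative assumptions as above (a hypothetical Segment record with an intent field): invalid short segments such as the cough are dropped, and the segments carrying 'open music' are kept.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    text: str
    intent: Optional[str]      # None marks an invalid segment (noise, cough, ...)
    duration_ms: int

PRESET_DURATION_THRESHOLD_MS = 700

def discard_invalid_short_audio(segments: List[Segment]) -> List[Segment]:
    # Keep every segment that either carries valid semantics or is too long to be
    # treated as invalid short audio; only invalid short segments are discarded.
    return [s for s in segments
            if s.intent is not None or s.duration_ms >= PRESET_DURATION_THRESHOLD_MS]

# Roughly the example above: "open music" interrupted by a short cough.
utterance = [Segment("open", "media.open", 400),
             Segment("(cough)", None, 300),       # invalid short audio, discarded
             Segment("music", "media.open", 500)]
kept = discard_invalid_short_audio(utterance)
assert [s.text for s in kept] == ["open", "music"]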
In this embodiment, semantic recognition is first performed on the received voice to be recognized to determine whether the semantic recognition result contains invalid semantics; next, if invalid semantics are determined to be contained, the invalid voice duration corresponding to the invalid semantics is determined; finally, if the invalid voice duration is determined to be less than a preset duration threshold, the invalid semantics are discarded from the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics. Compared with the related art, after recognizing that invalid semantics exist in the voice to be recognized, the embodiments of the present application detect the voice duration of the invalid semantics and, if it is less than the preset duration threshold, directly discard it, preventing an invalid audio segment from triggering an abnormal-dialogue prompt in the user's voice dialogue and improving the user's voice dialogue experience.
The above embodiment details the scenario in which the invalid semantics are shorter than the preset duration threshold, that is, in which invalid short audio exists in the voice to be recognized. In practical applications there is also the scenario in which the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold; for that scenario, semantic recognition is performed again on the voice to be recognized containing the invalid audio segment, where the invalid audio segment is the audio segment corresponding to the invalid semantics. The embodiments of the present application provide the following two solutions:
Mode one: if the invalid duration of the invalid semantics is determined to be greater than or equal to the preset duration threshold, semantic recognition is performed again on the invalid audio segment corresponding to the invalid semantics.
In this case, besides the current invalid semantics, the recognition result also contains valid semantics, which indicates that the invalid audio segment and the other voice segments are independent of one another, so semantic recognition is performed on the invalid audio segment alone.
Mode two: if the invalid duration of the invalid semantics is determined to be greater than or equal to the preset duration threshold, semantic recognition is performed again on the voice to be recognized.
That is, after it is recognized that the voice to be recognized contains invalid semantics, the voice to be recognized containing the invalid voice segment is re-recognized as a whole.
As another implementation of the embodiments of the present application, when the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, a voice prompt can also be output directly, without the re-recognition of the two modes above, for example prompting the user to speak again.
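The three alternatives above (re-recognize only the invalid audio segment, re-recognize the whole voice, or prompt the user directly) could be dispatched as in the following sketch; the callables and the mode parameter are assumptions for illustration, not an API from the disclosure.

from typing import Callable, Optional

def handle_long_invalid_audio(invalid_audio: bytes,
                              full_audio: bytes,
                              recognize: Callable[[bytes], Optional[str]],
                              prompt_user: Callable[[str], None],
                              mode: str = "segment") -> Optional[str]:
    if mode == "segment":
        # Mode one: the invalid segment is independent, so re-recognize it alone.
        return recognize(invalid_audio)
    if mode == "full":
        # Mode two: re-recognize the whole voice to be recognized.
        return recognize(full_audio)
    # Alternative implementation: skip re-recognition and prompt the user to speak again.
    prompt_user("Sorry, I did not catch that. Please say it again.")
    return None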
As an extension of the above embodiment, when semantic recognition is performed again on the voice to be recognized containing the invalid audio segment, if it is determined that the re-recognition result includes target invalid semantics, the number of occurrences of the target invalid semantics is counted; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold. If the number of occurrences of the target invalid semantics does not exceed a preset times threshold, a voice prompt for re-entering the voice to be recognized is output; if the number of occurrences exceeds the preset times threshold, the voice interaction is exited.
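A small sketch of the counting logic just described, assuming a hypothetical threshold of two occurrences (the disclosure does not fix a value):

class TargetInvalidSemanticsCounter:
    # Counts long invalid results ("target invalid semantics") across re-recognitions and
    # decides between prompting for re-entry and exiting the voice interaction.
    def __init__(self, preset_times_threshold: int = 2):  # threshold value is assumed
        self.preset_times_threshold = preset_times_threshold
        self.occurrences = 0

    def on_target_invalid_semantics(self) -> str:
        self.occurrences += 1
        if self.occurrences <= self.preset_times_threshold:
            return "prompt_reentry"       # output a voice prompt to re-enter the voice
        return "exit_interaction"         # threshold exceeded: exit the voice interaction

counter = TargetInvalidSemanticsCounter()
assert counter.on_target_invalid_semantics() == "prompt_reentry"
assert counter.on_target_invalid_semantics() == "prompt_reentry"
assert counter.on_target_invalid_semantics() == "exit_interaction"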
As an extension to the embodiment of the foregoing application, an embodiment of the present application provides another method for recognizing speech, as shown in fig. 2, where the method includes:
and step 201, carrying out semantic recognition on the received voice to be recognized.
Step 202, it is confirmed whether the semantic recognition is timed out.
If the time-out is over, step 203 is executed, and if the time-out is not over, step 204 is executed.
Before executing the step, a semantic recognition time threshold is set, for example, 4 seconds or 5 seconds, and can be set by itself according to actual conditions.
Step 203, outputting voice prompt information for identifying abnormality.
And stopping recognition for the voice to be recognized with overtime recognition, discarding the voice to be recognized, prompting the user of abnormal information through voice broadcasting, and then exiting voice interaction.
Step 204, determining whether the semantic recognition result contains invalid semantics.
Based on whether the semantic recognition result is a valid instruction: if it is determined that no invalid semantics are included, step 205 is executed; if it is determined that invalid semantics are included, step 206 is executed.
Step 205, carrying out a voice dialogue according to the semantic recognition result.
According to the instruction information in the semantics, the corresponding operation is executed and feedback is given to the user through a voice broadcast, for example 'OK', 'opened', and the like. The embodiments of the present application do not limit the specific content of the voice dialogue.
Step 206, determining whether the invalid voice duration corresponding to the invalid semantics is less than a preset duration threshold.
If it is determined that the invalid voice duration is less than the preset duration threshold, step 207 is executed; if it is determined that the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, step 208 is executed.
Step 207, discarding the invalid semantics in the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics.
Step 208, performing semantic recognition again on the invalid audio segment, or performing semantic recognition again on the voice to be recognized.
Step 209, if it is determined that the re-recognition result includes target invalid semantics, counting the number of occurrences of the target invalid semantics; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold.
If it is determined that the number of occurrences of the target invalid semantic does not exceed the preset number of times threshold, step 210 is performed, and if it is determined that the number of occurrences of the target invalid semantic exceeds the preset number of times threshold, step 211 is performed.
Step 210, outputting a voice prompt for re-entering the voice to be recognized.
Step 211, the voice interaction is exited.
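To tie steps 201 to 211 together, here is a consolidated sketch of the fig. 2 flow; the recognizer and prompt callables, the return labels, and all threshold values are stated assumptions rather than an implementation taken from the disclosure.

import time
from typing import Callable, Optional, Tuple

RECOGNITION_TIMEOUT_S = 4.0          # step 202, e.g. 4 or 5 seconds
PRESET_DURATION_THRESHOLD_MS = 700   # step 206
PRESET_TIMES_THRESHOLD = 2           # step 209, value assumed

def handle_utterance(recognize: Callable[[], Tuple[Optional[str], int]],
                     say: Callable[[str], None],
                     invalid_occurrences: int) -> str:
    start = time.monotonic()
    intent, invalid_duration_ms = recognize()                # step 201
    if time.monotonic() - start > RECOGNITION_TIMEOUT_S:     # steps 202-203
        say("Recognition is abnormal, please try again later.")
        return "exit"
    if intent is not None:                                   # steps 204-205
        say("OK.")
        return "dialogue"
    if invalid_duration_ms < PRESET_DURATION_THRESHOLD_MS:   # steps 206-207
        return "discard_invalid_short_audio"
    # Step 208 would re-recognize here; assume it still yields target invalid semantics.
    invalid_occurrences += 1                                 # step 209
    if invalid_occurrences <= PRESET_TIMES_THRESHOLD:        # step 210
        say("Please say that again.")
        return "prompt_reentry"
    return "exit"                                            # step 211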
It should be noted that, for steps 201 to 211 of the voice recognition method shown in fig. 2, reference may be made to the description of the foregoing embodiments, which is not repeated here.
As a possible implementation, after the user wakes up the voice interaction system, the system is automatically turned off when it does not recognize any voice data for a long time; the specific time can be set to 7 seconds or 8 seconds, and this embodiment takes a 7-second timeout as an example. When the audio information received by the system is an invalid instruction or short audio data, the duration of the current invalid data is still counted within the timeout period. For example, if a section of voice data is received and discarded after semantic detection and short-audio judgment, and the discarded short audio lasts 500 ms, the remaining timeout period is 6.5 seconds.
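The timing arithmetic in the paragraph above (a 7-second idle timeout from which the 500 ms of discarded short audio is deducted, leaving 6.5 seconds) can be written out as a short sketch; the names are illustrative.

IDLE_TIMEOUT_S = 7.0   # automatic shutdown time after wake-up, e.g. 7 or 8 seconds

def remaining_idle_timeout(discarded_short_audio_ms: int,
                           timeout_s: float = IDLE_TIMEOUT_S) -> float:
    # The duration of discarded invalid data still counts against the idle timeout.
    return timeout_s - discarded_short_audio_ms / 1000.0

assert remaining_idle_timeout(500) == 6.5   # matches the 500 ms / 6.5 s example above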
Corresponding to the voice recognition method, the present application also provides a voice recognition device. Since the device embodiment of the present application corresponds to the above method embodiment, details not disclosed in the device embodiment may refer to the method embodiment and are not described in detail again here.
Fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present disclosure, where, as shown in fig. 3, the voice recognition device includes:
a first determining unit 31, configured to perform semantic recognition on the received voice to be recognized, and determine whether the semantic recognition result includes invalid semantics;
a second determining unit 32, configured to determine, when it is determined that invalid semantics are contained, the invalid voice duration corresponding to the invalid semantics;
a discarding unit 33, configured to discard invalid semantics in the speech to be recognized when it is determined that the invalid speech duration is less than a preset duration threshold;
an obtaining unit 34, configured to obtain a semantic recognition result that does not include invalid semantics.
The voice recognition device provided by the present disclosure first performs semantic recognition on received voice to be recognized and determines whether the semantic recognition result contains invalid semantics; next, if invalid semantics are determined to be contained, the invalid voice duration corresponding to the invalid semantics is determined; finally, if the invalid voice duration is determined to be less than a preset duration threshold, the invalid semantics are discarded from the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics. Compared with the related art, after recognizing that invalid semantics exist in the voice to be recognized, the embodiments of the present application detect the voice duration of the invalid semantics and, if it is less than the preset duration threshold, directly discard it, preventing an invalid audio segment from triggering an abnormal-dialogue prompt in the user's voice dialogue and improving the user's voice dialogue experience.
Further, in a possible implementation manner of this embodiment, as shown in fig. 4, the apparatus further includes:
and the recognition unit 35 is configured to, when it is determined that the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, perform semantic recognition again on the voice to be recognized containing the invalid audio segment, where the invalid audio segment is the audio segment corresponding to the invalid semantics.
Further, in a possible implementation manner of this embodiment, as shown in fig. 4, the apparatus further includes:
a statistics unit 36, configured to count the number of occurrences of the target invalid semantics when it is determined that the re-recognition result includes target invalid semantics; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold;
a first output unit 37, configured to output a voice prompt for re-entering the voice to be recognized when it is determined that the number of occurrences of the target invalid semantics does not exceed a preset times threshold;
and an exit unit 38, configured to exit the voice interaction when it is determined that the number of occurrences of the target invalid semantics exceeds the preset times threshold.
Further, in a possible implementation manner of this embodiment, as shown in fig. 4, the apparatus further includes:
and the interaction unit 39 is used for performing voice interaction according to the semantic recognition result which does not contain invalid semantics after the discarding unit discards the invalid semantics.
Further, in a possible implementation manner of this embodiment, as shown in fig. 4, the apparatus further includes:
and a second output unit 310, configured to output a voice prompt indicating abnormal recognition when the semantic recognition of the voice to be recognized times out.
The foregoing explanation of the method embodiment also applies to the device of this embodiment, the principle being the same, and is not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 5 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 402 or a computer program loaded from a storage unit 408 into a RAM (Random Access Memory) 403. In the RAM 403, various programs and data required for the operation of the device 400 may also be stored. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An I/O (Input/Output) interface 405 is also connected to the bus 404.
Various components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the methods and processes described above, for example the voice recognition method. For example, in some embodiments, the voice recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the aforementioned voice recognition method in any other suitable way (for example, by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LANs (Local Area Networks), WANs (Wide Area Networks), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that artificial intelligence is the discipline that studies how to make a computer simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A voice recognition method, characterized by being applied to a voice interaction process and comprising the following steps:
carrying out semantic recognition on the received voice to be recognized, and determining whether the semantic recognition result contains invalid semantics;
if it is determined that invalid semantics are contained, determining an invalid voice duration corresponding to the invalid semantics;
if it is determined that the invalid voice duration is less than a preset duration threshold, discarding the invalid semantics in the voice to be recognized to obtain a semantic recognition result which does not contain the invalid semantics.
2. The method according to claim 1, wherein the method further comprises:
and if it is determined that the invalid duration of the invalid semantics is greater than or equal to the preset duration threshold, carrying out semantic recognition again on the voice to be recognized containing the invalid audio segment, wherein the invalid audio segment is the audio segment corresponding to the invalid semantics.
3. The method according to claim 2, wherein the method further comprises:
if it is determined that the re-recognition result includes target invalid semantics, counting the number of occurrences of the target invalid semantics; the target invalid semantics are invalid semantics whose corresponding audio has an invalid duration greater than or equal to the preset duration threshold;
if the number of occurrences of the target invalid semantics does not exceed a preset times threshold, outputting a voice prompt for re-entering the voice to be recognized;
and if the number of occurrences of the target invalid semantics exceeds the preset times threshold, exiting the voice interaction.
4. The method of claim 1, wherein after discarding the invalid semantics in the speech to be recognized, the method further comprises:
and performing voice interaction according to the semantic recognition result which does not contain invalid semantics.
5. The method according to claim 1, wherein the method further comprises:
and if the semantic recognition of the voice to be recognized is overtime, outputting voice prompt information of abnormal recognition.
6. A speech recognition apparatus, comprising:
the first determining unit is used for carrying out semantic recognition on the received voice to be recognized and determining whether the semantic recognition result contains invalid semantics or not;
the second determining unit is used for determining the invalid voice duration corresponding to the invalid semantics when determining that the invalid semantics are contained;
the discarding unit is used for discarding the invalid semantics in the voice to be recognized when the invalid voice duration is determined to be smaller than a preset duration threshold;
the acquisition unit is used for acquiring a semantic recognition result which does not contain invalid semantics.
7. A vehicle characterized in that it comprises the speech recognition device according to claim 6.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-5.
CN202210971581.8A 2022-08-12 2022-08-12 Speech recognition method, device, vehicle, electronic equipment and storage medium Pending CN116959431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210971581.8A CN116959431A (en) 2022-08-12 2022-08-12 Speech recognition method, device, vehicle, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210971581.8A CN116959431A (en) 2022-08-12 2022-08-12 Speech recognition method, device, vehicle, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116959431A true CN116959431A (en) 2023-10-27

Family

ID=88459082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210971581.8A Pending CN116959431A (en) 2022-08-12 2022-08-12 Speech recognition method, device, vehicle, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116959431A (en)

Similar Documents

Publication Publication Date Title
CN110047481B (en) Method and apparatus for speech recognition
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
CN112669867B (en) Debugging method and device of noise elimination algorithm and electronic equipment
CN111261143B (en) Voice wakeup method and device and computer readable storage medium
CN114842855A (en) Training and awakening method, device, equipment and storage medium of voice awakening model
CN113658586B (en) Training method of voice recognition model, voice interaction method and device
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN110085264B (en) Voice signal detection method, device, equipment and storage medium
CN114399992B (en) Voice instruction response method, device and storage medium
CN112669837B (en) Awakening method and device of intelligent terminal and electronic equipment
CN116959431A (en) Speech recognition method, device, vehicle, electronic equipment and storage medium
CN114121022A (en) Voice wake-up method and device, electronic equipment and storage medium
CN112509567B (en) Method, apparatus, device, storage medium and program product for processing voice data
CN113903329B (en) Voice processing method and device, electronic equipment and storage medium
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN114429766A (en) Method, device and equipment for adjusting playing volume and storage medium
CN113808585A (en) Earphone awakening method, device, equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN112669839A (en) Voice interaction method, device, equipment and storage medium
CN114203204B (en) Tail point detection method, device, equipment and storage medium
CN113643696B (en) Voice processing method, device, equipment, storage medium and program
CN114356275B (en) Interactive control method and device, intelligent voice equipment and storage medium
CN113345472B (en) Voice endpoint detection method and device, electronic equipment and storage medium
CN113448533B (en) Method and device for generating reminding audio, electronic equipment and storage medium
CN113870841A (en) Voice data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination