CN109302528B - Photographing method, mobile terminal and computer readable storage medium - Google Patents

Publication number
CN109302528B
Authority
CN
China
Prior art keywords
voice
voice signal
photographing
preset
photographing instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810955505.1A
Other languages
Chinese (zh)
Other versions
CN109302528A (en
Inventor
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nubia Technology Co Ltd
Original Assignee
Nubia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nubia Technology Co Ltd
Priority to CN201810955505.1A
Publication of CN109302528A
Application granted
Publication of CN109302528B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72439 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for image or video messaging
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a terminal comprising a memory, a processor, and a photographing program that is stored in the memory and can run on the processor. The photographing method extracts voice signal characteristics from a voice photographing instruction to be recognized; searches a preset voice characteristic database for a standard voice signal characteristic matched with the voice signal characteristics; and, if such a standard voice signal characteristic exists, executes the preset voice photographing instruction corresponding to it. Because the preset voice photographing instruction comprises a preset photographing keyword and a photographing mode, executing it once the voice instruction to be recognized has been recognized allows the photographing mode to be adjusted. This solves the problem that the voice instructions of existing photographing programs cannot adjust the photographing mode, and achieves the effects of enhancing human-computer interaction and improving user experience.

Description

Photographing method, mobile terminal and computer readable storage medium
Technical Field
The invention relates to the technical field of audio and video signal processing, in particular to a photographing method based on voice recognition, a mobile terminal and a computer readable storage medium.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), converts the vocabulary content of human speech into computer-readable input. With social progress and the rapid development of the information industry, speech recognition has become a key technology for human-computer interaction and is applied ever more widely; for example, in-vehicle voice navigation, voice dialing on phones, and intelligent voice toys all rely on it. Applying voice recognition to terminals, especially mobile terminals, is currently the most popular research direction among internet companies, which aim to win customers quickly through the convenience of voice interaction.
Mobile terminals are used more and more widely in daily life and strongly influence how people live and communicate. Taking the smartphone as an example, with the development of science and technology it has gradually begun to replace the traditional PC and has penetrated many aspects of entertainment and everyday life; from the single call function of the past, it now integrates calling, photography, internet access, short messaging, shopping, video, and other functions. The smartphone has brought great convenience to daily life, and at the same time people's demands on it have increased. Existing smartphones include photographing software, much of which provides a voice photographing function: the software is controlled mainly by recognizing voice commands, a design that brings users more convenience and a more interactive experience. However, because these voice commands are generally specified by the system, the user can take a photo by voice only with the system-specified commands. This brings the following problems: 1. when a user wants to take a selfie by voice, the selfie effect achieved with the specified voice command cannot satisfy every user at once; for example, one person may get an attractive smile by saying "cheese", while another prefers to say "eggplant", and so on; 3. the photographing mode used under the voice command must be set in advance, that is, the user must first set the photographing mode (beauty, non-beauty, timed, and so on) and only then issue the voice command to take the photo, which is inconvenient for the user.
Therefore, it is necessary to design a voice recognition-based photographing method applied to a mobile terminal to solve the problem of the existing voice photographing software, improve the convenience of photographing of the mobile terminal, and provide better interactive experience for users.
Disclosure of Invention
The invention mainly aims to provide a photographing method and a photographing terminal based on voice recognition, and aims to solve the problems that the conventional photographing program based on voice recognition is easy to misjudge and a voice instruction cannot set a photographing mode, so that the effects of enhancing man-machine interaction and improving user experience are achieved.
Firstly, in order to achieve the above object, the present invention provides a photographing method based on voice recognition, which is applied to a mobile terminal, and the photographing method includes the following steps:
acquiring a voice photographing instruction to be recognized;
extracting voice signal characteristics in the voice photographing instruction to be recognized;
searching whether a standard voice signal characteristic matched with the voice signal characteristic exists in a preset voice characteristic database; the preset voice feature database is a correlation database of a preset voice photographing instruction and the standard voice signal feature corresponding to the preset voice photographing instruction, and the preset voice photographing instruction comprises a preset photographing keyword and a photographing mode corresponding to the photographing keyword;
and if the standard voice signal characteristics matched with the voice signal characteristics exist, executing the preset voice photographing instruction corresponding to the standard voice signal characteristics.
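The four steps above can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: the names `PresetInstruction`, `find_match` and `handle_voice_instruction`, the tuple-of-floats feature representation, and the fixed `tolerance` used for matching are all hypothetical and are not specified by the text; the feature extractor and the camera action are passed in as callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class PresetInstruction:
    keyword: str  # preset photographing keyword, e.g. "cheese"
    mode: str     # photographing mode associated with the keyword


Features = Tuple[float, ...]
# Each entry associates standard features with a preset instruction.
FeatureDatabase = List[Tuple[Features, PresetInstruction]]


def find_match(features: Features, database: FeatureDatabase,
               tolerance: float = 0.05) -> Optional[PresetInstruction]:
    """Return the first preset instruction whose standard features are
    within `tolerance` of the extracted features (assumed matching rule)."""
    for standard, instruction in database:
        if len(standard) == len(features) and all(
            abs(f - s) <= tolerance for f, s in zip(features, standard)
        ):
            return instruction
    return None


def handle_voice_instruction(samples: Sequence[float],
                             database: FeatureDatabase,
                             extract: Callable[[Sequence[float]], Features],
                             take_photo: Callable[[str], None]) -> bool:
    """Extract features, search the database, and execute on a match."""
    features = extract(samples)
    instruction = find_match(features, database)
    if instruction is None:
        return False
    take_photo(instruction.mode)  # photograph in the mode tied to the keyword
    return True
```

Passing `extract` and `take_photo` as parameters keeps the sketch independent of any particular feature definition or camera API.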
Optionally, the speech signal features include first-level speech signal features and second-level speech signal features, and the standard speech signal features include first-level standard speech signal features and second-level standard speech signal features.
Optionally, the steps of "extracting voice signal characteristics in the voice photographing instruction to be recognized" and "searching whether a standard voice signal characteristic matched with the voice signal characteristic exists in a preset voice characteristic database" include the following steps:
extracting the first-level voice signal characteristics in the voice photographing instruction to be recognized;
searching whether a first-level standard voice signal characteristic matched with the first-level voice signal characteristics exists in the preset voice characteristic database;
if the first-level standard voice signal characteristics matched with the first-level voice signal characteristics exist, extracting second-level voice signal characteristics in the voice photographing instruction to be recognized;
judging whether the second-level voice signal characteristics are matched with second-level standard voice signal characteristics of the preset photographing instruction corresponding to the first-level standard voice signal characteristics;
and if the second-level standard voice signal characteristics are matched with the second-level voice signal characteristics, executing the preset voice photographing instruction corresponding to the first-level standard voice signal characteristics.
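A minimal sketch of this two-level check is given below. It assumes, purely for illustration, that the first-level feature is a single number compared within a `tolerance` and that the second-level feature is compared for equality; `Entry`, `two_level_match` and both extractor callables are hypothetical names. The point it shows is the ordering: second-level features are extracted only after a first-level match has been found.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class Entry:
    level1: float     # first-level standard feature (assumed scalar)
    level2: str       # second-level standard feature (assumed label)
    instruction: str  # preset voice photographing instruction


def two_level_match(signal: Any,
                    database: List[Entry],
                    extract_level1: Callable[[Any], float],
                    extract_level2: Callable[[Any], str],
                    tolerance: float = 0.1) -> Optional[str]:
    """Return the matching preset instruction, or None if either level fails."""
    f1 = extract_level1(signal)
    candidate = next(
        (e for e in database if abs(e.level1 - f1) <= tolerance), None
    )
    if candidate is None:
        return None  # no first-level match: second level is never extracted
    f2 = extract_level2(signal)
    if f2 == candidate.level2:
        return candidate.instruction
    return None
```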
Optionally, the first-level voice signal feature includes an energy feature and an amplitude feature of the voice photographing instruction to be recognized, and the first-level standard voice signal feature includes an energy feature and an amplitude feature of the preset voice photographing instruction.
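The text does not define how the energy and amplitude features are computed. One common choice, assumed here only as an example, is short-time energy (sum of squared samples per frame) together with mean absolute amplitude per frame; the function name and the non-overlapping framing are hypothetical.

```python
from typing import List, Sequence, Tuple


def frame_energy_and_amplitude(samples: Sequence[float],
                               frame_len: int) -> List[Tuple[float, float]]:
    """Split samples into non-overlapping frames and return, per frame,
    (short-time energy, mean absolute amplitude).
    A trailing partial frame is dropped."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        amplitude = sum(abs(x) for x in frame) / frame_len
        features.append((energy, amplitude))
    return features
```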
Optionally, the step of "searching in a preset speech feature database for whether there is a standard speech signal feature matching the speech signal feature" includes the following steps:
searching the preset voice characteristic database for a characteristic interval of a standard voice signal characteristic such that the voice signal characteristic does not fall outside that characteristic interval.
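Interval-based matching of this kind might look like the sketch below. It assumes each standard feature is stored as an inclusive `(lower, upper)` interval per feature component; the type alias `Interval` and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound), inclusive


def within_intervals(features: Sequence[float],
                     intervals: Sequence[Interval]) -> bool:
    """True if every feature component lies inside its interval."""
    if len(features) != len(intervals):
        return False
    return all(low <= value <= high
               for value, (low, high) in zip(features, intervals))


def find_by_interval(features: Sequence[float],
                     database: List[Tuple[Sequence[Interval], str]]
                     ) -> Optional[str]:
    """Return the instruction of the first entry whose intervals contain
    the extracted features, or None if no entry does."""
    for intervals, instruction in database:
        if within_intervals(features, intervals):
            return instruction
    return None
```

Using intervals rather than exact values tolerates the natural variation between two utterances of the same keyword.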
Optionally, before the step of "acquiring the voice photographing instruction to be recognized", the method further includes constructing the preset voice feature database; the step of constructing the preset voice feature database includes the following steps:
collecting the preset voice photographing instruction;
extracting standard voice signal characteristics corresponding to the preset voice photographing instruction;
and establishing an associated database of the preset voice photographing instruction and the standard voice signal characteristics corresponding to the preset voice photographing instruction.
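The construction steps above can be sketched as a single pass over the collected recordings. The record type `DatabaseEntry`, the `(keyword, mode, samples)` input shape, and the `extract` callable are assumptions made for illustration; the text does not prescribe a storage layout.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

Features = Tuple[float, ...]


@dataclass(frozen=True)
class DatabaseEntry:
    keyword: str                 # preset photographing keyword
    mode: str                    # photographing mode for that keyword
    standard_features: Features  # features extracted from the recording


def build_feature_database(
    recordings: Iterable[Tuple[str, str, Sequence[float]]],
    extract: Callable[[Sequence[float]], Features],
) -> List[DatabaseEntry]:
    """For each collected (keyword, mode, samples) recording, extract the
    standard features and associate them with the preset instruction."""
    return [
        DatabaseEntry(keyword, mode, extract(samples))
        for keyword, mode, samples in recordings
    ]
```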
Optionally, before the "extracting the voice signal feature in the voice photographing instruction to be recognized", the following steps are further included:
and preprocessing the voice photographing instruction to be recognized.
Optionally, in addition to the step of "if the standard voice signal characteristics matched with the voice signal characteristics exist, executing the preset voice photographing instruction corresponding to the standard voice signal characteristics", the photographing method further includes the following step:
if no standard voice signal characteristic matched with the voice signal characteristics exists, acquiring the voice photographing instruction to be recognized again, or issuing a prompt to update the preset voice feature database.
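The text offers re-acquisition and an update prompt as alternatives. One way to combine them, assumed here and not stated in the source, is to retry a bounded number of times and prompt only when every attempt fails; `recognise_with_fallback`, `max_attempts` and the three callables are hypothetical.

```python
from typing import Any, Callable, Optional


def recognise_with_fallback(acquire: Callable[[], Any],
                            match: Callable[[Any], Optional[str]],
                            prompt_update: Callable[[], None],
                            max_attempts: int = 3) -> Optional[str]:
    """Re-acquire the voice instruction until it matches a preset
    instruction; after `max_attempts` failures, prompt the user to
    update the preset voice feature database and return None."""
    for _ in range(max_attempts):
        instruction = match(acquire())
        if instruction is not None:
            return instruction
    prompt_update()
    return None
```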
A mobile terminal comprises a memory, a processor and a photographing program based on voice recognition, wherein the photographing program is stored on the memory and can run on the processor, and when being executed by the processor, the steps of the photographing method are realized.
To achieve the above object, the present invention further provides a computer-readable storage medium, which stores a photographing program, and when the photographing program is executed by a processor, the steps of the photographing method are implemented.
Compared with the prior art, the photographing method based on voice recognition, the mobile terminal and the computer readable storage medium provided by the invention acquire the voice photographing instruction to be recognized; extract the voice signal characteristics in the voice photographing instruction to be recognized; search a preset voice characteristic database for a standard voice signal characteristic matched with the voice signal characteristics; and, if such a standard voice signal characteristic exists, execute the preset voice photographing instruction corresponding to it. The voice photographing instruction to be recognized is recognized based on voice recognition technology, and the preset voice photographing instruction comprises a preset photographing keyword and a photographing mode. Executing the preset voice photographing instruction once the voice instruction to be recognized has been recognized therefore both takes the photo and adjusts the photographing mode. This solves the problem that the voice instructions of existing voice-recognition-based photographing programs cannot adjust the photographing mode, and achieves the effects of enhancing human-computer interaction and improving user experience. At the same time, the convenience of photographing with the mobile terminal is improved, and a better interactive experience is provided for users.
Drawings
FIG. 1 is a schematic diagram of a hardware structure of an optional terminal for implementing various embodiments of the present invention;
FIG. 2 is a diagram of a communication network system architecture according to an embodiment of the present invention;
FIG. 3 is a schematic view of a first embodiment of a photographing method according to the present invention;
FIG. 4 is a schematic diagram of an implementation process of constructing a preset speech feature database according to the present invention;
FIG. 5 is a schematic view of a second embodiment of a photographing method according to the present invention;
FIG. 6 is a functional block diagram of a first embodiment of a photographing program according to the present invention;
FIG. 7 is a functional block diagram of a second embodiment of a photographing program according to the present invention;
FIG. 8 is a schematic interface diagram illustrating an embodiment of the present invention when a preset voice photographing instruction is collected;
FIG. 9 is a diagram illustrating an interface for prompting updating of a speech feature database according to an embodiment of the present invention.
Reference numerals:
[Reference numeral tables: provided as images in the original publication; contents not recoverable here.]
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for facilitating the explanation of the present invention, and have no specific meaning in itself. Thus, "module", "component" or "unit" may be used mixedly.
The terminal may be implemented in various forms. For example, the terminal described in the present invention may include a mobile terminal such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, and the like, and a fixed terminal such as a Digital TV, a desktop computer, and the like.
In the following description, a terminal will be exemplified, and it will be understood by those skilled in the art that the configuration of the embodiment of the present invention can be applied to a mobile terminal in addition to a fixed type terminal, particularly after adding elements particularly for mobile purposes.
Referring to FIG. 1, which is a schematic diagram of a hardware structure of a terminal for implementing various embodiments of the present invention, the terminal 100 may include: an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will appreciate that the terminal configuration shown in FIG. 1 is not intended to be limiting, and that the terminal may include more or fewer components than those shown, may combine some components, or may arrange the components differently.
The following describes the various components of the terminal in detail with reference to fig. 1:
The radio frequency unit 101 may be configured to receive and transmit signals during information transmission and reception or during a call; specifically, it receives downlink information from a base station and passes it to the processor 110 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, as applied to a mobile terminal, the radio frequency unit 101 may also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA2000 (Code Division Multiple Access 2000), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division Duplexing-Long Term Evolution), and TDD-LTE (Time Division Duplexing-Long Term Evolution).
WiFi belongs to short-distance wireless transmission technology, and the terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 102, and provides wireless broadband internet access for the user. Although fig. 1 shows the WiFi module 102, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the terminal 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 may include a speaker, a buzzer, and the like.
The A/V input unit 104 is used to receive audio or video signals. The A/V input unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or another storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 may receive sound in a phone call mode, a recording mode, a voice recognition mode, or the like, and may process such sound into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101 and then output. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.
The terminal 100 also includes at least one sensor 105, such as a light sensor, a temperature sensor, and other sensors. In particular, the light sensor includes an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 1061 according to the brightness of ambient light, and the proximity sensor can turn off the display panel 1061 and/or the backlight when the terminal 100 is moved to the ear. In addition, a motion sensor is generally added to the mobile terminal. As one type of motion sensor, the accelerometer can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the mobile terminal is stationary, and can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration), for vibration-recognition functions (such as a pedometer and tap detection), and the like. Other sensors that can be configured on the mobile phone, such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described further here.
The display unit 106 is used to display information input by a user or information provided to the user. The Display unit 106 may include a Display panel 1061, and the Display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. Specifically, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect touch operations performed by a user on or near it (e.g., operations performed on or near the touch panel 1071 using a finger, a stylus, or any other suitable object or accessory) and drive a corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 110, and can receive and execute commands sent by the processor 110. In addition, the touch panel 1071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1071, the other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
Further, the touch panel 1071 may cover the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch panel 1071 transmits the touch operation to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of the touch event. Although the touch panel 1071 and the display panel 1061 are shown in fig. 1 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 1071 and the display panel 1061 may be integrated to implement the input and output functions of the mobile terminal, and is not limited herein.
The interface unit 108 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the terminal 100 or may be used to transmit data between the terminal 100 and the external device.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 110 is a control center of the terminal, connects various parts of the entire terminal using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the terminal. Processor 110 may include one or more processing units; preferably, the processor 110 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The terminal 100 may further include a power supply 111 (e.g., a battery) for supplying power to various components, and preferably, the power supply 111 may be logically connected to the processor 110 through a power management system, so as to manage charging, discharging, and power consumption management functions through the power management system.
Although not shown in fig. 1, the terminal 100 may further include a bluetooth module or the like, which is not described in detail herein.
In order to facilitate understanding of the embodiments of the present invention, a description is given below of a communication network system on which the terminal of the present invention, particularly a mobile terminal is based.
Referring to FIG. 2, which is an architecture diagram of a communication network system according to an embodiment of the present invention, the communication network system is an LTE system of universal mobile telecommunications technology. The LTE system includes a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an operator's IP services 204, which are communicatively connected in sequence.
Specifically, the UE 201 may be the terminal 100 described above, and is not described herein again.
The E-UTRAN 202 includes an eNodeB 2021, other eNodeBs 2022, and the like. The eNodeB 2021 may be connected with the other eNodeBs 2022 through a backhaul (e.g., an X2 interface), the eNodeB 2021 is connected to the EPC 203, and the eNodeB 2021 may provide the UE 201 with access to the EPC 203.
The EPC 203 may include an MME (Mobility Management Entity) 2031, an HSS (Home Subscriber Server) 2032, other MMEs 2033, an SGW (Serving Gateway) 2034, a PGW (PDN Gateway) 2035, a PCRF (Policy and Charging Rules Function) 2036, and the like. The MME 2031 is a control node that handles signaling between the UE 201 and the EPC 203 and provides bearer and connection management. The HSS 2032 provides registers, such as a home location register (not shown), and holds subscriber-specific information about service characteristics, data rates, and so on. All user data may be sent through the SGW 2034; the PGW 2035 may provide IP address assignment for the UE 201 and other functions; and the PCRF 2036 is the policy and charging control decision point for service data flows and IP bearer resources, which selects and provides available policy and charging control decisions for a policy and charging enforcement function (not shown).
The IP services 204 may include the internet, intranets, IMS (IP Multimedia Subsystem), or other IP services, among others.
Although the LTE system is described as an example, it should be understood by those skilled in the art that the present invention is not limited to the LTE system, but may also be applied to other wireless communication systems, such as GSM, CDMA2000, WCDMA, TD-SCDMA, and future new network systems.
Based on the above hardware structure of the terminal 100 and the communication network system, various embodiments of the method of the present invention are proposed.
First, the present invention provides a photographing method based on voice recognition, which is applied to a mobile terminal shown in fig. 1 to 2, where the mobile terminal includes a memory and a processor. Fig. 3 is a flowchart illustrating a first embodiment of the photographing method according to the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 3 may be changed and some steps may be omitted according to different requirements. The photographing method comprises the following steps:
Step S301, a voice photographing instruction to be recognized is obtained.
In this embodiment, the user issues a voice photographing instruction to be recognized, and the mobile terminal captures it. The captured voice signal contains pauses and ambient noise; in essence, it consists of a series of silent segments and voiced segments, and the information the mobile terminal needs lies in the voiced segments. The captured voice photographing instruction therefore needs to be processed. Preferably, after the voice photographing instruction to be recognized is obtained, it is preprocessed; the preprocessing includes frequency reduction and denoising, followed by endpoint detection. Specifically, the processor of the mobile terminal acquires the voice photographing instruction to be recognized and performs frequency reduction and denoising on its sound signal to enhance the signal strength and reduce the noise amplitude; it then performs endpoint detection, that is, voice activity detection on the frequency-reduced and denoised signal, to determine the starting endpoint and the ending endpoint of the voice signal.
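The endpoint detection described above can be illustrated with a minimal sketch. The source does not specify a detection algorithm, so the frame-energy rule used here, the function names, the frame length, and the threshold value are all assumptions chosen for illustration only.

```python
def frame_energies(samples, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=4, threshold=0.1):
    """Return (start, end) sample indices of the voiced region, or None.

    A frame counts as voiced when its energy exceeds `threshold`
    (a hypothetical tuning value, not taken from the source).
    """
    energies = frame_energies(samples, frame_len)
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None  # only silent segments were captured
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len


# Silent segment, voiced segment, silent segment.
signal = [0.0] * 8 + [0.5, -0.6, 0.7, -0.5, 0.6, -0.4, 0.5, -0.6] + [0.0] * 8
endpoints = detect_endpoints(signal)
```

Only the samples between the two returned indices would be passed on to feature extraction in step S302.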
Step S302, extracting the voice signal characteristics in the voice photographing instruction to be recognized.
Specifically, after the voice photographing instruction to be recognized is preprocessed, the voice signal characteristics can be obtained through various voice analysis methods.
Step S303, searching whether a standard voice signal characteristic matched with the voice signal characteristic exists in a preset voice characteristic database; the preset voice feature database is a correlation database of preset voice photographing instructions and standard voice signal features corresponding to the preset voice photographing instructions, and the preset voice photographing instructions comprise preset photographing keywords and photographing modes corresponding to the photographing keywords.
Specifically, the preset voice feature database is searched for a feature interval of a standard voice signal feature that contains the voice signal feature, that is, an interval whose bounds the voice signal feature does not exceed. It should be noted that, unlike the prior art, the preset voice photographing instruction in this embodiment includes not only a preset photographing keyword but also a photographing mode corresponding to that keyword. The photographing mode may be a long-range photographing mode, a beauty photographing mode, a timed photographing mode, a still photographing mode, and the like. The user speaks different photographing keywords to the mobile terminal, and because different keywords correspond to different photographing modes, the photographing mode of the camera can be adjusted through the voice photographing instruction, which enhances human-computer interaction and improves user experience.
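The interval lookup of step S303 might be sketched as follows. The source does not define the shape of a feature, so a single scalar feature is assumed here; the record layout, the keyword names, and the numeric bounds are hypothetical.

```python
from typing import List, NamedTuple, Optional


class PresetInstruction(NamedTuple):
    keyword: str   # preset photographing keyword
    mode: str      # photographing mode tied to the keyword
    low: float     # lower bound of the standard feature interval
    high: float    # upper bound of the standard feature interval


def find_match(feature: float,
               database: List[PresetInstruction]) -> Optional[PresetInstruction]:
    """Return the preset whose feature interval contains `feature`, if any."""
    for preset in database:
        if preset.low <= feature <= preset.high:
            return preset
    return None


# Hypothetical database contents, for illustration only.
DATABASE = [
    PresetInstruction("keyword_a", "beauty", 0.20, 0.35),
    PresetInstruction("keyword_b", "timed", 0.50, 0.65),
]
```

A real feature would more likely be a vector, in which case each dimension would need its own interval check.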
In addition, the preset voice feature database may be generated by the system or, as a further improvement of this embodiment, may be user-defined. Before the step of "obtaining the voice photographing instruction to be recognized", the method further includes constructing the preset voice feature database; the method of "constructing the preset voice feature database" includes the following steps, as shown in fig. 4.
Step S401, collecting the preset voice photographing instruction.
Specifically, the user opens the collection module of the mobile terminal to perform the setup, as shown in fig. 8, and the mobile terminal collects the preset voice photographing instructions. Different photographing modes of the mobile terminal correspond to different photographing keywords, and the choice of keywords is left entirely to the user: a user who prefers English pronunciation can set all photographing keywords in English, a user accustomed to a dialect can choose dialect words as photographing keywords, and so on, so that personalized customization is achieved. For example, when setting up the beauty shooting mode, the user speaks "applet" three times in succession, and when setting up the non-beauty shooting mode, the user speaks "apple" three times in succession, and so on; once the setup is complete, the photographing keyword for the beauty shooting mode is "applet" and the photographing keyword for the non-beauty shooting mode is "apple". It should be noted that, on statistical grounds, the same preset voice photographing instruction is generally collected multiple times (three or more), so as to avoid collection errors caused by user mistakes.
Step S402, extracting the standard voice signal features corresponding to the preset voice photographing instruction.
Specifically, similar to step S302, after the preset voice photographing instruction is preprocessed, the standard voice signal features in it can be obtained through various voice analysis methods. The preprocessing here, which includes denoising and endpoint detection, is the same as in step S301 and is not described again. It should be noted separately that, because the same preset voice photographing instruction is collected multiple times, the extracted standard voice signal feature is a feature interval rather than an isolated point.
Step S403, establishing an association database of the preset voice photographing instruction and a standard voice signal characteristic corresponding to the preset voice photographing instruction.
In this embodiment, specifically, the preset voice photographing instruction and the standard voice signal feature obtained in the foregoing step are collected together, and an association database in which the preset voice photographing instruction and the standard voice signal feature are mapped one by one is established. By establishing the customized associated database, on one hand, the accuracy of voice recognition can be improved, misjudgment is reduced, on the other hand, personalized customization of a user can be realized, and the interactivity of the user and the mobile terminal is enhanced.
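Steps S401 to S403 can be sketched as below. The source states only that repeated collection yields a feature interval rather than a point; taking the minimum and maximum of the collected values as the interval bounds, the optional margin, and the dictionary layout are assumptions made for this illustration.

```python
def feature_interval(values, margin=0.0):
    """Turn repeated measurements of one instruction into an interval.

    The source asks for three or more collections; `margin` is a
    hypothetical widening term, not something the source specifies.
    """
    if len(values) < 3:
        raise ValueError("collect the instruction at least three times")
    return min(values) - margin, max(values) + margin


def build_database(recordings):
    """Map each keyword to its photographing mode and feature interval.

    `recordings` maps (keyword, mode) to the list of feature values
    extracted from the repeated recordings of that keyword.
    """
    return {
        keyword: {"mode": mode, "interval": feature_interval(values)}
        for (keyword, mode), values in recordings.items()
    }


# Hypothetical feature values from three recordings per keyword.
database = build_database({
    ("keyword_a", "beauty"): [0.30, 0.25, 0.28],
    ("keyword_b", "non-beauty"): [0.61, 0.58, 0.64],
})
```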
Step S304, if the standard voice signal feature matched with the voice signal feature exists, executing the preset voice photographing instruction corresponding to the standard voice signal feature.
It should be noted that when a standard voice signal feature matching the voice signal feature is found in the preset voice feature database, the voice signal feature of the voice photographing instruction to be recognized falls within a feature interval of a standard voice signal feature stored in the voice feature database, and the instruction is recognized. Based on the one-to-one mapping between preset voice photographing instructions and standard voice signal features, the processor then executes the preset voice photographing instruction corresponding to that standard voice signal feature, that is, it takes the photograph in the photographing mode of that preset voice photographing instruction.
In the other case in this embodiment, if no standard voice signal feature matching the voice signal feature exists, the voice photographing instruction to be recognized is obtained again, or the user is prompted to update the preset voice feature database. More specifically, when no matching standard voice signal feature is found in the preset voice feature database, the voice signal feature of the voice photographing instruction to be recognized does not fall within any feature interval of the standard voice signal features stored in the database and cannot be recognized. In this case, the mobile terminal actively ends the current photographing process, enters the next one, and obtains a voice photographing instruction to be recognized again. It is possible, however, that the instruction still cannot be recognized, and no photograph is taken, after the user has issued it many times. One likely cause is that the preset voice feature database has not yet collected this user's voice photographing instruction, so the mobile terminal may directly display an option prompting the user to update the preset voice feature database (this is, of course, only possible when the preset voice feature database can be customized); as shown in fig. 9, the user may then choose to reset the voice feature database.
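The two outcomes of a failed lookup (listen again, or prompt a database update) could be organized as in the following sketch. The source says only that the prompt may appear after the instruction fails "many times"; the failure limit of three, the action names, and the matcher callback are assumptions.

```python
def handle_attempts(features, matcher, max_failures=3):
    """Return the action taken for each spoken attempt, in order.

    `matcher` returns a preset (any non-None value) on a match and None
    otherwise. `max_failures` is a hypothetical limit after which the
    terminal prompts the user to update the preset database.
    """
    actions = []
    failures = 0
    for feature in features:
        preset = matcher(feature)
        if preset is not None:
            actions.append(("shoot", preset))       # execute the preset
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                actions.append(("prompt_update", None))
                failures = 0
            else:
                actions.append(("listen_again", None))
    return actions


def example_matcher(feature):
    """Toy matcher: one hypothetical interval mapped to the beauty mode."""
    return "beauty" if 0.20 <= feature <= 0.35 else None


log = handle_attempts([0.9, 0.9, 0.9, 0.3], example_matcher)
```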
Through steps S301 to S304, the photographing method based on voice recognition provided by the present invention obtains the voice photographing instruction to be recognized; extracts the voice signal feature from it; searches the preset voice feature database for a standard voice signal feature matching the voice signal feature; and, if such a feature exists, executes the preset voice photographing instruction corresponding to it. The voice photographing instruction to be recognized is recognized by means of voice recognition technology, and because the preset voice photographing instruction includes both a preset photographing keyword and a photographing mode, executing it after recognition both takes the photograph and adjusts the photographing mode. Compared with the prior art, this solves the problem that the voice instructions of existing voice-recognition-based photographing programs cannot adjust the photographing mode, enhances human-computer interaction, makes photographing with the mobile terminal more convenient, and provides a better interaction experience for users.
Further, based on the first embodiment described above, a second embodiment of the photographing method of the present invention is proposed. Fig. 5 is a flowchart illustrating a second embodiment of the photographing method according to the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 5 may be changed and some steps may be omitted according to different requirements. The photographing method comprises the following steps:
Step S501, a voice photographing instruction to be recognized is obtained.
Step S502, extracting the first-level voice signal features in the voice photographing instruction to be recognized.
Compared with the first embodiment, the voice signal features of the voice photographing instruction to be recognized in this embodiment include first-level voice signal features and second-level voice signal features, and the standard voice signal features include first-level standard voice signal features and second-level standard voice signal features. The first-level voice signal features comprise the energy feature and the amplitude feature of the voice photographing instruction to be recognized, and the first-level standard voice signal features comprise the energy feature and the amplitude feature of the preset voice photographing instruction. Here, the energy feature refers to the energy distribution of the voice signal in the voice photographing instruction, and the amplitude feature refers to the amplitude variation of that voice signal. The short-time energy and the zero-crossing rate intuitively reflect the energy distribution and the amplitude variation of the voice signal and are important time-domain parameters of a voice signal, so in this embodiment they are used as the feature values of the first-level voice signal features. They are calculated as follows.
Let the short-time energy of the voice photographing instruction to be recognized be $E_n$. $E_n$ is calculated as follows:
Figure RE-GDA0001830639150000151
where the voice photographing instruction to be recognized is divided into frames, and $X_n(m)$ denotes the time-domain values of the n-th frame of data.
Let the zero-crossing rate of the voice photographing instruction to be recognized be $Z_n$. $Z_n$ can be calculated as follows:
Figure RE-GDA0001830639150000152
where $\mathrm{Sgn}[\cdot]$ denotes the sign function, that is:
Figure RE-GDA0001830639150000153
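The three formula images referenced above are not reproduced in this text. For orientation, the definitions commonly used for these quantities in speech processing are given below. They are offered as an assumption about what the figures contain: the frame length $N$ and the summation limits in particular do not come from the source, and the original may apply a different window.

```latex
E_n = \sum_{m=0}^{N-1} X_n^{2}(m)

Z_n = \frac{1}{2} \sum_{m=1}^{N-1}
      \bigl|\, \mathrm{Sgn}[X_n(m)] - \mathrm{Sgn}[X_n(m-1)] \,\bigr|

\mathrm{Sgn}[x] =
\begin{cases}
  1,  & x \ge 0 \\
  -1, & x < 0
\end{cases}
```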
The first-level standard voice signal features are calculated in the same way as the first-level voice signal features and are not described again here.
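The two first-level features can be computed per frame as in the sketch below. It uses the textbook definitions (short-time energy as the sum of squared samples, zero-crossing rate as half the summed absolute difference of adjacent signs), which are assumed here because the source's formula images are not available; the function names are illustrative.

```python
def sgn(value):
    """Sign function: 1 for non-negative values, -1 for negative values."""
    return 1 if value >= 0 else -1


def short_time_energy(frame):
    """Sum of squared time-domain values of one frame."""
    return sum(v * v for v in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between adjacent samples of one frame."""
    return sum(
        abs(sgn(frame[m]) - sgn(frame[m - 1])) for m in range(1, len(frame))
    ) / 2


def split_frames(samples, frame_len):
    """Cut the signal into non-overlapping frames (no windowing applied)."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


frame = [1.0, -2.0, 3.0, 0.5, -1.0]
energy = short_time_energy(frame)
crossings = zero_crossing_rate(frame)
```

Each sign change contributes 2 to the sum, which is why the total is halved.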
Step S503, searching whether there is a first-level standard speech signal feature matching the first-level speech signal feature in the preset speech feature database.
Step S504, if the first-level standard voice signal characteristics matched with the first-level voice signal characteristics exist, extracting second-level voice signal characteristics in the voice photographing instruction to be recognized.
It should be noted that the second-level voice signal features are obtained in the same way as the second-level standard voice signal features described below, namely by the Empirical Mode Decomposition (EMD) method. The voice photographing instruction to be recognized is decomposed into a series of components in different frequency bands, and the energy values of all frequency band components are calculated as the second-level voice signal features. The voice signal in the voice photographing instruction to be recognized is decomposed as follows:
1) Let the voice instruction data sample to be decomposed be $x(t)$. Find all local maximum points and local minimum points of $x(t)$, interpolate them with a cubic spline interpolation algorithm to obtain the upper envelope and the lower envelope of $x(t)$, and compute the mean of the two envelopes, $m_1$. Let $h_1$ denote the difference between the signal $x(t)$ and $m_1$; then
$h_1 = x(t) - m_1$
2) If $h_1$ satisfies the defining conditions of a mode function, proceed to the next step; if not, replace the signal $x(t)$ with $h_1$ and repeat step 1), which gives:
$h_{11} = h_1 - m_{11}$
Then judge whether $h_{11}$ satisfies the conditions. Repeating the above operation k times gives
$h_{1k} = h_{1(k-1)} - m_{1k}$
Screening stops when $h_{1k}$ satisfies the mode function conditions, and $h_{1k}$ is then taken as the first frequency component $\mathrm{IMF}_1$:
$\mathrm{IMF}_1 = h_{1k}$
3) Let $r_1$ denote the difference between the original signal and $\mathrm{IMF}_1$; then $r_1$ is expressed as:
$r_1 = x(t) - \mathrm{IMF}_1$
4) Replace the original signal $x(t)$ with $r_1$ and repeat steps 1) to 3) to obtain n frequency components $\mathrm{IMF}_1, \ldots, \mathrm{IMF}_n$; the residues are then expressed as follows:
$r_2 = r_1 - \mathrm{IMF}_2, \ldots, r_n = r_{n-1} - \mathrm{IMF}_n$
Combining all the above steps, the voice instruction $x(t)$ is decomposed into the superposition of a series of components in different frequency bands and a residual component, that is:
$x(t) = \sum_{i=1}^{n} \mathrm{IMF}_i + r_n$
On this basis, the energy $E_i$ of each frequency component of the voice photographing instruction to be recognized is calculated.
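The sifting procedure above can be sketched in simplified form. This is not the decomposition exactly as described: the envelopes are piecewise linear rather than cubic splines, and sifting runs for a fixed number of iterations instead of testing the mode function conditions. Both simplifications, along with all names and default values, are assumptions made to keep the example short and dependency-free.

```python
import math


def extrema(x):
    """Indices of strict interior local maxima and minima."""
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima


def envelope(x, knots):
    """Piecewise-linear envelope through x at `knots`, held flat at the ends."""
    env = []
    for t in range(len(x)):
        if t <= knots[0]:
            env.append(x[knots[0]])
        elif t >= knots[-1]:
            env.append(x[knots[-1]])
        else:
            j = max(k for k in range(len(knots)) if knots[k] <= t)
            a, b = knots[j], knots[j + 1]
            w = (t - a) / (b - a)
            env.append(x[a] * (1 - w) + x[b] * w)
    return env


def sift(x, iterations=8):
    """Repeatedly subtract the envelope mean (h_1k = h_1(k-1) - m_1k)."""
    h = list(x)
    for _ in range(iterations):
        maxima, minima = extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper, lower = envelope(h, maxima), envelope(h, minima)
        h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
    return h


def emd(x, max_imfs=4):
    """Return (components, residue) with x = sum(components) + residue."""
    imfs, residue = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = extrema(residue)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema left to build envelopes
        imf = sift(residue)
        imfs.append(imf)
        residue = [r - c for r, c in zip(residue, imf)]  # r_n = r_(n-1) - IMF_n
    return imfs, residue


def band_energies(imfs):
    """Energy E_i of each component, taken as the sum of squared values."""
    return [sum(v * v for v in imf) for imf in imfs]


signal = [math.sin(2 * math.pi * t / 8) + 0.5 * math.sin(2 * math.pi * t / 40)
          for t in range(200)]
imfs, residue = emd(signal)
energies = band_energies(imfs)
```

Because each residue is formed by subtraction, the components and the final residue always add back up to the input, whatever envelope method is used.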
Step S505, determining whether the second-level speech signal feature matches the second-level standard speech signal feature of the preset speech photographing instruction corresponding to the first-level standard speech signal feature.
In this embodiment, the second-level standard voice signal feature is obtained in the same way as the second-level voice signal feature, that is, by the Empirical Mode Decomposition (EMD) method, which yields the energy $E$ of each frequency component of the preset voice photographing instruction. It is then judged whether the energy $E_i$ of each frequency component of the voice photographing instruction to be recognized lies within the threshold interval of the energy $E$ of the corresponding frequency component of the preset voice photographing instruction; if so, the second-level voice signal feature is judged to match the second-level standard voice signal feature.
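The second-level comparison can be sketched as a per-component interval test, treating the features as matching when every component energy lies inside its interval. The source does not say how the threshold interval is derived, so a relative tolerance around each standard energy is assumed here, and the function name and default tolerance are hypothetical.

```python
def second_level_match(energies, standard_energies, tolerance=0.2):
    """Check every component energy against its standard counterpart.

    The threshold interval is assumed to be
    [E * (1 - tolerance), E * (1 + tolerance)] for each standard energy E.
    """
    if len(energies) != len(standard_energies):
        return False  # different number of frequency components
    return all(
        abs(e - s) <= tolerance * s
        for e, s in zip(energies, standard_energies)
    )
```

For example, with standard energies `[11.0, 4.5]`, the measured energies `[10.0, 4.0]` fall inside both intervals, while `[10.0, 1.0]` fails on the second component.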
Step S506, if the second-level standard voice signal feature matches the second-level voice signal feature, executing the preset voice photographing instruction corresponding to the first-level standard voice signal feature.
Through steps S501 to S506, the photographing method based on voice recognition provided by the present invention obtains the voice photographing instruction to be recognized; extracts the first-level voice signal feature from it; searches the preset voice feature database for a first-level standard voice signal feature matching the first-level voice signal feature; if one exists, extracts the second-level voice signal feature from the voice photographing instruction to be recognized; judges whether the second-level voice signal feature matches the second-level standard voice signal feature of the preset voice photographing instruction corresponding to the first-level standard voice signal feature; and, if it matches, executes that preset voice photographing instruction. The voice photographing instruction issued by the user is thus judged (or recognized) by two levels of features. In the first-level judgment, the short-time energy and the zero-crossing rate are selected for an initial judgment based on the energy distribution and the amplitude variation of the voice signal. In the second-level judgment, the voice signal is iteratively decomposed, according to its frequency distribution, into a series of waveforms in different frequency bands; the decomposed components contain local characteristic signals at different time scales, and the basis components are derived from the data itself. Compared with methods that extract the frequency bands of a signal by short-time Fourier transform, wavelet decomposition, and the like, the second-level judgment is therefore intuitive, direct, a posteriori, and adaptive, because the decomposition is based on the local time-scale characteristics of the signal sequence. In summary, the photographing method of this embodiment is adaptive when performing voice recognition and improves user experience.
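The two-level flow of steps S501 to S506 can be summarized in the following sketch, in which the second-level features are extracted only after a first-level match. The preset record layout, the interval representation, and the extractor callbacks are assumptions introduced for illustration.

```python
def in_interval(value, interval):
    """True when `value` lies within the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


def recognize(signal, presets, extract_first, extract_second):
    """Return the photographing mode to execute, or None if unrecognized.

    Each preset is a dict with hypothetical keys:
      "mode":   photographing mode of the preset instruction
      "first":  (energy interval, zero-crossing-rate interval)
      "second": list of intervals, one per frequency component
    """
    energy, zcr = extract_first(signal)            # step S502
    for preset in presets:                         # step S503
        e_range, z_range = preset["first"]
        if in_interval(energy, e_range) and in_interval(zcr, z_range):
            bands = extract_second(signal)         # step S504
            matched = len(bands) == len(preset["second"]) and all(
                in_interval(b, r) for b, r in zip(bands, preset["second"])
            )                                      # step S505
            return preset["mode"] if matched else None  # step S506
    return None
```

Deferring the second-level extraction means the costlier decomposition runs only for signals that already pass the inexpensive time-domain check.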
The terminal comprises a memory, a processor, and a voice-recognition-based photographing program 600 stored on the memory and executable on the processor. A module, as referred to herein, is a series of computer program instruction segments capable of performing a specific function, and is better suited than a computer program for describing the execution process of software in the terminal 100.
Fig. 6 is a functional block diagram of a photographing program 600 according to a first embodiment of the present invention. In this embodiment, the photographing program 600 may be divided into one or more modules, and the one or more modules are stored in the memory 109 of the terminal 100 and executed by one or more processors (in this embodiment, the processor 110) to complete the present invention. For example, in fig. 6, the photographing program 600 may be divided into an instruction obtaining module 601, a feature extraction module 602, a feature searching module 603, and an executing module 604. The functions of the functional modules 601-604 are described in detail below. Wherein:
The instruction obtaining module 601 is configured to obtain a voice photographing instruction to be recognized.
In this embodiment, the user issues a voice photographing instruction to be recognized, and the mobile terminal captures it. The captured voice signal contains pauses and ambient noise; in essence, it consists of a series of silent segments and voiced segments, and the information the mobile terminal needs lies in the voiced segments. The captured voice photographing instruction therefore needs to be processed. Preferably, after the voice photographing instruction to be recognized is obtained, it is preprocessed; the preprocessing includes frequency reduction and denoising, followed by endpoint detection. Specifically, the processor of the mobile terminal acquires the voice photographing instruction to be recognized and performs frequency reduction and denoising on its sound signal to enhance the signal strength and reduce the noise amplitude; it then performs endpoint detection, that is, voice activity detection on the frequency-reduced and denoised signal, to determine the starting endpoint and the ending endpoint of the voice signal.
The feature extraction module 602 is configured to extract a voice signal feature from the voice photographing instruction to be recognized.
Specifically, after the voice photographing instruction to be recognized is preprocessed, the voice signal characteristics can be obtained through various voice analysis methods.
A feature searching module 603, configured to search, in a preset speech feature database, whether a standard speech signal feature matching the speech signal feature exists; the preset voice feature database is a correlation database of preset voice photographing instructions and standard voice signal features corresponding to the preset voice photographing instructions, and the preset voice photographing instructions comprise preset photographing keywords and photographing modes corresponding to the photographing keywords.
Specifically, the preset voice feature database is searched for a feature interval of a standard voice signal feature that contains the voice signal feature, that is, an interval whose bounds the voice signal feature does not exceed. It should be noted that, unlike the prior art, the preset voice photographing instruction in this embodiment includes not only a preset photographing keyword but also a photographing mode corresponding to that keyword. The photographing mode may be a long-range photographing mode, a beauty photographing mode, a timed photographing mode, a still photographing mode, and the like. The user speaks different photographing keywords to the mobile terminal, and because different keywords correspond to different photographing modes, the photographing mode of the camera can be adjusted through the voice photographing instruction, which enhances human-computer interaction and improves user experience.
In addition, the preset voice feature database may be generated by the system or, as a further improvement of this embodiment, may be user-defined. Before the voice photographing instruction to be recognized is obtained, the preset voice feature database is constructed; the program for "constructing the preset voice feature database" comprises the following modules.
The acquisition module is configured to collect the preset voice photographing instruction.
Here, the user opens the acquisition module of the mobile terminal to perform the setup, as shown in fig. 8, and the mobile terminal collects the preset voice photographing instructions. Different photographing modes of the mobile terminal correspond to different photographing keywords, and the choice of keywords is left entirely to the user: a user who prefers English pronunciation can set all photographing keywords in English, a user accustomed to a dialect can choose dialect words as photographing keywords, and so on, so that personalized customization is achieved. For example, when setting up the beauty shooting mode, the user speaks "applet" three times in succession, and when setting up the non-beauty shooting mode, the user speaks "apple" three times in succession, and so on; once the setup is complete, the photographing keyword for the beauty shooting mode is "applet" and the photographing keyword for the non-beauty shooting mode is "apple". It should be noted that, on statistical grounds, the same preset voice photographing instruction is generally collected multiple times (three or more), so as to avoid collection errors caused by user mistakes.
The extraction module is configured to extract the standard voice signal features corresponding to the preset voice photographing instruction.
Specifically, similar to the feature extraction module 602, after the preset voice photographing instruction is preprocessed, the standard voice signal features in it can be obtained through various voice analysis methods. The preprocessing here, which includes frequency reduction and denoising as well as endpoint detection, is the same as in the instruction obtaining module 601 and is not described again. It should be noted separately that, because the same preset voice photographing instruction is collected multiple times, the extracted standard voice signal feature is a feature interval rather than an isolated point.
The database establishing module is configured to establish an association database of the preset voice photographing instruction and the standard voice signal features corresponding to the preset voice photographing instruction.
In this embodiment, specifically, the preset voice photographing instruction and the standard voice signal feature acquired in the foregoing module are collected together, and an association database in which the preset voice photographing instruction and the standard voice signal feature are mapped one by one is established. By establishing the customized associated database, on one hand, the accuracy of voice recognition can be improved, misjudgment is reduced, on the other hand, personalized customization of a user can be realized, and the interactivity of the user and the mobile terminal is enhanced.
The executing module 604 executes the preset voice photographing instruction corresponding to the standard voice signal feature if the standard voice signal feature matched with the voice signal feature exists.
It should be noted that when a standard voice signal feature matched with the voice signal feature is found in a preset voice feature database, that is, the standard voice signal feature matched with the voice signal feature exists, which is equivalent to the voice signal feature of the voice photographing instruction to be recognized, the voice signal feature is recognized, and the processor executes the preset voice photographing instruction corresponding to the standard voice signal feature, that is, performs photographing in the photographing mode of the preset voice photographing instruction, according to the one-to-one mapping between the preset voice photographing instruction and the standard voice signal feature, in the preset voice feature interval stored in the voice feature database of the standard voice signal feature.
In addition, in another case in this embodiment, if there is no standard voice signal feature matching with the voice signal feature, the voice photographing instruction to be recognized is obtained again; or prompting to update the preset voice feature database. Here, it is further explained that, when the standard voice signal feature matching with the voice signal feature is not found in the preset voice feature database, that is, the standard voice signal feature matching with the voice signal feature does not exist, the voice signal feature equivalent to the voice photographing instruction to be recognized is not preset and stored in the feature interval of the standard voice signal feature of the voice feature database, and cannot be recognized, in this case, the mobile terminal actively ends the photographing process, enters the next photographing process, and re-acquires the voice photographing instruction to be recognized. However, there is a possibility that after the user sends out a voice photographing instruction for many times, the voice photographing instruction cannot be recognized and photographing cannot be performed; then, the user may consider a possible situation that the preset voice feature database has not collected the voice photographing instruction of the user yet, and the mobile terminal may directly jump out an option of prompting to update the preset voice feature database (of course, this situation is only possible in a case where the preset voice feature database can be customized), as shown in fig. 9, the user may select to reset the voice feature database.
Compared with the prior art, the mobile terminal provided by the invention, through modules 601 to 604 of the voice recognition-based photographing program, obtains the voice photographing instruction to be recognized; extracts the voice signal features in the voice photographing instruction to be recognized; searches a preset voice feature database for a standard voice signal feature matching the voice signal features; and, if such a standard voice signal feature exists, executes the preset voice photographing instruction corresponding to it. The voice photographing instruction to be recognized is recognized based on voice recognition technology, and the preset voice photographing instruction comprises a preset photographing keyword and a photographing mode. Executing the preset voice photographing instruction once the instruction has been recognized therefore both takes the photograph and adjusts the photographing mode. This solves the problem that the voice instructions of conventional voice recognition-based photographing programs cannot adjust the photographing mode, enhances human-computer interaction, and improves user experience. It also makes photographing with the mobile terminal more convenient and provides users with a better interaction experience.
Further, based on the first embodiment described above, a second embodiment of the photographing program of the present invention is proposed. Fig. 7 is a schematic diagram of the functional modules of the photographing program according to the second embodiment of the present invention. Compared with the first embodiment, the photographing program 700 in this embodiment includes an instruction obtaining module 701, a first-level feature extraction module 702, a first-level feature searching module 703, a second-level feature extraction module 704, a determining module 705, and an executing module 706. The functional modules are described as follows:
The instruction obtaining module 701 is configured to obtain a voice photographing instruction to be recognized.
The first-level feature extraction module 702 is configured to extract the first-level voice signal features from the voice photographing instruction to be recognized.
Compared with the first embodiment, the voice signal features of the voice photographing instruction to be recognized in this embodiment include first-level voice signal features and second-level voice signal features, and the standard voice signal features include first-level standard voice signal features and second-level standard voice signal features. The first-level voice signal features comprise energy features and amplitude features of the voice photographing instruction to be recognized, and the first-level standard voice signal features comprise energy features and amplitude features of the preset voice photographing instruction. Specifically, the energy feature refers to the energy distribution of the voice signal in the voice photographing instruction, and the amplitude feature refers to the amplitude variation of that voice signal. The short-time energy and the zero-crossing rate intuitively reflect the energy distribution and the amplitude variation of the voice signal in the voice photographing instruction to be recognized and are important time-domain parameters of a voice signal, so in this embodiment they are used as the feature values of the first-level voice signal features. They are calculated as follows:
Assume that the short-time energy of the voice photographing instruction to be recognized is $E_n$. $E_n$ is calculated as follows:
Figure RE-GDA0001830639150000221
where the voice photographing instruction to be recognized is divided into frames to obtain n frames of data, and $x_n(m)$ denotes the time-domain values of the nth frame.
Assume that the zero-crossing rate of the voice photographing instruction to be recognized is $Z_n$. $Z_n$ is calculated as follows:
Figure RE-GDA0001830639150000222
where $\mathrm{sgn}[\cdot]$ is the sign function, that is:
Figure RE-GDA0001830639150000223
The first-level standard voice signal features are calculated in the same way as the first-level voice signal features and are not described again here.
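A minimal sketch of the first-level features, assuming the common textbook definitions: short-time energy as the sum of squared samples in a frame, and zero-crossing rate as half the sum of absolute sign differences between adjacent samples, with the sign function returning 1 for non-negative values and -1 otherwise. The frame length, the non-overlapping framing, and the function names are illustrative assumptions; the exact formulas of the embodiment may differ (for example in windowing or normalization).

```python
from typing import List, Sequence


def frame_signal(x: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split x into non-overlapping frames, dropping a trailing partial frame."""
    return [list(x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def sgn(value: float) -> int:
    """Sign function: 1 for non-negative values, -1 for negative values."""
    return 1 if value >= 0 else -1


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared time-domain values of one frame."""
    return sum(v * v for v in frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Half the summed absolute sign change between adjacent samples."""
    return 0.5 * sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
                     for m in range(1, len(frame)))
```

With these definitions, each sign change inside a frame contributes exactly 1 to the zero-crossing rate, so the value is a crossing count rather than a rate normalized by frame length.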
The first-level feature searching module 703 is configured to search the preset voice feature database for a first-level standard voice signal feature matching the first-level voice signal features.
The second-level feature extraction module 704 is configured to extract the second-level voice signal features from the voice photographing instruction to be recognized if a first-level standard voice signal feature matching the first-level voice signal features exists.
It should be noted that the second-level voice signal features, like the second-level standard voice signal features described below, are obtained based on the Empirical Mode Decomposition (EMD) method. The voice photographing instruction to be recognized is decomposed into a series of components in different frequency bands, and the energy values of all frequency band components are calculated as the second-level voice signal features. The voice signal in the voice photographing instruction to be recognized is decomposed as follows:
1) Let the voice instruction data sample to be decomposed be $x(t)$. Find all local maximum points and local minimum points of $x(t)$, interpolate them with a cubic spline interpolation algorithm to obtain the upper envelope and the lower envelope of $x(t)$, and take the mean of the upper and lower envelopes to obtain $m_1$. Let $h_1$ denote the difference between the signal $x(t)$ and $m_1$; then
$$h_1 = x(t) - m_1$$
2) If $h_1$ satisfies the defining conditions of a mode function, proceed to the next step; if not, replace the signal $x(t)$ with $h_1$ and repeat 1)-2), which gives:
$$h_{11} = h_1 - m_{11}$$
Then judge whether $h_{11}$ satisfies the conditions. Repeating the above operation $k$ times gives
$$h_{1k} = h_{1(k-1)} - m_{1k}$$
Sifting stops when $h_{1k}$ satisfies the mode conditions, and $h_{1k}$ is then taken as the first frequency component $\mathrm{IMF}_1$:
$$\mathrm{IMF}_1 = h_{1k}$$
3) Let $r_1$ denote the difference between the original signal and $\mathrm{IMF}_1$; then $r_1$ is expressed as:
$$r_1 = x(t) - \mathrm{IMF}_1$$
4) Replace the original signal $x(t)$ with $r_1$ and repeat steps 1)-3) to obtain $n$ frequency components $\mathrm{IMF}_n$; the remainders are then expressed as follows:
$$r_2 = r_1 - \mathrm{IMF}_2,\ \ldots,\ r_n = r_{n-1} - \mathrm{IMF}_n$$
Combining all the above steps, the voice instruction $x(t)$ is decomposed into the superposition of a series of components in different frequency bands and a residual component, that is:
$$x(t) = \sum_{i=1}^{n} \mathrm{IMF}_i + r_n$$
On this basis, the energy $E_i$ of each frequency component of the voice photographing instruction to be recognized is calculated.
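The sifting procedure of steps 1) to 4) can be sketched in plain Python as below. This is a simplified illustration rather than the embodiment's implementation, and it makes two substitutions that should be read as assumptions: the envelopes use piecewise-linear interpolation instead of cubic splines, and sifting stops after a fixed number of iterations (`sift_iters`) instead of testing the mode-function conditions. All function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def _extrema(x: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Indices of interior local maxima and minima of x."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] <= x[i + 1]]
    return maxima, minima


def _envelope(x: Sequence[float], idx: List[int]) -> List[float]:
    """Piecewise-linear envelope through x at idx, held flat at both ends."""
    env = [0.0] * len(x)
    for t in range(len(x)):
        if t <= idx[0]:
            env[t] = x[idx[0]]
        elif t >= idx[-1]:
            env[t] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for t in range(a, b + 1):
            env[t] = x[a] + (x[b] - x[a]) * (t - a) / (b - a)
    return env


def sift_once(x: Sequence[float]) -> Optional[List[float]]:
    """One sifting step h = x - m; None if x has too few extrema."""
    maxima, minima = _extrema(x)
    if len(maxima) < 2 or len(minima) < 2:
        return None
    upper = _envelope(x, maxima)
    lower = _envelope(x, minima)
    return [v - (u + l) / 2 for v, u, l in zip(x, upper, lower)]


def emd(x: Sequence[float], max_imfs: int = 4,
        sift_iters: int = 8) -> Tuple[List[List[float]], List[float]]:
    """Decompose x into components (IMFs) and a residual."""
    imfs: List[List[float]] = []
    residual = list(x)
    for _ in range(max_imfs):
        h = sift_once(residual)
        if h is None:
            break
        for _ in range(sift_iters - 1):
            nxt = sift_once(h)
            if nxt is None:
                break
            h = nxt
        imfs.append(h)
        # r_i = r_(i-1) - IMF_i
        residual = [r - c for r, c in zip(residual, h)]
    return imfs, residual


def band_energies(imfs: Sequence[Sequence[float]]) -> List[float]:
    """Energy E_i of each component, taken as the sum of squared samples."""
    return [sum(v * v for v in imf) for imf in imfs]
```

Whatever the envelope quality, the components and the residual add back up to the input signal by construction, which mirrors the superposition formula in the text.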
The determining module 705 is configured to determine whether the second-level voice signal feature matches with a second-level standard voice signal feature of the preset voice photographing instruction corresponding to the first-level standard voice signal feature.
In this embodiment, the second-level standard voice signal features are similar to the second-level voice signal features and are likewise obtained based on the Empirical Mode Decomposition (EMD) method. Following the EMD decomposition described above, the energy E of each frequency component of the preset voice photographing instruction is obtained. It is then judged whether the energy Ei of each frequency component of the voice photographing instruction to be recognized lies within the threshold interval of the energy E of the corresponding frequency component of the preset voice photographing instruction; if so, the second-level voice signal features are judged to match the second-level standard voice signal features.
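The second-level comparison can be illustrated with a short sketch. The description does not say how the threshold interval is derived, so the relative tolerance of 20% around each standard energy is purely an assumption for the example, as are the function and parameter names.

```python
from typing import Sequence


def second_level_match(energies: Sequence[float],
                       standard_energies: Sequence[float],
                       tolerance: float = 0.2) -> bool:
    """True if every component energy lies within its threshold interval.

    The interval for each component is assumed to be
    [(1 - tolerance) * E, (1 + tolerance) * E] around the standard energy E.
    """
    if len(energies) != len(standard_energies):
        return False
    return all((1 - tolerance) * e_std <= e <= (1 + tolerance) * e_std
               for e, e_std in zip(energies, standard_energies))
```

Requiring every component to fall inside its interval is the strictest reading of the text; a looser policy (for example, a majority of components) would be a different design choice.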
The executing module 706 is configured to execute the preset voice photographing instruction corresponding to the first-level standard voice signal feature if the second-level standard voice signal feature matches the second-level voice signal feature.
Compared with the prior art, the voice recognition-based photographing program provided by the invention, through modules 701 to 706, obtains the voice photographing instruction to be recognized; extracts the first-level voice signal features in the voice photographing instruction to be recognized; searches a preset voice feature database for a first-level standard voice signal feature matching the first-level voice signal features; if such a first-level standard voice signal feature exists, extracts the second-level voice signal features in the voice photographing instruction to be recognized; judges whether the second-level voice signal features match the second-level standard voice signal features of the preset voice photographing instruction corresponding to the first-level standard voice signal features; and, if they match, executes the preset voice photographing instruction corresponding to the first-level standard voice signal features. The voice photographing instruction issued by the user is thus judged (recognized) through two levels of features. In the first-level feature judgment, the short-time energy and the zero-crossing rate are selected for an initial judgment, in view of the energy distribution of the voice signal and the variation of its amplitude. In the second-level feature judgment, in view of the different frequency distributions of the voice signal, the voice signal is iteratively decomposed step by step into a series of waveforms in different frequency bands, and the components obtained each contain local characteristic signals at different time scales. Because the basis components are obtained by decomposing the data itself, the second-level feature judgment is intuitive, direct, a posteriori, and adaptive compared with methods that extract different frequency bands of a signal by short-time Fourier transform, wavelet decomposition, and the like; the adaptivity follows from the decomposition being based on the local time-scale characteristics of the signal sequence. In summary, the photographing program of this embodiment is adaptive when performing voice recognition and improves user experience.
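The overall two-level flow of modules 701 to 706 can be sketched as follows. The data layout (`PresetInstruction` with feature intervals), the lazy `extract_second_level` callback, and all names are assumptions chosen for illustration; the point of the sketch is only that the more expensive second-level features are extracted after a first-level match has been found.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Interval = Tuple[float, float]


@dataclass
class PresetInstruction:
    keyword: str                        # preset photographing keyword
    mode: str                           # photographing mode for the keyword
    energy_range: Interval              # first-level: short-time energy interval
    zcr_range: Interval                 # first-level: zero-crossing rate interval
    band_energy_ranges: List[Interval]  # second-level: per-component intervals


def _inside(value: float, interval: Interval) -> bool:
    return interval[0] <= value <= interval[1]


def recognize(first_level: Tuple[float, float],
              extract_second_level: Callable[[], Sequence[float]],
              presets: Sequence[PresetInstruction]) -> Optional[PresetInstruction]:
    """Return the preset instruction to execute, or None if nothing matches."""
    energy, zcr = first_level
    candidate = next((p for p in presets
                      if _inside(energy, p.energy_range)
                      and _inside(zcr, p.zcr_range)), None)
    if candidate is None:
        return None  # first level failed; second-level extraction never runs
    bands = extract_second_level()
    matched = (len(bands) == len(candidate.band_energy_ranges)
               and all(_inside(b, r)
                       for b, r in zip(bands, candidate.band_energy_ranges)))
    return candidate if matched else None
```

A caller would execute the returned instruction's photographing mode, and on `None` would either re-acquire the instruction or prompt for a database update.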
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process modifications made using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims (7)

1. A photographing method based on voice recognition is applied to a mobile terminal and is characterized by comprising the following steps:
acquiring a voice photographing instruction to be recognized;
extracting voice signal characteristics in the voice photographing instruction to be recognized, wherein the voice signal characteristics comprise first-level voice signal characteristics and second-level voice signal characteristics, and the standard voice signal characteristics comprise first-level standard voice signal characteristics and second-level standard voice signal characteristics;
searching whether a standard voice signal characteristic matched with the voice signal characteristic exists in a preset voice characteristic database, wherein the method comprises the following steps: extracting the first-stage voice signal characteristics in the voice photographing instruction to be recognized;
searching whether a first-stage standard voice signal characteristic matched with the first-stage voice signal characteristic exists in the preset voice characteristic database;
if the first-level standard voice signal characteristics matched with the first-level voice signal characteristics exist, extracting second-level voice signal characteristics in the voice photographing instruction to be recognized;
judging whether the second-level voice signal characteristics are matched with second-level standard voice signal characteristics of a preset photographing instruction corresponding to the first-level standard voice signal characteristics;
if the second-level standard voice signal characteristics are matched with the second-level voice signal characteristics, executing a preset voice photographing instruction corresponding to the first-level standard voice signal characteristics; the preset voice feature database is a correlation database of a preset voice photographing instruction and the standard voice signal feature corresponding to the preset voice photographing instruction, and the preset voice photographing instruction comprises a preset photographing keyword and a photographing mode corresponding to the photographing keyword;
if the standard voice signal characteristics matched with the voice signal characteristics exist, executing the preset voice photographing instruction corresponding to the standard voice signal characteristics;
if the standard voice signal characteristics matched with the voice signal characteristics do not exist, the voice photographing instruction to be recognized is obtained again; or prompting to update the preset voice feature database.
2. A photographing method as defined in claim 1, characterized in that: the first-level voice signal features comprise energy features and amplitude features of the voice photographing instruction to be recognized, and the first-level standard voice signal features comprise the energy features and the amplitude features of the preset voice photographing instruction.
3. The photographing method according to claim 1, wherein the step of searching for the presence of the standard voice signal feature matching the voice signal feature in a preset voice feature database comprises the steps of:
and searching whether a characteristic interval of the standard voice signal characteristic exists in the preset voice characteristic database so as to enable the voice signal characteristic not to exceed the characteristic interval.
4. The photographing method of claim 1, wherein before the step of obtaining the voice photographing instruction to be recognized, the step of constructing the preset voice feature database is further included; the step of building the preset voice feature database comprises the following steps:
collecting the preset voice photographing instruction;
extracting the standard voice signal characteristics corresponding to the preset voice photographing instruction;
and establishing an associated database of the preset voice photographing instruction and the standard voice signal characteristics corresponding to the preset voice photographing instruction.
5. The photographing method according to claim 1, wherein before extracting the voice signal feature in the voice photographing instruction to be recognized, the method further comprises the following steps:
and preprocessing the voice photographing instruction to be recognized.
6. A mobile terminal, characterized in that the mobile terminal comprises a memory, a processor and a photographing program based on voice recognition stored on the memory and operable on the processor, the photographing program when executed by the processor implementing the steps of the photographing method according to any one of claims 1 to 5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a voice recognition-based photographing program, which when executed by a processor implements the steps of the photographing method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810955505.1A CN109302528B (en) 2018-08-21 2018-08-21 Photographing method, mobile terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810955505.1A CN109302528B (en) 2018-08-21 2018-08-21 Photographing method, mobile terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109302528A CN109302528A (en) 2019-02-01
CN109302528B true CN109302528B (en) 2021-05-25

Family

ID=65165310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810955505.1A Active CN109302528B (en) 2018-08-21 2018-08-21 Photographing method, mobile terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109302528B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant