WO2021258797A1 - Image information input method, electronic device, and computer readable storage medium

Info

Publication number: WO2021258797A1
Application number: PCT/CN2021/083140
Authority: WO
Inventors: 唐吴全; 王斌; 张腾; 秦佳美
Original assignee: 华为技术有限公司
Priority date: 2020-06-24
Filing date: 2021-03-26
Publication date: 2021-12-30
Also published as: CN111881315A

Abstract

An image information input method, an electronic device, and a computer readable storage medium, which are applicable to the technical field of terminals. The image information input method comprises: obtaining an image to be processed (S401); performing classification processing on said image, and obtaining a first classification result (S402); selecting a corresponding classification model according to the first classification result, inputting said image into the classification model, and obtaining a second classification result outputted by the classification model (S403); and using the second classification result as the information label input for said image (S404). In the described method, complete and accurate semantic recognition can be performed on an image during visual input, thereby improving the accuracy of visual input information.

Description

Image information input method, electronic equipment and computer readable storage medium

This application claims priority to Chinese patent application No. 202010589458.0, filed with the State Intellectual Property Office on June 24, 2020 and entitled "Image Information Input Method, Electronic Equipment, and Computer Readable Storage Medium", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of terminals, and in particular to an image information input method, electronic equipment, and computer-readable storage media.

Background

Information input is an important function of electronic equipment. Whether querying information on the Internet or sending emails and messages, users need to enter relevant information on electronic devices. With the development of visual processing technology, a visual input function has gradually been added to information input methods. When using the visual input function, the user inputs an image into an electronic device, and the electronic device recognizes the semantic information contained in the image and uses the recognized semantic information as the input information. Unlike the traditional approach in which the user directly inputs text, the visual input function can automatically "guess" the semantic information that the user wants to express based on the image input by the user, which improves the convenience of information input.

However, the existing visual input function can only perform simple semantic recognition on the image input by the user. For example, when performing semantic recognition on an image containing text, only the text is segmented from the image and used as the semantic information of the image, while the parts of the image without text cannot be semantically recognized. Because the existing methods can only perform simple semantic recognition, the recognized semantic information does not cover all the semantics in the image. As a result, the existing visual input function cannot accurately express the information that the user wants to input, and the accuracy of the input information cannot be guaranteed.

Summary of the invention

This application provides an image information input method, electronic equipment, and computer storage medium, which can improve the accuracy of visual input information.

In order to achieve the above objectives, this application adopts the following technical solutions:

In a first aspect, an embodiment of the present application provides an image information input method. The method includes: acquiring an image to be processed; classifying the image to be processed to obtain a first classification result; selecting a corresponding classification model according to the first classification result, and inputting the image to be processed into the classification model to obtain a second classification result output by the classification model; and inputting the second classification result as the information label of the image to be processed.

In the above image information input method, the image to be processed is classified twice, and the result obtained after two classifications is more accurate than the result of a single classification. Furthermore, because the classification model used for the second classification is selected according to the first classification result, the second classification is equivalent to reclassifying the first classification result; that is, the second classification is finer-grained than the first classification. In other words, the second classification result is more precise than the first classification result, which improves the accuracy of semantic recognition of the image to be processed. When the second classification result is then input as the information label of the image to be processed, the accuracy of the visual input information is improved, and the method has strong ease of use and practicality.
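
As a non-limiting illustration, the two-stage classification described above may be sketched as follows. The names `coarse_classifier`, `FINE_MODELS` and `input_label`, the dictionary that stands in for an image, and the fallback to the first classification result when no model is registered are all assumptions made for this sketch; they are not taken from the application.

```python
from typing import Callable, Dict

# Hypothetical stand-in for real image data: a dict of precomputed features.
Image = Dict[str, str]
Classifier = Callable[[Image], str]


def coarse_classifier(image: Image) -> str:
    """First classification: return a coarse category label."""
    return image["coarse"]


# One fine-grained model per coarse label (toy functions, not real models).
FINE_MODELS: Dict[str, Classifier] = {
    "plant": lambda img: img["species"],
    "animal": lambda img: img["breed"],
}


def input_label(image: Image) -> str:
    """Return the information label to be input for the image."""
    first_result = coarse_classifier(image)   # first classification
    model = FINE_MODELS.get(first_result)     # select model by first result
    if model is None:
        # Assumption of this sketch: fall back to the coarse label.
        return first_result
    second_result = model(image)              # second classification
    return second_result                      # used as the information label
```

In this sketch, an image whose first classification result is "plant" is routed to the plant model only, so the second classification chooses among plant species rather than among all possible labels.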

With reference to the first aspect, in some embodiments, the first classification result includes at least one first category label. The selecting a corresponding classification model according to the first classification result, and inputting the image to be processed into the classification model to obtain the second classification result output by the classification model includes: extracting, from the image to be processed, the sub-image corresponding to each first category label in the first classification result, and obtaining the classification model corresponding to each first category label in the first classification result; inputting the sub-image corresponding to the i-th first category label into the classification model corresponding to the i-th first category label to obtain the sub-label of the i-th first category label, where i is a positive integer less than or equal to N, and N is the number of first category labels in the first classification result; and using the sub-labels of the first category labels in the first classification result as the second classification result.

In the above second classification process, each first category label in the first classification result corresponds to one classification model, and that model is used to classify the sub-image corresponding to the label. In other words, the first classification result is classified at a finer granularity, which is equivalent to dividing a large class into small classes. This improves the accuracy of semantic recognition of the image to be processed and has strong ease of use and practicability.
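
As a non-limiting illustration, the per-label second classification may be sketched as follows. The nested-list image, the bounding-box representation of a sub-image, and the names `crop` and `second_classification` are assumptions made for this sketch; the application does not specify how sub-images are represented.

```python
from typing import Callable, Dict, List, Tuple

Pixels = List[List[int]]
# Hypothetical sub-image region: (top, left, bottom, right), end-exclusive.
Box = Tuple[int, int, int, int]


def crop(image: Pixels, box: Box) -> Pixels:
    """Extract the sub-image covered by the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def second_classification(
    image: Pixels,
    first_result: List[Tuple[str, Box]],
    models: Dict[str, Callable[[Pixels], str]],
) -> List[str]:
    """Classify each sub-image with the model of its first category label."""
    sub_labels = []
    for label, box in first_result:           # i = 1 .. N
        sub_image = crop(image, box)          # sub-image of the i-th label
        sub_labels.append(models[label](sub_image))
    return sub_labels                         # the second classification result


def total(sub_image: Pixels) -> int:
    """Toy feature: sum of all pixel values."""
    return sum(sum(row) for row in sub_image)


# Toy models that decide by brightness; real models would be trained networks.
toy_models = {
    "flower": lambda s: "rose" if total(s) > 10 else "tulip",
    "dog": lambda s: "husky" if total(s) < 10 else "corgi",
}
toy_image = [[9, 9, 1, 1], [9, 9, 1, 1]]
toy_first_result = [("flower", (0, 0, 2, 2)), ("dog", (0, 2, 2, 4))]
```

With these toy inputs, `second_classification(toy_image, toy_first_result, toy_models)` yields one sub-label per first category label.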

With reference to the first aspect, in some embodiments, after the inputting the image to be processed into the classification model to obtain the second classification result output by the classification model, the method further includes: obtaining, according to the first classification result and/or the second classification result, extended information corresponding to the first classification result and/or the second classification result, where the extended information is information related to the first classification result and/or the second classification result. Correspondingly, the inputting the second classification result as the information label of the image to be processed includes: inputting the second classification result and/or the extended information as the information label of the image to be processed.

The extended information related to the first classification result and/or the second classification result is also input as the information label of the image to be processed, so that more semantic information can be "guessed" from the image to be processed. This enriches the semantic recognition results of the image to be processed and ensures their completeness, thereby improving the intelligence of the visual input method.

With reference to the first aspect, in some embodiments, the obtaining the extended information corresponding to the first classification result and/or the second classification result includes: inputting the first classification result and/or the second classification result into a preset instruction detection model to obtain an information query instruction output by the instruction detection model; and querying, according to the information query instruction, the extended information corresponding to the first classification result and/or the second classification result.

The preset instruction detection model can be used to reflect the user's information query habits. Because the extended information is queried according to the information query instruction output by the instruction detection model, the queried extended information is closer to the semantic information that the user wants to express.
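
As a non-limiting illustration, the query of extended information may be sketched as follows. The lookup table that stands in for the instruction detection model, the in-memory `KNOWLEDGE_BASE`, and all sample strings are hypothetical; the application does not disclose the form of the model or of the queried data source.

```python
from typing import Dict, List

# Hypothetical record of what the user usually asks about each label.
QUERY_HABITS: Dict[str, str] = {
    "rose": "flower language",
    "husky": "breed traits",
}

# Hypothetical data source searched by the information query instruction.
KNOWLEDGE_BASE: Dict[str, str] = {
    "rose flower language": "love",
    "husky breed traits": "energetic sled dog",
}


def instruction_detection_model(label: str) -> str:
    """Toy stand-in: map a classification label to an information query instruction."""
    return f"{label} {QUERY_HABITS.get(label, 'introduction')}"


def query_extended_info(labels: List[str]) -> Dict[str, str]:
    """Return the extended information found for each label, if any."""
    extended = {}
    for label in labels:
        instruction = instruction_detection_model(label)
        answer = KNOWLEDGE_BASE.get(instruction)
        if answer is not None:
            extended[label] = answer
    return extended
```

In this sketch, a label with no matching entry simply contributes no extended information, so the second classification result can still be input on its own.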

With reference to the first aspect, in some embodiments, the obtaining an image to be processed includes: obtaining a video to be processed, and extracting image information from the video to be processed, where the image information includes at least one frame of picture; and using at least one frame of picture in the image information as the image to be processed.

With reference to the first aspect, in some embodiments, after the obtaining the video to be processed, the method further includes: extracting audio information from the video to be processed; and performing voice recognition processing on the audio information to obtain the information label of the audio information. Correspondingly, the inputting the second classification result as the information label of the image to be processed includes: inputting the information label of the audio information and the second classification result as the information label of the image to be processed.

When the user input is a video, image information and audio information can be extracted from the video. Image recognition processing is performed on the image information to obtain the information label of the image, voice recognition processing is performed on the audio information to obtain the information label of the audio, and both labels are then input as the information label of the image to be processed. Because image information and audio information are both integrated into the final information label of the image to be processed, the diversity of the information label is increased. This enriches the visual input information, improves the intelligence of the visual input method, and has strong ease of use and practicality.
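
As a non-limiting illustration, the handling of a video input may be sketched as follows. The `Video` container, the `frame_step` sampling parameter, and the two callables passed in place of real image classification and voice recognition are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Video:
    """Toy container: frames and audio are plain strings instead of real media."""
    frames: List[str]
    audio: str


def label_video(
    video: Video,
    classify_image: Callable[[str], str],
    recognize_speech: Callable[[str], List[str]],
    frame_step: int = 1,
) -> List[str]:
    """Combine image labels and audio labels into one list of information labels."""
    frames = video.frames[::frame_step]                 # pick frames to process
    image_labels = [classify_image(f) for f in frames]  # image recognition
    audio_labels = recognize_speech(video.audio)        # voice recognition
    return image_labels + audio_labels
```

A call such as `label_video(video, classify, recognize, frame_step=2)` processes every second frame; how frames are actually selected is left open by the application.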

With reference to the first aspect, in some embodiments, the second classification result includes at least one second category label. The inputting the information label of the audio information and the second classification result as the information label of the image to be processed includes: if identical second category labels exist in the second classification result, deduplicating the second classification result; and inputting the information label of the audio information and the deduplicated second classification result as the information label of the image to be processed.

With reference to the first aspect, in some embodiments, the inputting the information label of the audio information and the deduplicated second classification result as the information label of the image to be processed includes: if a first target label exists in the deduplicated second classification result, inputting the first target label as the information label of the image to be processed, where the first target label is a second category label in the deduplicated second classification result that matches the information label of the audio information; and if no first target label exists in the deduplicated second classification result, searching the extended information of the second category labels in the second classification result until a second target label is found, and inputting the second target label as the information label of the image to be processed, where the second target label is extended information of a second category label that matches the information label of the audio information.

The identified image information labels and audio information labels may partly coincide and partly differ, and the coinciding content is often the semantic information that the user wants to express. Therefore, the part of the second classification result that matches the information label of the audio information is input as the information label of the image to be processed. That is, the same or similar semantic information contained in the image and the audio is extracted, which is equivalent to screening the recognized semantic information. As a result, the information label of the image to be processed is closer to the semantic information that the user wants to express, and the accuracy of the visual input information can be improved.
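
As a non-limiting illustration, the deduplication and the matching against the audio labels may be sketched as follows. Treating "matches" as exact string equality, returning `None` when nothing matches, and the names `dedupe` and `choose_label` are assumptions of this sketch; the application does not define the matching criterion.

```python
from typing import Dict, List, Optional


def dedupe(labels: List[str]) -> List[str]:
    """Remove repeated labels while keeping their first-seen order."""
    return list(dict.fromkeys(labels))


def choose_label(
    second_result: List[str],
    audio_labels: List[str],
    extended_info: Dict[str, List[str]],
) -> Optional[str]:
    """Pick the label shared by image and audio, directly or via extended information."""
    audio = set(audio_labels)
    unique = dedupe(second_result)

    # First target label: a second category label that matches an audio label.
    for label in unique:
        if label in audio:
            return label

    # Second target label: extended information that matches an audio label.
    for label in unique:
        for info in extended_info.get(label, []):
            if info in audio:
                return info

    return None  # assumption: no common semantics were found
```

For example, if the image yields "husky" and the audio yields "sled dog", the sketch only finds a match once the extended information of "husky" is searched.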

In other embodiments, the first target label or the second target label may also be displayed to the user as the first-preferred label among the information labels of the image to be processed. For example, the information labels of the image to be processed are displayed to the user as a sequence, and the first-preferred label is located at the beginning of the sequence. As another example, the font color of the first-preferred label is made different from the font color of the other labels (for example, the first-preferred label is red and the other labels are black), so that the user can quickly notice the first-preferred label among the information labels of the image to be processed.
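
As a non-limiting illustration, placing the first-preferred label at the beginning of the displayed sequence may be sketched as follows. The function name `order_for_display` and the choice to leave the order unchanged when the preferred label is absent are assumptions of this sketch.

```python
from typing import List, Optional


def order_for_display(labels: List[str], preferred: Optional[str]) -> List[str]:
    """Return the labels with the first-preferred label moved to the front."""
    if preferred is None or preferred not in labels:
        return list(labels)  # nothing to promote; keep the original order
    return [preferred] + [label for label in labels if label != preferred]
```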

In a second aspect, an embodiment of the present application provides an image information input device. The device includes: an acquisition unit, configured to acquire an image to be processed; a first classification unit, configured to classify the image to be processed to obtain a first classification result; a second classification unit, configured to select a corresponding classification model according to the first classification result, and input the image to be processed into the classification model to obtain a second classification result output by the classification model; and an information input unit, configured to input the second classification result as the information label of the image to be processed.

In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes a processor, and the processor is configured to run a computer program stored in a memory, so as to implement the method provided in any one of the possible implementation manners of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium including computer instructions. When the computer instructions run on a computer or processor, the computer or processor is caused to execute the method provided in any one of the possible implementation manners of the first aspect.

In the fifth aspect, the embodiments of the present application provide a computer program product. When the computer program product runs on a computer or a processor, the computer or the processor executes the method provided in any one of the possible implementation manners of the first aspect.

It is understandable that the electronic device described in the third aspect, the computer storage medium described in the fourth aspect, or the computer program product described in the fifth aspect provided above are all used to execute the method provided in the first aspect. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method, which will not be repeated here.

Description of the drawings

FIG. 1 is a schematic structural diagram of an electronic device 100 provided by an embodiment of the present application;

FIG. 2 is a block diagram of the software structure of the electronic device 100 provided by an embodiment of the present application;

FIGS. 3(a) to 3(f) are schematic diagrams of application interfaces provided by embodiments of the present application;

FIG. 4 is a schematic flowchart of an image information input method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of a to-be-processed image provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of segmentation of a to-be-processed image provided by an embodiment of the present application;

FIGS. 7(a) and 7(b) are schematic diagrams of interaction between a user and an electronic device provided by an embodiment of the present application;

FIG. 8 is a schematic flowchart of an image information input method provided by another embodiment of the present application;

FIG. 9 is a structural block diagram of an image information input device provided by an embodiment of the present application.

Detailed description

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are proposed for a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted to avoid unnecessary details from obstructing the description of this application.

It should be understood that, when used in the specification and appended claims of this application, the term "comprising" indicates the existence of the described features, wholes, steps, operations, elements and/or components, but does not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that in the embodiments of the present application, "at least one" refers to one or more than one.

It should also be understood that the term "and/or" used in the specification and appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in the description of this application and the appended claims, the term "if" can be construed as "when" or "once" or "in response to" depending on the context.

In addition, in the description of the specification of this application and the appended claims, the terms "first", "second", etc. are only used to distinguish the description, and cannot be understood as indicating or implying relative importance.

The reference to "one embodiment" or "some embodiments" in the specification of this application means that one or more embodiments of this application include a specific feature, structure, or characteristic described in combination with that embodiment. Therefore, the phrases "in one embodiment", "in some embodiments", "in other embodiments", "in some other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variations all mean "including but not limited to", unless otherwise specifically emphasized.

The steps involved in the image information input method provided in the embodiments of this application are only examples. Not all steps are mandatory, and not all information or content in a message is mandatory; steps and content may be added or removed as needed in use.

The same step, or steps or messages with the same function, in the embodiments of the present application may be cross-referenced among different embodiments.

The business scenarios described in the embodiments of this application are intended to illustrate the technical solutions of the embodiments more clearly, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. Those of ordinary skill in the art will know that, as the network architecture evolves and new business scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.

In order to illustrate the technical solution described in the present application, specific embodiments are used for description below.

First, the electronic equipment involved in the embodiments of this application is introduced. The electronic equipment may be a terminal device such as a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA). The embodiments of this application do not impose any restrictions on the specific type of the terminal device.

Please refer to FIG. 1. FIG. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It can be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated in one or more processors. Exemplarily, the processor is configured to execute the image information input method provided in the embodiments of the present application. For example, the processor executes the following steps S401-S404 or steps S901-S906.

The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.

A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can fetch them directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 110, and improves the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.

The I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may include multiple sets of I2C buses. The processor 110 may couple the touch sensor 180K, charger, flash, camera 193, etc., respectively through different I2C bus interfaces. For example, the processor 110 may couple the touch sensor 180K through an I2C interface, so that the processor 110 and the touch sensor 180K communicate through the I2C bus interface to implement the touch function of the electronic device 100.

The I2S interface can be used for audio communication. In some embodiments, the processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit audio signals to the wireless communication module 160 through an I2S interface, so as to realize the function of answering calls through a Bluetooth headset.

The PCM interface can also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
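
As a non-limiting illustration, the sampling, quantizing and encoding of an analog signal mentioned above may be sketched as follows. The 16-bit little-endian sample format, the default sample rate, and the function name `pcm_encode` are assumptions made for this sketch; the application does not specify the PCM format used by the electronic device 100.

```python
import struct
from typing import Callable


def pcm_encode(
    signal: Callable[[float], float],
    num_samples: int,
    sample_rate: int = 8000,
) -> bytes:
    """Sample a signal in [-1.0, 1.0], quantize to 16 bits, and pack as bytes."""
    max_code = 2 ** 15 - 1                      # largest signed 16-bit value
    codes = []
    for k in range(num_samples):
        x = signal(k / sample_rate)             # sampling at t = k / sample_rate
        x = max(-1.0, min(1.0, x))              # clip out-of-range amplitudes
        codes.append(round(x * max_code))       # quantizing to an integer code
    # Encoding: signed 16-bit integers, little-endian (assumed format).
    return struct.pack(f"<{num_samples}h", *codes)
```

Each sample occupies two bytes in this sketch, so the encoded length is twice the number of samples.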

The UART interface is a universal serial data bus used for asynchronous communication. The bus can be a two-way communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, the UART interface is generally used to connect the processor 110 and the wireless communication module 160. For example, the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to realize the Bluetooth function. In some embodiments, the audio module 170 may transmit audio signals to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a Bluetooth headset.
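
As a non-limiting illustration, the conversion between parallel data and serial data may be sketched as follows. The frame layout used here (one start bit, eight data bits sent least-significant bit first, one stop bit, no parity) is a common UART configuration chosen as an assumption; the application does not state which configuration the electronic device 100 uses.

```python
from typing import List


def uart_frame(byte: int) -> List[int]:
    """Parallel to serial: turn one byte into a 10-bit frame."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # least-significant bit first
    return [0] + data_bits + [1]                     # start bit, data, stop bit


def uart_unframe(bits: List[int]) -> int:
    """Serial to parallel: recover the byte from a 10-bit frame."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("invalid frame")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))
```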

The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 and the camera 193 communicate through a CSI interface to implement the shooting function of the electronic device 100. The processor 110 and the display screen 194 communicate through a DSI interface to implement the display function of the electronic device 100.

The GPIO interface can be configured through software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface can be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and so on. The GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.

The USB interface 130 is an interface that complies with the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type C interface, and so on. The USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transfer data between the electronic device 100 and peripheral devices. It can also be used to connect earphones and play audio through earphones. The interface can also be used to connect other electronic devices, such as AR devices.

It can be understood that the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt interface connection modes different from those in the foregoing embodiments, or a combination of multiple interface connection modes.

The charging management module 140 is used to receive charging input from the charger. Among them, the charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of the wired charger through the USB interface 130. In some embodiments of wireless charging, the charging management module 140 may receive the wireless charging input through the wireless charging coil of the electronic device 100. While the charging management module 140 charges the battery 142, it can also supply power to the electronic device through the power management module 141.

The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.

The wireless communication function of the electronic device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.

The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna can be used in combination with a tuning switch.

The mobile communication module 150 may provide wireless communication solutions applied to the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves that are radiated through the antenna 1. In some embodiments, at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.

The modem processor may include a modulator and a demodulator. Among them, the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. The low-frequency baseband signal is processed by the baseband processor and then passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays an image or video through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110 and be provided in the same device as the mobile communication module 150 or other functional modules.

The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves that are radiated through the antenna 2.

In some embodiments, the antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).

The electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations and is used for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, quantum dot light-emitting diodes (QLED), etc. In some embodiments, the electronic device 100 may include one or N display screens 194, and N is a positive integer greater than one.

The electronic device 100 can realize a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.

The ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye. ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or videos. The object generates an optical image through the lens and is projected to the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal. ISP outputs digital image signals to DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the electronic device 100 may include one or N cameras 193, and N is a positive integer greater than one. Exemplarily, the camera is used to obtain the image to be processed in the image information input method provided in the embodiment of the present application, or the image in the video to be processed.

Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
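The frequency point example can be illustrated with a small sketch. It computes the energy of a single discrete Fourier transform bin for a block of samples. The function name `bin_energy` and the choice of a plain single-bin DFT are assumptions made for illustration; the application does not state how the digital signal processor performs the transform.

```python
import cmath
import math


def bin_energy(samples, k):
    """Return the energy |X[k]|^2 of DFT bin k for a list of real samples.

    Hypothetical helper: evaluates one bin of the discrete Fourier
    transform directly instead of computing the full spectrum.
    """
    n = len(samples)
    acc = 0j
    for i, s in enumerate(samples):
        # Correlate the signal with a complex exponential at bin k.
        acc += s * cmath.exp(-2j * math.pi * k * i / n)
    return abs(acc) ** 2
```

For a block of 32 samples that repeats the pattern 1, 0, -1, 0, the energy concentrates in bin 8, while bin 0 (the constant component) is empty.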

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.

NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transfer mode between human brain neurons, it can quickly process input information, and it can also continuously self-learn. The NPU can realize applications such as intelligent cognition of the electronic device 100, such as image recognition, face recognition, voice recognition, text understanding, and so on.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.

The internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions. The processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function). The data storage area can store data (such as audio data and a phone book) created during the use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.

The electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.

The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.

The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can be used to listen to music or to a hands-free call through the speaker 170A.

The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a call or plays a voice message, the user can listen to the voice by bringing the receiver 170B close to the ear.

The microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement noise reduction in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording. Exemplarily, the microphone may be used to collect the audio of the video to be processed in the image information input method provided in the embodiment of the present application.

The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors and so on. The capacitive pressure sensor may include at least two parallel plates with conductive materials. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch position but have different touch operation intensities can correspond to different operation instructions. For example: when a touch operation whose intensity is less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
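The short message example above amounts to a threshold rule, which can be sketched as follows. This is a minimal illustration rather than the device's implementation: the function name `select_instruction`, the threshold value, and the instruction strings are hypothetical.

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # hypothetical value in arbitrary normalized units


def select_instruction(intensity, threshold=FIRST_PRESSURE_THRESHOLD):
    """Map the touch intensity on the short message icon to an instruction.

    Below the first pressure threshold the message is viewed; at or
    above it a new message is created, as in the example in the text.
    """
    if intensity < threshold:
        return "view_short_message"
    return "create_short_message"
```

A light press such as `select_instruction(0.2)` selects viewing, while a press exactly at the threshold already selects creation, because the text uses "greater than or equal to" for the second case.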

The gyro sensor 180B may be used to determine the movement posture of the electronic device 100. In some embodiments, the angular velocity of the electronic device 100 around three axes (i.e., the x, y, and z axes) can be determined by the gyro sensor 180B. The gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shake of the electronic device 100 through reverse movement to achieve anti-shake. The gyro sensor 180B can also be used for navigation and somatosensory game scenes.
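The application does not give the formula used to turn the shake angle into a compensation distance. As a rough sketch under a simple pinhole-camera assumption, a tilt of angle θ shifts the image by about f·tan(θ) for focal length f, and the lens moves by the same amount in the opposite direction. The function name and the model itself are assumptions.

```python
import math


def lens_compensation_mm(shake_angle_deg, focal_length_mm):
    """Return the lens shift (mm) that offsets a shake of the given angle.

    Assumption: pinhole model, where a tilt of theta moves the image by
    focal_length * tan(theta). The sign is negated because the lens
    counteracts the shake through reverse movement.
    """
    image_shift = focal_length_mm * math.tan(math.radians(shake_angle_deg))
    return -image_shift
```

With this model, no shake yields no compensation, and a positive shake angle yields a negative (reverse) lens movement.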

The air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device 100 calculates the altitude based on the air pressure value measured by the air pressure sensor 180C to assist positioning and navigation.
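One common way to estimate altitude from air pressure is the international barometric formula, sketched below. The application does not say which formula the electronic device 100 uses, so the formula, the sea-level reference pressure, and the function name are assumptions.

```python
SEA_LEVEL_PRESSURE_HPA = 1013.25  # standard-atmosphere reference, assumed


def altitude_m(pressure_hpa, sea_level_hpa=SEA_LEVEL_PRESSURE_HPA):
    """Estimate altitude in meters from measured air pressure in hPa.

    Uses the widely quoted approximation
    h = 44330 * (1 - (p / p0) ** (1 / 5.255)).
    """
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))
```

Lower measured pressure maps to higher altitude; a reading equal to the reference pressure maps to zero.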

The magnetic sensor 180D includes a Hall sensor. The electronic device 100 can use the magnetic sensor 180D to detect the opening and closing of a flip leather case. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 can detect the opening and closing of the flip cover according to the magnetic sensor 180D. Then, according to the detected opening and closing state of the leather case or of the flip cover, features such as automatic unlocking upon flip opening are set.

The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and be used in applications such as horizontal and vertical screen switching, pedometers and so on.
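Horizontal and vertical screen switching can be sketched by comparing the gravity components measured while the device is stationary. The axis convention (x along the short edge of the screen, y along the long edge) and the function name are assumptions; a real implementation would typically also filter the readings and add hysteresis.

```python
def screen_orientation(ax, ay):
    """Pick an orientation from gravity components along the screen axes.

    Assumed convention: x runs along the short edge and y along the
    long edge, so gravity mostly along y means the device is upright.
    """
    return "landscape" if abs(ax) > abs(ay) else "portrait"
```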

The distance sensor 180F is used to measure distance. The electronic device 100 can measure the distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device 100 may use the distance sensor 180F to measure the distance to achieve fast focusing.

The proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light to the outside through the light emitting diode. The electronic device 100 uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the electronic device 100 can determine that there is an object nearby. When insufficient reflected light is detected, the electronic device 100 can determine that there is no object nearby. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear to talk, so as to automatically turn off the screen to save power. The proximity light sensor 180G can also be used in leather case mode and pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense the brightness of the ambient light. The electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in the pocket to prevent accidental touch.

The fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-triggered call answering, and so on.

The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the electronic device 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is lower than another threshold, the electronic device 100 heats the battery 142 to avoid abnormal shutdown of the electronic device 100 due to low temperature. In some other embodiments, when the temperature is lower than another threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
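The temperature processing strategy can be sketched as a set of threshold checks. The text only speaks of "a threshold" and "another threshold", so the concrete values, their ordering, and the action names below are hypothetical. Combining the three embodiments in one function is also a simplification.

```python
def thermal_actions(temp_c, high=45.0, low=0.0, very_low=-10.0):
    """Return the actions suggested by the reported temperature (Celsius).

    All thresholds are illustrative assumptions, not values from the
    application.
    """
    actions = []
    if temp_c > high:
        # Thermal protection: throttle the processor near the sensor.
        actions.append("reduce_processor_performance")
    if temp_c < low:
        # Avoid abnormal shutdown by warming the battery 142.
        actions.append("heat_battery")
    if temp_c < very_low:
        # Avoid abnormal shutdown by boosting the battery output voltage.
        actions.append("boost_battery_output_voltage")
    return actions
```

At a normal temperature the list is empty, which corresponds to taking no special action.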

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be provided on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen, which is also called a "touch screen". The touch sensor 180K is used to detect touch operations acting on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. The visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a position different from that of the display screen 194.

The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone vibrated by the human vocal part. The bone conduction sensor 180M can also contact the human pulse and receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be provided in an earphone to form a bone conduction earphone. The audio module 170 can parse out the voice signal based on the vibration signal of the bone vibrated by the vocal part, as obtained by the bone conduction sensor 180M, to realize the voice function. The application processor may parse heart rate information based on the blood pressure pulsation signal obtained by the bone conduction sensor 180M, to realize the heart rate detection function.

The button 190 includes a power-on button, a volume button, and so on. The button 190 may be a mechanical button. It can also be a touch button. The electronic device 100 may receive key input, and generate key signal input related to user settings and function control of the electronic device 100.

The motor 191 can generate vibration prompts. The motor 191 can be used for incoming call vibration notification, and can also be used for touch vibration feedback. For example, touch operations for different applications (such as taking photos, audio playback, etc.) can correspond to different vibration feedback effects. For touch operations acting on different areas of the display screen 194, the motor 191 can also produce different vibration feedback effects. Different application scenarios (for example: time reminding, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.

The indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.

The SIM card interface 195 is used to connect to the SIM card. The SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact with and separation from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, and N is a positive integer greater than 1. The SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the multiple cards can be the same or different. The SIM card interface 195 can also be compatible with different types of SIM cards. The SIM card interface 195 can also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as call and data communication. In some embodiments, the electronic device 100 adopts an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.

FIG. 2 is a block diagram of the software structure of the electronic device 100 provided by an embodiment of the present application.

The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the system library and Android runtime, and the kernel layer.

The application layer can include a series of application packages.

As shown in Figure 2, the application package can include applications such as camera, gallery, calendar, call, map, navigation, input method, Bluetooth, music, video, short message, etc.

The application framework layer provides an application programming interface (API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.

As shown in Figure 2, the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.

The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.

The content provider is used to store and retrieve data and make these data accessible to applications. The data may include video, image, audio, phone calls made and received, browsing history and bookmarks, phone book, etc.

The view system includes visual controls, such as controls that display text, controls that display pictures, and so on. The view system can be used to build applications. The display interface can be composed of one or more views. For example, a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.

The phone manager is used to provide the communication function of the electronic device 100. For example, the management of the call status (including connecting, hanging up, etc.).

The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.

The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify the user of download completion, message reminders, and so on. A notification may also appear in the status bar at the top of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt sound is played, the electronic device vibrates, or the indicator light flashes.

Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.

The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.

The application layer and the application framework layer run in a virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.

The system library can include multiple functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (for example: OpenGL ES), a 2D graphics engine (for example: SGL), etc.

The surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.

The 3D graphics processing library is used to realize 3D graphics drawing, image rendering, synthesis, and layer processing.

The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.

The following embodiment of the image information input method can be implemented on a mobile phone with the above hardware structure/software structure.

The visual input involved in the embodiments of the present application will be introduced below. Visual input can recognize the semantic information contained in the visual information from the visual information input by the user. The identified semantic information can be used for information input. Visual information can include static images (such as pictures) or dynamic images (such as videos, etc.).

The visual input can be a separate application or a function in an application.

When the visual input is a separate application, the visual input can be a system program that comes with the electronic device or a third-party application installed on the electronic device. When the user needs to use the visual input function for information input, the user first needs to select the visual input application as a tool for information input.

When the visual input is a function in an application, the interface of the application may include preset buttons. The preset button is used to activate the visual input function.

The following takes visual input as a function in an application as an example to introduce an application scenario of visual input. In this application scenario, the application is an input method application. Figures 3(a) to 3(f) are schematic diagrams of an application interface provided by an embodiment of the present application. FIG. 3(a) shows an information input interface 10 of the electronic device 100. The information input interface 10 includes an information input box 101, a virtual keyboard control 102 and an image input control 103. Specifically:

The information input box 101 is used to display input information. The information input box can be a search box, a short message sending box, a query box, or another area where information needs to be input. In this application scenario, a search box is taken as an example of the information input box.

The virtual keyboard control 102 is used for the user to input information into the information input box.

The image input control 103 is used to activate the visual input function. As shown in Figure 3(a), the image input control can be set on the virtual keyboard control. In another application scenario, the image input control can also be set separately from the virtual keyboard control. For example, the image input control is set on the right, left, or above the virtual keyboard control.

The user can enter the information input interface 10 through user operations. The process is described below in conjunction with the workflow of the software and hardware of the electronic device 100. Exemplarily, after the user clicks any position in the information input box, the touch sensor 180K receives the touch operation (that is, the click operation), and a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into an original input event (including the touch coordinates, the time stamp of the touch operation, etc.). The original input event is stored in the kernel layer. The application framework layer obtains the original input event from the kernel layer, recognizes that the application corresponding to the input event is the input method application, and then calls the interface of the application framework layer to start the input method application. The display driver is then started by calling the kernel layer to display the virtual keyboard control and the image input control to the user, and the sensor driver is started by calling the kernel layer to obtain, through the sensors corresponding to the virtual keyboard control and the image input control, the user's touch operations on these controls. The information input interface 10 is then displayed on the electronic device 100.
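The flow from hardware interrupt to original input event to the application framework layer can be modeled with a toy sketch. It is not Android code: the class names `KernelLayer` and `FrameworkLayer`, the region table, and the hit-test logic are hypothetical stand-ins for the layers described above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RawInputEvent:
    """Original input event: touch coordinates plus a time stamp."""
    x: int
    y: int
    timestamp_ms: int


class KernelLayer:
    """Turns touch interrupts into stored original input events."""

    def __init__(self):
        self._events = deque()

    def on_hardware_interrupt(self, x, y, timestamp_ms):
        self._events.append(RawInputEvent(x, y, timestamp_ms))

    def next_event(self):
        return self._events.popleft() if self._events else None


class FrameworkLayer:
    """Fetches events from the kernel layer and finds the touched region."""

    def __init__(self, kernel, regions):
        self.kernel = kernel
        self.regions = regions  # name -> (left, top, right, bottom)

    def dispatch(self):
        event = self.kernel.next_event()
        if event is None:
            return None
        for name, (left, top, right, bottom) in self.regions.items():
            if left <= event.x < right and top <= event.y < bottom:
                return name  # here the matching application would be started
        return None
```

In this model, a click inside the rectangle registered for the information input box is what would trigger the start of the input method application.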

The user can enter the mode selection interface 20 through user operations on the information input interface 10. Specifically, the user operation may be a touch operation (such as a click operation) on the image input control 103 detected on the information input interface 10. As shown in FIGS. 3(a) and 3(b), in response to a user's click operation on the image input control 103, the electronic device 100 displays a mode selection interface 20. The mode selection interface 20 may be a virtual keyboard control 102, and the virtual keyboard control 102 includes a photo album application control 201 and a camera application control 202. Specifically:

The photo album application control 201 is used to start the photo album application. The name of the photo album application control can be "album", "gallery" or "photo", etc. As shown in Figure 3(b), in this application scenario, the name of the photo album application control is "Gallery". When it is detected that the user clicks on the photo album application control 201, the electronic device 100 may display the gallery interface 30.

The camera application control 202 is used to start a camera application. The name of the camera application control can be "camera" or "photograph", etc. As shown in Figure 3(b), in this application scenario, the name of the camera application control is "camera". When it is detected that the user clicks on the camera application control 202, the electronic device 100 displays the shooting interface 40.

The gallery interface 30 shown in FIG. 3(c) may include a return button 301, a confirmation button 302, and multiple pictures 303. The user can select the picture to be input from the multiple pictures through user operations. Exemplarily, when the user's click operation on the return button 301 on the gallery interface 30 is detected, the electronic device 100 displays the mode selection interface 20. When the user's click operation on one or more pictures on the gallery interface 30 is detected, the one or more pictures corresponding to the click operation are displayed as selected (for example, the color of the selected picture becomes darker, or a mark is added to the selected picture). After the user's click operation on the confirmation button 302 on the gallery interface 30 is detected, the electronic device 100 can display the information label interface 50.

The shooting interface 40 shown in FIG. 3(d) may include a shooting frame 401, a shooting button 402, and a return button 403. The user can obtain a photographed picture through user operations. Exemplarily, when the user's click operation on the shooting button 402 on the shooting interface 40 is detected, the camera on the electronic device 100 captures the image contained in the shooting frame 401, and after the photographed picture is obtained, the shooting interface 40 may display a confirmation button. When the user's click operation on the confirmation button on the shooting interface 40 is detected, the electronic device 100 may display the information label interface 50. When the user's click operation on the return button 403 on the shooting interface 40 is detected, the electronic device 100 displays the mode selection interface 20.

The information label interface 50 shown in FIG. 3(e) may include an information input box 101 and a virtual keyboard control 102, wherein the virtual keyboard control 102 includes a plurality of information labels 501. An information label is semantic information recognized from a picture selected by the user or a picture photographed by the user. The user can select at least one information label from the plurality of information labels as input information. When the user's click operation on an information label on the information label interface 50 is detected, the electronic device 100 may display the input result interface 60.

The input result interface 60 shown in FIG. 3(f) may include an information input box 101 and input information 601, where the input information 601 is the information label selected by the user. When the user selects multiple information labels, the selected information labels can be combined into the input information according to the user's selection order.

In the above visual input application scenario, when the user selects a picture on the gallery interface shown in Figure 3(c) and clicks the confirmation button, or when the user clicks the confirmation button on the shooting interface shown in Figure 3(d), the electronic device 100 may use the image information input method provided in the embodiments of the present application to obtain the information label of the image to be processed.

In the application scenario of the above-mentioned visual input, the gallery may include pictures and videos, and the user can select the pictures or videos in the gallery. Users can also use the camera to capture pictures or video. Among them, the pictures and captured pictures in the gallery are static images, and the videos and captured videos in the gallery are dynamic images. The image information input method provided in the embodiments of the present application will be introduced below for static images and dynamic images respectively.

First, taking the visual information as a static image as an example, the image information input method provided in the embodiment of the present application is introduced. Please refer to FIG. 4, which is a schematic flowchart of an image information input method provided by an embodiment of the present application. As shown in FIG. 4, as an example and not a limitation, the image information input method may include the following steps:

S401: Acquire an image to be processed.

In the embodiment of the present application, the electronic device 100 may directly obtain a picture input by the user (ie, a picture selected by the user from a gallery or a photographed picture obtained through a camera application), and record the picture input by the user as an image to be processed. The electronic device 100 may also obtain image data from a picture input by the user, and use the image data as an image to be processed. For example, the electronic device obtains the pixel value information of the picture, converts the pixel value information into a bitmap, and records the bitmap as an image to be processed.

In an application scenario, the image to be processed can also be a picture directly input by the user. For example, after the input method application is started, an information input interface 10 is displayed on the electronic device 100, and the information input interface 10 may include an image input box. Users can copy pictures from web pages/chat messages, and then paste the copied pictures into the image input box. After detecting the input event in the image input box, the electronic device 100 obtains the picture input in the image input box, and records the picture as an image to be processed.

Optionally, after the image to be processed is acquired, preprocessing may be performed on the image to be processed, which specifically includes cropping the image to be processed, for example, cropping it into a 200×200 image. After the cropped image is obtained, the cropped image can be recorded as the image to be processed. Cropping unifies the size of the images to be processed, which facilitates subsequent image processing.
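As an illustration only, the cropping step can be sketched as follows. The application does not specify which region is kept, so the center crop used here, the nested-list image representation, and the function name `center_crop` are assumptions made for this sketch.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (assumed representation)


def center_crop(image: Image, size: int = 200) -> Image:
    """Return a size x size window taken from the centre of the image."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# A dummy 300-row by 400-column picture whose pixel value encodes its position.
picture = [[r * 400 + c for c in range(400)] for r in range(300)]
to_be_processed = center_crop(picture)  # recorded as the image to be processed
```

After this step every image passed to S402 has the same size, whatever the size of the picture the user supplied.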

S402: Perform classification processing on the image to be processed to obtain a first classification result.

In this embodiment of the present application, the first classification result includes at least one first category label.

The category label can be used to represent the semantic information contained in the image to be processed. For example: refer to FIG. 5, which is a schematic diagram of an image to be processed provided in an embodiment of the present application. The image to be processed as shown in FIG. 5 contains animals and cars. Correspondingly, the first classification result of the image to be processed includes two first category labels, namely "animal" and "car".

Optionally, one way of classifying the image to be processed may be: obtaining a pre-trained classifier; inputting the image to be processed into the classifier for classification, and obtaining at least one first category label output by the classifier.

In the process of training the classifier, a large number of sample images can be obtained, each sample image can be manually annotated with its semantic information (that is, its first category label), and the annotated sample images can be input into the classifier for training. Part of the annotated sample images are then input into the classifier for testing. When the classification accuracy of the classifier reaches a preset accuracy, the training is completed.

Among them, the construction method of the classifier can be a statistical method, a machine learning method or a neural network method. Since the neural network has the advantages of fast calculation speed and high accuracy of results, it is preferable that the classifier is a neural network.

Optionally, the first classification result further includes a probability value corresponding to each first category label. The greater the probability value, the more likely it is that the first category label corresponding to this probability value represents the semantic information contained in the image to be processed. Therefore, the first classification result can be preliminarily screened based on the probability values. Specifically, the first category labels whose probability values are less than a preset value are deleted from the first classification result, and only the first category labels whose probability values are greater than or equal to the preset value are retained. This is equivalent to excluding less likely semantic information.
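The screening described above can be sketched as follows. The dictionary representation of the first classification result, the example probability values, and the preset value of 0.5 are assumptions chosen for illustration.

```python
from typing import Dict


def screen_labels(first_result: Dict[str, float], preset_value: float) -> Dict[str, float]:
    """Keep only the labels whose probability is >= the preset value."""
    return {label: p for label, p in first_result.items() if p >= preset_value}


# Hypothetical classifier output: first category label -> probability value.
first_classification = {"animal": 0.92, "car": 0.81, "plant": 0.07}
kept = screen_labels(first_classification, preset_value=0.5)
```

With these example values, "plant" is removed and only "animal" and "car" are passed on to S403.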

Using the trained classifier to classify the image to be processed can improve the efficiency of the classification process. And since the classification accuracy of the trained classifier is high, the accuracy of the first classification result obtained by using the classifier is also high.

In the embodiment of the present application, the foregoing process of classifying the image to be processed is actually a process of coarsely classifying the image to be processed. Coarse classification is defined relative to fine classification. In the field of image processing, the higher the degree of refinement of image classification, the smaller the granularity; conversely, the lower the degree of refinement of image classification, the greater the granularity. It can be seen that the granularity of coarse classification is larger than that of fine classification. In other words, the degree of refinement of the classification result obtained by coarse classification is lower than that of the classification result obtained by fine classification. For example, the image to be processed in Figure 5 is coarsely classified as "animal" and "car", and finely classified as "dog" and "car". The "car" category includes "cars", and the "animal" category includes "dogs".

The first classification result obtained by classifying the image to be processed reflects a broad range of semantic information contained in the image to be processed. However, such broad semantic information often cannot reflect what the user wants to express. In order to bring the identified semantic information closer to the content that the user wants to express, classification can be performed again on the basis of the first classification result, that is, fine-grained classification can be performed. The specific steps are as follows.

S403: Select a corresponding classification model according to the first classification result, input the to-be-processed image into the classification model, and obtain a second classification result output by the classification model.

This step is equivalent to fine-grained classification on the basis of the first classification result. Exemplarily, assuming that the first classification result is a person, the corresponding second classification result may be gender, name, etc. Assuming that the first classification result is a two-dimensional code, the corresponding second classification result may be text information, picture information, or network address information corresponding to the two-dimensional code. Assuming that the first classification result is a plant, the corresponding second classification result may be the name of the plant, the type of the plant, and so on.

One way of selecting the corresponding classification model according to the first classification result is to select the classification model corresponding to each first category label in the first classification result. For example, continuing the example in Fig. 5, the first classification result obtained from the image to be processed in Fig. 5 includes two first category labels, namely "animal" and "car". The animal classification model corresponding to the first category label "animal" is acquired, and the vehicle classification model corresponding to the first category label "car" is acquired.

The classification model corresponding to each first category label may be pre-trained. In this way, when recognizing the image to be processed, the recognition time can be saved and the recognition accuracy can be ensured.

In addition, each first category label may correspond to at least one classification model, and different classification models output different classification results. Therefore, the second classification result may include at least one second category label. The category range of the second category label is smaller than the category range of the first category label.

When a first category label corresponds to only one classification model, the classification model may be a multi-label classification model, and its output result may include multiple category labels. Exemplarily, the classification model corresponding to the first category label "car" of the image to be processed shown in FIG. 5 may be a vehicle model, which can identify the type of the car, the brand of the car, and other information. For example, the second classification result obtained by using the vehicle model includes two second category labels, namely "car" and "brand A".

When one first category label corresponds to multiple classification models, each classification model can be a single-label classification model, that is, each classification model outputs only one category label. Exemplarily, the first category label "car" of the image to be processed shown in FIG. 5 can correspond to a brand model (which can identify the brand information of the car) and a vehicle type model (which can identify the type information of the car). For example, the second category label obtained by using the brand model is "brand A", and the second category label obtained by using the vehicle type model is "car". Therefore, the second classification result includes two second category labels, namely "car" and "brand A".

The results obtained by the above two methods can be the same, but the number of classification models is different. Correspondingly, when training the classification models, the samples used by each classification model are also different.
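The two arrangements can be sketched side by side as follows. The stand-in models below simply return fixed labels taken from the examples above; real models would be trained classifiers. The registry names and the function `second_classification` are hypothetical and used only for this sketch.

```python
from typing import Callable, Dict, List

Model = Callable[[str], List[str]]  # takes an image (placeholder type), returns labels


def animal_model(image: str) -> List[str]:
    return ["dog"]


def vehicle_model(image: str) -> List[str]:
    return ["car", "brand A"]  # one multi-label model


def vehicle_type_model(image: str) -> List[str]:
    return ["car"]  # single-label model


def brand_model(image: str) -> List[str]:
    return ["brand A"]  # single-label model


# Arrangement 1: each first category label maps to one multi-label model.
ONE_MODEL_PER_LABEL: Dict[str, List[Model]] = {
    "animal": [animal_model],
    "car": [vehicle_model],
}
# Arrangement 2: a first category label may map to several single-label models.
SEVERAL_MODELS_PER_LABEL: Dict[str, List[Model]] = {
    "animal": [animal_model],
    "car": [vehicle_type_model, brand_model],
}


def second_classification(image: str, first_labels: List[str],
                          registry: Dict[str, List[Model]]) -> List[str]:
    """Run every model registered for every first category label."""
    result: List[str] = []
    for label in first_labels:
        for model in registry[label]:
            result.extend(model(image))
    return result


with_one = second_classification("picture", ["animal", "car"], ONE_MODEL_PER_LABEL)
with_several = second_classification("picture", ["animal", "car"], SEVERAL_MODELS_PER_LABEL)
```

In this sketch both registries produce the same second category labels; only the number of models, and therefore the training samples each model would need, differs.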

Optionally, one way to obtain the second classification result is to directly input the to-be-processed image into the classification model corresponding to each first class label to obtain the second classification result output by the classification model.

Since the classification model corresponding to each first category label is different, each classification model can actually only recognize the part of the image corresponding to its first category label. For example, of the two first category labels of the image to be processed shown in FIG. 5, the classification model corresponding to "animal" is an animal model, and the classification model corresponding to "car" is a vehicle model. The animal model can only recognize the part of the image that contains the "animal" and cannot recognize the part that contains the "car". The vehicle model can only recognize the part of the image that contains the "car" and cannot recognize the part that contains the "animal". Therefore, if the whole image to be processed is input into the animal model or the vehicle model, invalid information is also input into the classification model, and the invalid information interferes with the valid information, thereby affecting the classification result of the classification model.

In order to solve the foregoing problem, optionally, another method for obtaining the second classification result is provided in the embodiment of the present application, and only valid information in the image to be processed may be input into the classification model for classification. The specific steps include:

Extract the sub-image corresponding to each first category label in the first classification result from the image to be processed, and obtain the classification model corresponding to each first category label in the first classification result; input the sub-image corresponding to the i-th first category label into the classification model corresponding to the i-th first category label to obtain the sub-label of the i-th first category label, where i is a positive integer less than or equal to N, and N is the number of first category labels in the first classification result; and use the sub-labels of all first category labels in the first classification result as the second classification result.

Exemplarily, the process of extracting sub-images is shown in FIG. 6, which is a schematic diagram of segmentation of an image to be processed provided in an embodiment of the present application. After the classification processing in S402 is performed on the image to be processed shown in FIG. 6(a), the first classification result obtained contains two first category labels, namely "QR code" and "text". The part corresponding to "QR code" (the part enclosed by the dotted line 610 in FIG. 6(a)) is extracted from the image to be processed to obtain the sub-image corresponding to the first category label "QR code" (as shown in FIG. 6(b)), and the part corresponding to "text" (the part enclosed by the dotted line 620 in FIG. 6(a)) is extracted from the image to be processed to obtain the sub-image corresponding to the first category label "text" (as shown in FIG. 6(c)).

Then, the sub-image shown in FIG. 6(b) is input into the classification model corresponding to "QR code" for classification (assume that the resulting classification result, that is, the sub-label of "QR code", is "Zhang San#55555555555#"), and the sub-image shown in FIG. 6(c) is input into the classification model corresponding to "text" for classification (assume that the resulting classification result, that is, the sub-labels of "text", are "Name Zhang San" and "Phone 55555555555"). Finally, "Zhang San#55555555555#", "Name Zhang San" and "Phone 55555555555" are taken as the second classification result.

In the above process of performing the second classification and obtaining the second classification result, each first category label in the first classification result corresponds to a classification model, and the classification model corresponding to a first category label is used to classify the sub-image corresponding to that first category label. This performs a more fine-grained classification on the basis of the first classification result, which is equivalent to dividing small classes within a large class, thereby improving the accuracy of semantic recognition of the image to be processed.
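A minimal sketch of the sub-image approach follows. The application does not state how the region for each first category label is located, so this sketch assumes that a rectangular region (top, left, bottom, right) is already known for each label. The toy image, the region coordinates, and the stand-in models that return the labels from the FIG. 6 example are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive
Image = List[str]                # toy image: each string is one row of "pixels"


def crop(image: Image, box: Box) -> Image:
    """Extract the sub-image enclosed by the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def classify_sub_images(image: Image, regions: Dict[str, Box],
                        models: Dict[str, Callable[[Image], List[str]]]) -> List[str]:
    """Feed the i-th sub-image to the model of the i-th first category label."""
    second_result: List[str] = []
    for label, box in regions.items():
        sub_image = crop(image, box)
        second_result.extend(models[label](sub_image))  # sub-labels of this label
    return second_result


# Toy image: "Q" cells belong to the QR code, "T" cells to the text.
image = ["QQ....",
         "QQ....",
         "..TTTT",
         "..TTTT"]
regions = {"QR code": (0, 0, 2, 2), "text": (2, 2, 4, 6)}
models = {
    "QR code": lambda sub: ["Zhang San#55555555555#"],
    "text": lambda sub: ["Name Zhang San", "Phone 55555555555"],
}
second_classification_result = classify_sub_images(image, regions, models)
```

Each model receives only the cells of its own region, so the other region cannot interfere with its classification.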

Optionally, in order to further narrow the category range of the classification result, after the second classification result is obtained, classification with smaller granularity may be performed one or more additional times. For example, after the second classification result is obtained, a third classification process is performed. Specifically, the corresponding classification model is selected according to the second classification result, the image to be processed is input into the classification model, and the third classification result output by the classification model is obtained. For each classification performed after the second classification result, reference may be made to the example in S403, which will not be repeated here. The number of classifications can be preset according to actual needs, and there is no specific limitation here. The more rounds of classification are performed, the smaller the granularity of classification and the narrower the range of the classification results obtained.
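The repeated refinement can be sketched as a loop in which each round selects models according to the previous round's labels. The label "husky" and the table `MODEL_FOR_LABEL` are hypothetical and exist only to give the sketch a third level; the models are stand-ins that return fixed labels.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from a category label to the model that refines it.
MODEL_FOR_LABEL: Dict[str, Callable[[str], List[str]]] = {
    "animal": lambda image: ["dog"],
    "dog": lambda image: ["husky"],
}


def cascade(image: str, first_labels: List[str], rounds: int) -> List[List[str]]:
    """Return the classification results of each round, coarsest first."""
    results = [list(first_labels)]
    for _ in range(rounds):
        next_labels: List[str] = []
        for label in results[-1]:
            model = MODEL_FOR_LABEL.get(label)
            if model is not None:
                next_labels.extend(model(image))
        if not next_labels:  # no finer model is available, stop early
            break
        results.append(next_labels)
    return results


levels = cascade("picture", ["animal"], rounds=2)
```

Here `rounds` plays the role of the preset number of additional classifications; the sketch stops early when no finer model exists for any label.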

The classifier used in S402 and the classification model used in S403 may be set separately or integrated. Two interactive application scenarios are introduced below. Referring to Fig. 7(a) and Fig. 7(b), Fig. 7(a) and Fig. 7(b) are schematic diagrams of interaction between a user and an electronic device provided by an embodiment of the present application.

As shown in Figure 7(a), when the classifier used in S402 and the classification model used in S403 are set separately, after the classifier in S402 obtains the first classification result, the electronic device 100 can first display the first classification result to the user as the information labels of the image to be processed, so that the user selects at least one input label from the first classification result (an input label is any one of the first category labels). After the user selects the input label, the electronic device 100, in response to the detected input label, selects the corresponding classification model according to the input label, inputs the image to be processed into the classification model, obtains the second classification result output by the classification model, and displays the second classification result to the user as the information labels of the image to be processed, so that the user selects at least one information label from the information labels of the image to be processed as input information. When the user selects the input information, the electronic device 100 displays the input information in the information input box in response to the detected input information.

As shown in Figure 7(b), when the classifier used in S402 and the classification model used in S403 are integrated, after the classifier in S402 obtains the first classification result, the electronic device 100 directly selects the corresponding classification model according to the first classification result, inputs the image to be processed into the classification model, and obtains the second classification result output by the classification model. The second classification result is displayed to the user as the information labels of the image to be processed, so that the user selects at least one information label from the information labels of the image to be processed as input information. When the user selects the input information, the electronic device 100 displays the input information in the information input box in response to the detected input information.
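The difference between the two interaction scenarios can be sketched as follows. User interaction is modelled by two callbacks, `pick_first` and `pick_final`, which are hypothetical stand-ins for the user tapping labels on the screen; the classifier and models are likewise stand-ins returning fixed labels.

```python
from typing import Callable, Dict, List

Labels = List[str]


def visual_input(image: str,
                 classifier: Callable[[str], Labels],
                 models: Dict[str, Callable[[str], Labels]],
                 pick_first: Callable[[Labels], Labels],
                 pick_final: Callable[[Labels], Labels],
                 integrated: bool) -> Labels:
    """Return the input information chosen by the user."""
    first_labels = classifier(image)
    if not integrated:
        # Separate setting: the user first chooses input labels
        # among the first category labels.
        first_labels = pick_first(first_labels)
    second_labels: Labels = []
    for label in first_labels:
        second_labels.extend(models[label](image))
    # In both settings the user finally chooses among the second category labels.
    return pick_final(second_labels)


classifier = lambda image: ["animal", "car"]
models = {"animal": lambda image: ["dog"], "car": lambda image: ["car", "brand A"]}
pick_car_only = lambda labels: [label for label in labels if label == "car"]
pick_everything = lambda labels: list(labels)

separate = visual_input("picture", classifier, models,
                        pick_car_only, pick_everything, integrated=False)
combined = visual_input("picture", classifier, models,
                        pick_car_only, pick_everything, integrated=True)
```

In the separate setting only the models for the labels the user chose are run; in the integrated setting the models for all first category labels are run without an intermediate choice.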

S404: Input the second classification result as the information label of the image to be processed.

One way of inputting the second classification result as the information label of the image to be processed is: the electronic device 100 can input all the second category labels in the second classification result as the information label of the image to be processed into the information input box. However, in this way, there is more information input into the information input box, and not every second category label can represent the semantic information that the user wants to express.

In order to solve the above problem, optionally, the second classification result also includes the probability value corresponding to each second category label. The greater the probability value, the more likely it is that the second category label corresponding to this probability value represents the semantic information contained in the image to be processed. Therefore, another way to input the second classification result as the information label of the image to be processed is as follows: the electronic device 100 may select the second category label with the largest probability value in the second classification result as the information label of the image to be processed, and input the obtained information label into the information input box.
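Selecting the label with the largest probability value can be sketched as follows. The dictionary representation and the example probability values are assumptions for illustration.

```python
from typing import Dict


def most_probable_label(second_result: Dict[str, float]) -> str:
    """Return the second category label with the largest probability value."""
    if not second_result:
        raise ValueError("the second classification result is empty")
    return max(second_result, key=second_result.get)


# Hypothetical second classification result: label -> probability value.
second_classification = {"car": 0.55, "brand A": 0.83}
information_label = most_probable_label(second_classification)
```

With these example values, "brand A" would be the label written into the information input box.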

The above method of determining the information label of the image to be processed is equivalent to screening information on the user's behalf, and the information label obtained by this method is often not the semantic information that the user wants to express. Therefore, preferably, another way of inputting the second classification result as the information label of the image to be processed is as follows. As described in the application scenarios of the embodiments in FIG. 3(a) to FIG. 3(f), the electronic device 100 may display each second category label in the second classification result to the user, and the user selects at least one of the second category labels as the information label of the image to be processed; the electronic device 100 then, in response to the detected information label selected by the user, inputs the detected information label into the information input box.

Optionally, both the first classification result and the second classification result may be input as the information label of the image to be processed. For a specific method, please refer to the above-mentioned method of inputting the second classification result as the information label of the image to be processed, which will not be repeated here.

In one embodiment, in order to "infer" more semantic information from the image to be processed, optionally, the extended information corresponding to the first classification result and/or the second classification result can be obtained according to the first classification result and/or the second classification result, and the second classification result and/or the extended information are input as the information label of the image to be processed.

Wherein, the extended information is information related to the first classification result and/or the second classification result. The “relevant” here may refer to information related to all category labels in the first classification result and/or the second classification result. For example, assuming that the second category labels in the second classification result are "Brand A" and "Car" respectively, the obtained extended information may be the introduction information of a brand A car. The “relevant” here may also refer to information related to any category label in the first classification result and/or the second classification result. For example, assuming that the second category labels in the second classification result are "Brand A" and "Car" respectively, the obtained extended information may be brand information of brand A and introduction information of cars.
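The two senses of "related" can be illustrated by how search queries might be formed from the category labels. Joining the labels with a space and the function name `build_queries` are assumptions of this sketch; the application does not specify how a query is composed.

```python
from typing import List


def build_queries(labels: List[str], mode: str) -> List[str]:
    """Form queries related to all labels together, or to each label on its own."""
    if mode == "all":
        return [" ".join(labels)]  # one query covering every label
    if mode == "any":
        return list(labels)        # one query per label
    raise ValueError("mode must be 'all' or 'any'")


labels = ["Brand A", "Car"]
joint_query = build_queries(labels, "all")
separate_queries = build_queries(labels, "any")
```

The joint query corresponds to looking up a brand A car, while the separate queries correspond to looking up brand A and cars independently.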

Optionally, one way of obtaining the extended information corresponding to the first classification result and/or the second classification result is: using a search engine to search the Internet for the extended information corresponding to the first classification result and/or the second classification result. Among them, search engine refers to a retrieval technology that uses specific strategies to retrieve information from the Internet according to user needs and feeds the information back to users. Search engines rely on a variety of technologies, such as web crawler technology, search ranking technology, web page processing technology, big data processing technology, natural language processing technology, etc. In the embodiments of the present application, any existing search engine can be used to search for information, and there is no specific limitation.

The extended information found by the above method is usually too complicated: the amount of content is large, and the correlation between the pieces of information is poor. In order to obtain extended information that is strongly related to the semantic information represented by the category label, optionally, another method of obtaining the extended information corresponding to the first classification result and/or the second classification result is provided in the embodiment of this application. The method specifically includes: inputting the first classification result and/or the second classification result into a preset instruction detection model to obtain an information query instruction output by the instruction detection model; and querying, according to the information query instruction, the extended information corresponding to the first classification result and/or the second classification result.

Among them, the instruction detection model can be pre-trained. The instruction detection model is equivalent to the correspondence between the category label in the first classification result and/or the second classification result and the information query instruction. Each category label can correspond to one or more information query instructions. Information query instructions may include: query parameters, type matching, keyword query, translation, and so on.

Exemplarily, suppose that the category label "A brand car" is input into the instruction detection model and the output information query instruction is query parameters. The electronic device can then use a search engine to query the parameter information of the A brand car from the Internet, and use the queried parameter information as the extended information of the category label "A brand car".

Suppose that the category label "QR code" is input into the instruction detection model and the output information query instruction is type matching. The electronic device can then use existing matching rules to obtain the type information of the QR code (such as business card, official account, or web page link), can further parse the QR code according to its type to obtain parsed information (such as the information in the business card), and can use the type information and/or the parsed information of the QR code as the extended information of the category label "QR code".

Suppose that the category label "rose" is input into the instruction detection model and the output information query instruction is keyword query. The electronic device can then use a search engine to query encyclopedia information about roses from the Internet (such as a simple description of the rose, or a link to a web page that introduces roses), and use the encyclopedia information as the extended information of the category label "rose".

Suppose that the category label is "beautiful", that the language set by the user on the electronic device is Chinese, and that, after the category label is input into the instruction detection model, the output information query instruction is translation. The electronic device can then use a translation application, or query the Internet, to obtain the Chinese definition corresponding to "beautiful", and use the Chinese definition as the extended information of the category label "beautiful".

The foregoing are only examples of information query instructions, and are not used to limit the specific content and functions of the information query instructions.
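Since the instruction detection model is described as equivalent to a correspondence between category labels and information query instructions, the examples above can be sketched with a lookup table followed by a dispatch step. The table contents mirror the four examples; the handler functions only return a description of the action they would take and are hypothetical placeholders for a real search engine, QR code parser, or translation application.

```python
from typing import Callable, Dict, List

# Stand-in for the pre-trained instruction detection model:
# category label -> information query instructions.
INSTRUCTIONS_FOR_LABEL: Dict[str, List[str]] = {
    "A brand car": ["query parameters"],
    "QR code": ["type matching"],
    "rose": ["keyword query"],
    "beautiful": ["translation"],
}

# Placeholder handlers: each returns a description instead of doing real work.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "query parameters": lambda label: f"search engine: parameters of {label}",
    "type matching": lambda label: f"matching rules: type of {label}",
    "keyword query": lambda label: f"search engine: encyclopedia entry for {label}",
    "translation": lambda label: f"translation application: definition of {label}",
}


def detect_instructions(label: str) -> List[str]:
    """Return the information query instructions for a category label."""
    return INSTRUCTIONS_FOR_LABEL.get(label, [])


def query_extended_info(label: str) -> List[str]:
    """Execute every detected instruction and collect the extended information."""
    return [HANDLERS[instruction](label) for instruction in detect_instructions(label)]


rose_info = query_extended_info("rose")
unknown_info = query_extended_info("unlisted label")
```

A trained model would generalize beyond a fixed table; the table is used here only to make the label-to-instruction correspondence concrete.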

The instruction detection model can be obtained by pre-training according to actual needs. For example, the user's historical search information and historical input information can be collected and used as training data to train the instruction detection model. In this way, the trained instruction detection model can reflect the user's information query habits and can "guess" the information query action that the user wants to perform. The extended information is then queried based on the "guessed" information query action, which makes the acquired extended information closer to the semantic information that the user wants to express, thereby making the image information input method more intelligent.

Correspondingly, after acquiring the extended information of the first classification result and/or the second classification result, the second classification result and/or the extended information can be input as the information label of the image to be processed.

In an application scenario, when the user selects a picture on the gallery interface shown in Figure 3(c) and clicks the confirmation button, or when the user clicks the confirmation button on the shooting interface shown in Figure 3(d), the electronic device 100 uses the image information input method provided in the embodiments of the present application to obtain the first classification result and the second classification result of the image to be processed, and inputs the first classification result and/or the second classification result into the preset instruction detection model. After obtaining the information query instructions output by the instruction detection model, the electronic device 100 displays the instruction controls corresponding to the information query instructions on the extended information query interface 70, so that the user can select a target instruction from the information query instructions and click the instruction control corresponding to the target instruction. In response to detecting the instruction control clicked by the user, the electronic device 100 queries the extended information corresponding to the first classification result and/or the second classification result according to the information query instruction corresponding to that instruction control, and displays the second classification result and the extended information as the information labels of the image to be processed in the information label interface 50 shown in FIG. 3(e), so that the user can select the information to be input from the plurality of information labels.

In the embodiment shown in FIG. 4, taking the visual information as a static image as an example, the image information input method provided by the embodiment of the present application is introduced. The following takes the visual information as a dynamic image as an example to introduce the image information input method provided in the embodiment of the present application. Please refer to Fig. 8, which is a schematic flowchart of an image information input method provided by another embodiment of the present application. As shown in FIG. 8, as an example and not a limitation, the image information input method may include the following steps:

S801: Obtain a video to be processed.

In the embodiment of the present application, the electronic device 100 may directly obtain the video input by the user (that is, the video selected by the user from the gallery or the captured video obtained through the camera application), and record the video input by the user as a video to be processed. The electronic device 100 may also obtain video information from a video input by the user, and use the video information as a video to be processed. For example, the electronic device obtains the pixel value information of each frame of image in the video, converts the pixel value information into a bitmap, and records the bitmap as a video to be processed.

S802: Extract image information from the video to be processed, and use at least one picture in the image information as the image to be processed.

Video is composed of image information and audio information. Wherein, the image information includes at least one frame of picture.

Each frame of picture in the image information can be recorded as a to-be-processed image, and then each to-be-processed image is classified and processed separately. However, usually, the information contained in the adjacent frames of the video is the same. Therefore, in order to reduce the amount of calculation, the image information can also be sampled, that is, a picture is obtained every few frames and recorded as the image to be processed.
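The sampling described above can be sketched as follows. This is a minimal illustration that assumes the frames are already decoded into a sequence; the function name `sample_frames` and the parameter `step` are hypothetical.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Sequence[Frame], step: int) -> List[Frame]:
    """Keep one frame out of every `step` frames, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(frames[::step])
```

With `step` set to 1 every frame becomes an image to be processed; larger values trade recognition coverage for less computation.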

S803: Perform classification processing on the image to be processed to obtain a first classification result.

S804: Select a corresponding classification model according to the first classification result, input the to-be-processed image into the classification model, and obtain a second classification result output by the classification model.

Steps S803-S804 are the same as steps S402-S403 in the embodiment of FIG. 4, and for details, please refer to the description of steps S402-S403, which will not be repeated here.

Since a video is composed of image information and audio information, and both contain semantic information, it is necessary to obtain not only the semantic information contained in the image information but also the semantic information contained in the audio information. Therefore, after the to-be-processed video is obtained in S801, the method further includes:

S805: Extract audio information from the video to be processed, perform voice recognition processing on the audio information, and obtain an information tag of the audio information.

Existing automatic speech recognition (ASR) technology can be used to recognize the audio information, obtain the text information contained in the audio information, and use the recognized text information as the information label of the audio information. The recognized complete sentence can be used as the label of the audio information. It is also possible to extract grammatically meaningful keywords from the recognized complete sentence according to grammatical characteristics, and use the keywords as the label of the audio information. For example, if the recognized complete sentence is "A brand car is a popular car at the moment", the grammatically meaningful keyword extracted from this sentence is "A brand car", while words without grammatical meaning, such as prepositions and auxiliary words, are ignored.

The above steps S803-S804 are the process of obtaining the semantic information contained in the image information, and step S805 is the process of obtaining the semantic information contained in the audio information. These two processes can be executed in parallel or one after the other, which is not specifically limited here.
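As a rough illustration of running the image branch (S803-S804) and the audio branch (S805) either in parallel or one after the other, the following sketch uses a thread pool. The function `run_branches` and its two callable parameters are hypothetical placeholders for the actual classification and speech recognition steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_branches(
    classify_images: Callable[[], List[str]],
    recognize_audio: Callable[[], List[str]],
    parallel: bool = True,
) -> Tuple[List[str], List[str]]:
    """Return (second classification result, audio information labels)."""
    if not parallel:
        # Sequential variant: image branch first, then audio branch.
        return classify_images(), recognize_audio()
    # Parallel variant: submit both branches and wait for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        image_future = pool.submit(classify_images)
        audio_future = pool.submit(recognize_audio)
        return image_future.result(), audio_future.result()
```

Both variants return the same pair of results, so the later label-merging step does not depend on which one is chosen.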

After the second classification result of the image to be processed is obtained, and the information label of the audio information is obtained, the method further includes:

S806: Input the information label of the audio information and the second classification result as the information label of the image to be processed.

Since the second classification result may include multiple second category labels, the same second category labels may exist in these second category labels, that is, duplicate semantic information. In order to avoid repeatedly inputting the same semantic information, the second classification result can be deduplicated first. Specific steps can include:

If the same second category label exists more than once in the second classification result, the second classification result is deduplicated; the information label of the audio information and the second classification result after deduplication are then input as the information label of the image to be processed.

Wherein, the deduplication processing means that, among identical second category labels, only one is retained. Exemplarily, suppose that there are three second category labels in the second classification result, two of which are the identical label "car". Only one of the two identical "car" labels is retained, so the final second classification result after deduplication includes two second category labels.

The deduplication processing can also mean that, among second category labels with the same semantics, only one is retained. Exemplarily, continuing the example in Fig. 6, the second classification result recognized for the image to be processed in Fig. 6 includes three second category labels: "Zhang San#55555555555#", "Name Zhang San", and "Phone 55555555555". The two second category labels "Name Zhang San" and "Zhang San#55555555555#" both contain the semantic information of the name Zhang San, so only one of the two needs to be kept; the two second category labels "Phone 55555555555" and "Zhang San#55555555555#" both contain the semantic information of the phone number 55555555555, so only one of the two needs to be kept. Therefore, the second classification result after deduplication processing may include only the one second category label "Zhang San#55555555555#", or may include the two second category labels "Name Zhang San" and "Phone 55555555555".
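The two kinds of deduplication can be sketched as follows. This is an assumption-laden illustration: the application does not specify how semantic overlap is detected, so the sketch takes a hypothetical `semantics` function that maps each label to the set of semantic items it carries, and greedily drops any label whose items are already covered by labels kept earlier.

```python
from typing import Callable, FrozenSet, Iterable, List, Set


def dedupe_exact(labels: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each identical label, preserving order."""
    return list(dict.fromkeys(labels))


def dedupe_semantic(
    labels: Iterable[str],
    semantics: Callable[[str], FrozenSet[str]],
) -> List[str]:
    """Drop labels whose semantic items are all covered by earlier kept labels."""
    kept: List[str] = []
    covered: Set[str] = set()
    for label in labels:
        items = semantics(label)
        if items and items <= covered:
            continue  # nothing new: every item is already expressed
        kept.append(label)
        covered |= items
    return kept
```

Because the rule is greedy, the outcome depends on label order: processing "Zhang San#55555555555#" first keeps only that label, while processing "Name Zhang San" and "Phone 55555555555" first keeps those two, which corresponds to the two alternatives described above.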

After the deduplication processing, the information label of the audio information and each second category label in the second classification result after the deduplication processing can be used as the information label of the image to be processed. However, the information tags of the image to be processed obtained in this way are complicated, and they may contain multiple information tags that are invalid or that cannot reflect the semantic information that the user wants to express.

In order to solve the foregoing problem, optionally, the part that can express the same semantic information among the information label of the audio information and the second classification result after deduplication can be used as the information label of the image to be processed. The specific methods are introduced in the following two situations.

Case 1: If there is a first target label in the second classification result after de-duplication processing.

Wherein, the first target label is a second category label that matches the information label of the audio information in the second classification result after deduplication processing.

This case indicates that the information label of the audio information and the information label of the image information (that is, the second classification result after deduplication) share a part (that is, the first target label) that can express the same semantic information. Therefore, the first target label is input as the information label of the image to be processed.

Here, "matching" can mean that the two labels are identical, or that they express the same semantic information. For example, if the information label of the audio information includes "brand A car" and the second category labels in the second classification result after deduplication processing include "brand A car", the two labels are identical, so "brand A car" is the first target label. For another example, if the information label of the audio information includes "rose" and the second category labels in the second classification result after deduplication processing include "rose", the two labels represent the same semantic information and therefore match, so the second category label "rose" is recorded as the first target label.

Case 2: If there is no first target label in the second classification result after de-duplication processing.

This case indicates that the information label of the audio information and the information label of the image information (that is, the second classification result after deduplication processing) share no part that can express the same semantic information (that is, there is no first target label).

In this case, the extended information of the second category label in the second classification result after deduplication can be searched until the second target label is searched out, and the second target label is input as the information label of the image to be processed.

The second target tag is the extended information of the second category tag that matches the information tag of the audio information.

Exemplarily, assume that the information label of the audio information includes "XX Official Account", the second category labels of the second classification result after deduplication processing include "QR code", and the extended information found for "QR code" includes "XX Official Account" and "XX Company". Since the "XX Official Account" in the extended information is the same as the "XX Official Account" in the information label of the audio information, "XX Official Account" is recorded as the second target label.

By searching for the extended information of the second category label in the second classification result, it is possible to find the part that can express the same semantic information in both the information label of the audio information and the information label of the image information. For the method of searching for the extended information of the second category label in the second classification result, refer to the description of “obtain the first classification result and/or the extended information corresponding to the second classification result” in an embodiment of step S404. This will not be repeated here.

Through the above method, the part of the second classification result that matches the information label of the audio information is input as the information label of the image to be processed. That is, the same or similar semantic information contained in the image and the audio is extracted, which is equivalent to screening the recognized semantic information once. As a result, the information label of the image to be processed is closer to the semantic information that the user wants to express, and the accuracy of the visual input information can be improved.
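Case 1 and Case 2 can be sketched together as one selection routine. This is an illustrative sketch only: `select_target_label`, `extended_info`, and `matches` are hypothetical names, and exact string equality is used as the default matching rule, whereas the application also allows semantic matching.

```python
from typing import Callable, Iterable, List, Optional


def select_target_label(
    audio_labels: Iterable[str],
    second_labels: Iterable[str],
    extended_info: Callable[[str], List[str]],
    matches: Callable[[str, str], bool] = lambda a, b: a == b,
) -> Optional[str]:
    """Return the first or second target label, or None if neither exists."""
    audio = list(audio_labels)
    second = list(second_labels)
    # Case 1: a second category label matches an audio information label.
    for label in second:
        if any(matches(label, a) for a in audio):
            return label
    # Case 2: search the extended information of each second category
    # label until an entry matches an audio information label.
    for label in second:
        for info in extended_info(label):
            if any(matches(info, a) for a in audio):
                return info
    return None
```

With the audio label "XX Official Account", the second category label "QR code", and the extended information "XX Official Account" and "XX Company", the routine falls through to Case 2 and returns "XX Official Account".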

After the information label of the image to be processed is obtained according to the above method, that is, after the first target label or the second target label is obtained, the first target label or the second target label can be directly input into the information input box as the information label of the image to be processed.

Optionally, the first target label or the second target label may also be displayed to the user as the top-recommended label among the information labels of the image to be processed.

In other words, the information labels of the image to be processed may include the second classification result, the information label of the audio information, and the first target label or the second target label, with the first target label or the second target label regarded as the top-recommended label.

Here, the top-recommended label refers to a label that is displayed in a way that is clearly distinguished from the other information labels of the image to be processed. For example, the information labels of the image to be processed are displayed to the user as a sequence, with the top-recommended label at the beginning of the sequence. For another example, the font color of the top-recommended label differs from the font color of the other labels (for example, the top-recommended label is displayed in red and the other labels in black), so that the user can quickly notice the top-recommended label among the information labels of the image to be processed.
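The sequence-based display option, in which the first target label or the second target label is placed at the beginning of the sequence, can be sketched as follows. The function name `order_labels` is hypothetical, and the sketch covers only the ordering, not the font color variant.

```python
from typing import List, Sequence


def order_labels(labels: Sequence[str], target: str) -> List[str]:
    """Place the target label first and keep the other labels in their original order."""
    rest = [label for label in labels if label != target]
    return [target] + rest
```

The target label is placed first even if it does not yet appear among the other labels, which covers the Case 2 situation where the target comes from extended information.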

Corresponding to the image information input method described in the above embodiment, FIG. 9 is a structural block diagram of an image information input device provided in an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.

Referring to Figure 9, the device includes:

The image acquisition unit 91 is used to acquire an image to be processed.

The first classification unit 92 is configured to perform classification processing on the to-be-processed image to obtain a first classification result.

The second classification unit 93 is configured to select a corresponding classification model according to the first classification result, input the to-be-processed image into the classification model, and obtain a second classification result output by the classification model.

The information input unit 94 is configured to input the second classification result as the information label of the image to be processed.

Exemplarily, taking the Android platform as an example, the working process of the image information input device is introduced.

When acquiring a captured picture through the camera application, the image acquisition unit 91 first starts the camera application, then registers the camera button and/or focus callback function to acquire the image data captured by the camera application, converts the image data into a bitmap, and records the bitmap as the image to be processed.

When acquiring a picture from the gallery, the image acquisition unit 91 first starts the photo album application, then registers the picture-selected callback function to acquire the data of the selected picture, converts the data of the selected picture into a bitmap, and records the bitmap as the image to be processed.

After the image acquisition unit 91 acquires the image to be processed, it passes the image to be processed (i.e., the bitmap) to the first classification unit 92 as a parameter of the focus callback function. The first classification unit 92 registers the first classification result callback function.

The first classification unit 92 transmits the first classification result to the second classification unit 93 through the first classification result callback function. The second classification unit 93 registers the second classification result callback function.

The second classification unit 93 transmits the second classification result to the information input unit 94 through the second classification result callback function. The information input unit 94 inserts the information label of the image to be processed at the cursor position on the current interface.
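The callback chaining between the units can be illustrated with a platform-neutral sketch. It is written in Python rather than with Android APIs, and the class `Unit`, the function `build_pipeline`, the toy classifiers, and the list that stands in for the cursor position are all hypothetical.

```python
from typing import Any, Callable, List, Optional


class Unit:
    """A processing unit that forwards its result through a registered callback."""

    def __init__(self, process: Callable[[Any], Any]) -> None:
        self._process = process
        self._callback: Optional[Callable[[Any], None]] = None

    def register_callback(self, callback: Callable[[Any], None]) -> None:
        self._callback = callback

    def on_input(self, data: Any) -> None:
        result = self._process(data)
        if self._callback is not None:
            self._callback(result)


def build_pipeline(inserted: List[str]) -> Unit:
    """Wire toy classification units to a list that stands in for the cursor."""
    # Toy first classifier: passes the bitmap along with its first category labels.
    first = Unit(lambda bitmap: (bitmap, ["flower"]))
    # Toy second classifier: returns second category labels for the pair.
    second = Unit(lambda pair: ["rose"])
    first.register_callback(second.on_input)    # first result callback
    second.register_callback(inserted.extend)   # second result callback
    return first


inserted_labels: List[str] = []
build_pipeline(inserted_labels).on_input("bitmap")
```

After the last line runs, `inserted_labels` holds the labels that the information input unit would insert at the cursor.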

Optionally, the first classification result includes at least one first category label.

Optionally, the second classification unit 93 is also used for:

Extract the sub-image corresponding to each of the first category labels in the first classification result from the to-be-processed image, and obtain the classification model corresponding to each of the first category labels in the first classification result; input the sub-image corresponding to the i-th first category label in the first classification result into the classification model corresponding to the i-th first category label to obtain the sub-label of the i-th first category label, wherein i is a positive integer less than or equal to N, and N is the number of first category labels in the first classification result; and use the sub-labels of the first category labels in the first classification result as the second classification result.
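The per-label processing performed by the second classification unit can be sketched as a loop over the N first category labels. The names `second_classification`, `crop`, and `models` are hypothetical, and the sketch assumes that every first category label has a registered classification model.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Image = TypeVar("Image")


def second_classification(
    image: Image,
    first_labels: Sequence[str],
    crop: Callable[[Image, str], Image],
    models: Dict[str, Callable[[Image], str]],
) -> List[str]:
    """Return one sub-label per first category label."""
    sub_labels: List[str] = []
    for label in first_labels:            # i-th first category label, i = 1..N
        sub_image = crop(image, label)    # sub-image for this label
        model = models[label]             # classification model for this label
        sub_labels.append(model(sub_image))
    return sub_labels
```

The returned list of sub-labels corresponds to the second classification result.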

Optionally, the device 9 further includes:

The extended information acquiring unit is configured to, after the to-be-processed image is input into the classification model and the second classification result output by the classification model is obtained, acquire, according to the first classification result and/or the second classification result, the extended information corresponding to the first classification result and/or the second classification result, wherein the extended information is information related to the first classification result and/or the second classification result.

Correspondingly, the information input unit 94 is further configured to input the second classification result and/or the extended information as the information label of the image to be processed.

Optionally, the extended information acquiring unit is also used to:

Input the first classification result and/or the second classification result into a preset instruction detection model to obtain an information query instruction output by the instruction detection model; and query, according to the information query instruction, the extended information corresponding to the first classification result and/or the second classification result.

Optionally, the image acquisition unit 91 includes:

The image information acquisition module is used to acquire a video to be processed and extract image information from the video to be processed, wherein the image information includes at least one frame of picture; and to use at least one frame of picture in the image information as the image to be processed.

Optionally, the image acquisition unit 91 further includes:

The audio information acquisition module is configured to extract audio information from the to-be-processed video after the acquisition of the to-be-processed video; perform voice recognition processing on the audio information to obtain the information tag of the audio information.

Correspondingly, the information input unit 94 is further configured to input the information label of the audio information and the second classification result as the information label of the image to be processed.

Optionally, the second classification result includes at least one second category label.

Optionally, the information input unit 94 is also used to:

If the same second category label exists more than once in the second classification result, perform deduplication processing on the second classification result; and input the information label of the audio information and the second classification result after deduplication as the information label of the image to be processed.

Optionally, the information input unit 94 is also used to:

If there is a first target label in the second classification result after deduplication processing, input the first target label as the information label of the image to be processed, where the first target label is a second category label, in the second classification result after deduplication processing, that matches the information label of the audio information;

If the first target label does not exist in the second classification result after deduplication processing, search the extended information of the second category labels in the second classification result after deduplication processing until a second target label is found, and input the second target label as the information label of the image to be processed, where the second target label is extended information of a second category label that matches the information label of the audio information.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above-mentioned functional units and modules is used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as required, that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated in one processing unit, or each unit can exist alone physically, or two or more units can be integrated in one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the foregoing system, reference may be made to the corresponding process in the foregoing method embodiment, which will not be repeated here.

The embodiments of the present application also provide a computer-readable storage medium including computer instructions which, when run on a computer or a processor, cause the computer or the processor to execute the steps in each of the above-mentioned image information input method embodiments.

The embodiments of the present application provide a computer program product. When the computer program product runs on a computer or a processor, the computer or the processor is caused to execute the steps in each of the foregoing image information input method embodiments.

The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), etc.

An embodiment of the present application further provides a chip system, wherein the chip system includes a processor, the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the steps in the above-mentioned image information input method embodiments. The chip system may be a single chip or a chip module composed of multiple chips.

In the above-mentioned embodiments, the description of each embodiment has its own focus. For parts that are not described in detail or recorded in an embodiment, reference may be made to related descriptions of other embodiments.

A person of ordinary skill in the art may realize that the units and method steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.

Finally, it should be noted that the above are only specific implementations of this application, but the scope of protection of this application is not limited to this. Any changes or substitutions within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

An image information input method, characterized in that it comprises:

Obtaining an image to be processed;

Performing classification processing on the to-be-processed image to obtain a first classification result;

Selecting a corresponding classification model according to the first classification result, inputting the to-be-processed image into the classification model, and obtaining a second classification result output by the classification model;

Inputting the second classification result as the information label of the image to be processed.
The image information input method according to claim 1, wherein the first classification result includes at least one first category label;

The selecting a corresponding classification model according to the first classification result, inputting the to-be-processed image into the classification model, and obtaining the second classification result output by the classification model includes:

Extracting the sub-image corresponding to each of the first category labels in the first classification result from the to-be-processed image, and obtaining the classification model corresponding to each of the first category labels in the first classification result;

Inputting the sub-image corresponding to the i-th first category label in the first classification result into the classification model corresponding to the i-th first category label to obtain the sub-label of the i-th first category label, wherein i is a positive integer less than or equal to N, and N is the number of first category labels in the first classification result;

Using the sub-label of each of the first category labels in the first classification result as the second classification result.
The image information input method according to claim 1 or 2, characterized in that, after inputting the to-be-processed image into the classification model and obtaining a second classification result output by the classification model, the method further comprises:

Acquiring, according to the first classification result and/or the second classification result, the extended information corresponding to the first classification result and/or the second classification result, wherein the extended information is information related to the first classification result and/or the second classification result;

Correspondingly, the inputting the second classification result as the information label of the image to be processed includes:

The second classification result and/or the extended information are input as the information label of the image to be processed.
The image information input method according to claim 3, wherein said obtaining the extended information corresponding to the first classification result and/or the second classification result comprises:

Inputting the first classification result and/or the second classification result into a preset instruction detection model to obtain an information query instruction output by the instruction detection model;

Query the extended information corresponding to the first classification result and/or the second classification result according to the information query instruction.
The image information input method according to claim 1, wherein said acquiring the image to be processed comprises:

Acquiring a video to be processed, and extracting image information from the video to be processed, wherein the image information includes at least one frame of picture;

Use at least one picture in the image information as the image to be processed.
The image information input method according to claim 5, characterized in that, after said obtaining the to-be-processed video, it further comprises:

Extract audio information from the to-be-processed video;

Performing voice recognition processing on the audio information to obtain an information tag of the audio information;

Correspondingly, the inputting the second classification result as the information label of the image to be processed includes:

The information label of the audio information and the second classification result are input as the information label of the image to be processed.
The image information input method according to claim 6, wherein the second classification result includes at least one second category label;

The inputting the information label of the audio information and the second classification result as the information label of the image to be processed includes:

If the same second category label exists in the second classification result, perform deduplication processing on the second classification result;

The information label of the audio information and the second classification result after deduplication are input as the information label of the image to be processed.
The image information input method according to claim 7, wherein the inputting the information label of the audio information and the second classification result after deduplication processing as the information label of the image to be processed comprises:

If there is a first target label in the second classification result after deduplication processing, inputting the first target label as the information label of the image to be processed, where the first target label is a second category label, in the second classification result after deduplication processing, that matches the information label of the audio information;

If the first target label does not exist in the second classification result after deduplication processing, searching the extended information of the second category labels in the second classification result after deduplication processing until a second target label is found, and inputting the second target label as the information label of the image to be processed, where the second target label is extended information of a second category label that matches the information label of the audio information.
An electronic device, wherein the electronic device includes a processor, and the processor is configured to run a computer program stored in a memory to implement the method according to any one of claims 1 to 8.
A computer storage medium comprising computer instructions, which when the computer instructions run on a computer or a processor, cause the computer or the processor to execute the method according to any one of claims 1 to 8.