CN109584864B - Image processing apparatus and method - Google Patents
- Publication number
- CN109584864B (publication) · CN201710913272.4A (application)
- Authority
- CN
- China
- Prior art keywords
- image processing
- instruction
- image
- target
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The invention discloses an image processing apparatus comprising: a voice collector that collects a voice signal input by a user; an instruction converter that converts the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, the target area being the processing area of an image to be processed; and an image processor that processes the target area according to the image processing instruction and a target image processing model. The embodiments of the invention enable images to be processed by voice input, saving the user the time otherwise spent learning image processing software and improving the user experience.
Description
Technical Field
The present invention relates to the field of image processing, and in particular, to an image processing apparatus and method.
Background
After taking a picture, a user typically processes the image with PS (Photoshop) software on a computer or retouching software on a mobile phone in order to obtain a better visual effect.
However, before such software can be used to process an image, the user must first learn how to operate it, and must then manually enter instructions to control the computer or mobile phone to perform the retouching operations. This approach is time-consuming for the user, and the user experience is poor.
Disclosure of Invention
The embodiment of the invention provides an image processing device and method, which realize the function of processing images by inputting voice, save the time for a user to learn image processing software before image processing and improve the user experience.
In a first aspect, an embodiment of the present invention provides an image processing apparatus, including:
the voice collector is used for collecting voice signals input by a user;
the instruction converter is used for converting the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, wherein the target area is a processing area of an image to be processed;
and the image processor is used for processing the target area according to the image processing instruction and the target image processing model.
In a possible embodiment, the instruction converter comprises:
the instruction converter includes:
a first speech recognizer for converting the voice signal into text information through a speech recognition technique;
a speech-to-text converter for converting text information into the image processing instructions by natural language processing techniques and the target speech instruction conversion model;
the first image identifier is used for dividing the area of the image to be processed according to the granularity of the semantic area in the image processing instruction and the image recognition technology, and acquiring the target area.
In a possible embodiment, the instruction converter comprises:
a second speech recognizer for converting the speech signal into the image processing instruction through the speech recognition technique, semantic understanding technique, and the target speech instruction conversion model;
and the second image identifier is used for carrying out region division on the image to be processed according to the granularity of the semantic region in the image processing instruction and the image recognition technology, and acquiring the target region.
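As a rough illustration of the two-stage conversion described above (speech to text, then text to an image processing instruction plus a semantic region), the following Python sketch uses a toy keyword table in place of real speech recognition, natural language processing, and the target voice instruction conversion model; all names and the keyword table are illustrative assumptions, not the patent's implementation:

```python
# Toy keyword table standing in for the target voice instruction
# conversion model (illustrative only).
COMMANDS = {"brighten": "BRIGHTEN", "blur": "BLUR", "sharpen": "SHARPEN"}
REGIONS = {"face", "background", "sky"}

def speech_to_text(speech_signal: str) -> str:
    # Placeholder for a real speech recognizer (e.g., ANN- or HMM-based).
    return speech_signal.lower()

def text_to_instruction(text: str):
    # Placeholder NLP step: extract an operation and a semantic region.
    op = next((c for w, c in COMMANDS.items() if w in text), None)
    region = next((r for r in REGIONS if r in text), None)
    return op, region

op, region = text_to_instruction(speech_to_text("Please blur the background"))
print(op, region)  # BLUR background
```

The real apparatus would then divide the image at the granularity of the extracted semantic region (here, "background") to obtain the target area.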
In a possible embodiment, the image processing apparatus further includes:
and the memory is used for storing the text information or the image processing instruction or the target area.
In a possible embodiment, the image processor comprises:
the instruction fetching module is used for obtaining M image processing instructions from the memory in a preset time window, wherein M is an integer greater than 1;
and the processing module is used for processing the target area according to the M image processing instructions and the target image processing model.
In a possible embodiment, the processing module is configured to:
deleting image processing instructions with the same functions from the M image processing instructions to obtain N image processing instructions, wherein N is an integer smaller than M;
And processing the target area according to the N image processing instructions and the target image processing model.
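A minimal sketch of the deduplication step above, assuming instructions are modeled as (function, parameters) tuples and that "same function" simply means an equal function name (the patent leaves both representations open):

```python
def dedupe_instructions(instructions):
    """Drop later instructions whose function duplicates an earlier one.

    The representation is an illustrative assumption: each instruction
    is a (function_name, params) tuple, and 'same function' is reduced
    to equality of function_name.
    """
    seen, kept = set(), []
    for func, params in instructions:
        if func not in seen:
            seen.add(func)
            kept.append((func, params))
    return kept

m_instructions = [("BLUR", {"radius": 2}),
                  ("BRIGHTEN", {"delta": 10}),
                  ("BLUR", {"radius": 5})]   # M = 3 instructions
n_instructions = dedupe_instructions(m_instructions)
print(len(n_instructions))  # 2  (N < M)
```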
In a possible embodiment, the instruction converter is configured to:
and carrying out self-adaptive training on the voice command conversion model to obtain the target voice command conversion model.
In a possible embodiment, the instruction converter adaptively trains the speech instruction conversion model off-line or on-line.
In a possible embodiment, the instruction converter adaptively trains the speech instruction conversion model either supervised or unsupervised.
In a possible embodiment, the instruction converter is further configured to:
converting the voice signal into a prediction instruction according to the voice instruction conversion model;
determining a correlation coefficient of the predicted instruction and an instruction set corresponding to the predicted instruction;
optimizing the voice command conversion model according to the correlation coefficient of the predicted command and the command set corresponding to the predicted command so as to obtain the target voice command conversion model.
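The predict/correlate/optimize loop above can be sketched as follows; the linear model, the vector encoding of instructions, and the update rule are all illustrative assumptions, with the correlation computed as Pearson's coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a linear map from a speech-feature vector to an
# instruction embedding; the target instruction set is a fixed vector.
# All of this is an illustrative stand-in for the patent's model.
features = rng.normal(size=8)
target = rng.normal(size=8)
W = np.zeros((8, 8))

def correlation(a, b):
    return float(np.corrcoef(a, b)[0, 1])

lr = 0.1
for _ in range(200):
    pred = W @ features
    # Nudge W so the prediction moves toward the target, which raises
    # the prediction/target correlation coefficient.
    W += lr * np.outer(target - pred, features) / (features @ features)

print(round(correlation(W @ features, target), 3))  # 1.0
```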
In a possible embodiment, the image processing apparatus further includes:
the trainer is used for converting the voice signal into a prediction instruction according to the voice instruction conversion model; determining a correlation coefficient of the predicted instruction and an instruction set corresponding to the predicted instruction; optimizing the voice command conversion model according to the correlation coefficient of the predicted command and the command set corresponding to the predicted command so as to obtain the target voice command conversion model.
In a possible embodiment, the image processor is configured to:
and carrying out self-adaptive training on the image processing model to obtain the target image processing model.
In a possible embodiment, the image processor adaptively trains the image processing model off-line or on-line.
In a possible embodiment, the image processor's adaptive training of the image processing model is supervised or unsupervised.
In a possible embodiment, the image processor is further configured to:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient of the predicted image and a target image corresponding to the predicted image;
and optimizing the image processing model according to the correlation coefficient of the predicted image and the corresponding target image so as to obtain the target image processing model.
In a possible embodiment, the trainer is further configured to:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient of the predicted image and a target image corresponding to the predicted image;
And optimizing the image processing model according to the correlation coefficient of the predicted image and the corresponding target image so as to obtain the target image processing model.
In a second aspect, an embodiment of the present invention provides an image processing method, including:
collecting voice signals input by a user;
converting the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, wherein the target area is a processing area of an image to be processed;
and processing the target area according to the image processing instruction and the target image processing model.
In a possible embodiment, the converting the speech signal into the image processing command and the target area according to the target speech command conversion model includes:
converting the voice signal into text information through a voice recognition technology;
converting the text information into the image processing instruction through a natural language processing technology and the target voice instruction conversion model;
and dividing the region of the image to be processed according to the granularity of the semantic region in the image processing instruction and the image recognition technology, and obtaining the target region.
In a possible embodiment, the converting the speech signal into the image processing command and the target area according to the target speech command conversion model includes:
Converting the voice signal into the image processing instruction through a voice recognition technology, a semantic understanding technology and the voice instruction conversion model;
and dividing the region of the image to be processed according to the granularity of the semantic region in the image processing instruction and the image recognition technology, and obtaining the target region.
In a possible embodiment, the method further comprises:
storing the text information or the image processing instruction or the target area.
In a possible embodiment, the processing the target area according to the image processing instruction and the target image processing model includes:
obtaining M image processing instructions from the memory within a preset time window, wherein M is an integer greater than 1;
and processing the target area according to the M image processing instructions and the target image processing model.
In a possible embodiment, the processing the target area according to the M image processing instructions and the target image processing model includes:
deleting image processing instructions with the same functions from the M image processing instructions to obtain N image processing instructions, wherein N is an integer smaller than M;
And processing the target area according to the N image processing instructions and the target image processing model.
In a possible embodiment, before receiving the voice signal and the image to be processed, the method further comprises:
and carrying out self-adaptive training on the voice command conversion model to obtain the target voice command conversion model.
In a possible embodiment, the adaptive training of the speech instruction conversion model is performed offline or online.
In a possible embodiment, the adaptive training of the speech instruction conversion model is supervised or unsupervised.
In a possible embodiment, the adaptively training the voice command conversion model to obtain the target voice command conversion model includes:
converting the voice signal into a prediction instruction according to the voice instruction conversion model;
determining a correlation coefficient of the predicted instruction and an instruction set corresponding to the predicted instruction;
optimizing the voice command conversion model according to the correlation coefficient of the predicted command and the command set corresponding to the predicted command so as to obtain the target voice command conversion model.
In a possible embodiment, before receiving the voice signal and the image to be processed, the method further comprises:
and carrying out self-adaptive training on the image processing model to obtain the target image processing model.
In a possible embodiment, the adaptive training of the image processing model is performed offline or online.
In a possible embodiment, the adaptive training of the image processing model is supervised or unsupervised.
In a possible embodiment, the adaptively training the image processing model to obtain the target image processing model includes:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient of the predicted image and a target image corresponding to the predicted image;
and optimizing the image processing model according to the correlation coefficient of the predicted image and the corresponding target image so as to obtain the target image processing model.
In a third aspect, an embodiment of the present invention further provides an image processing chip, where the chip includes the image processing apparatus of the first aspect of the embodiment of the present invention.
In one possible embodiment, the chip includes a main chip and a cooperation chip;
the cooperation chip includes the apparatus according to the first aspect of the embodiment of the present invention, and the main chip is configured to provide a start signal to the cooperation chip and to control the transfer of the image to be processed and the image processing instruction to the cooperation chip.
In a fourth aspect, an embodiment of the present invention provides a chip package structure, where the chip package structure includes the image processing chip according to the third aspect of the embodiment of the present invention.
In a fifth aspect, an embodiment of the present invention provides a board, where the board includes the chip package structure according to the fourth aspect of the embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes the board card according to the fifth aspect of the embodiment of the present invention.
It can be seen that, in the scheme of the embodiment of the invention, the voice collector collects the voice signal input by the user; the command converter converts the voice signal into an image processing command and a target area according to a target voice command conversion model, wherein the target area is a processing area of an image to be processed; and the image processor processes the target area according to the image processing instruction and the target image processing model. Compared with the existing image processing technology, the invention performs image processing through voice, saves the time for a user to learn the image processing software before performing image processing, and improves the user experience.
These and other aspects of the invention will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 2 is a schematic partial structure of another image processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a partial structure of another image processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a partial structure of another image processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a chip according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another chip according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 8 is a flowchart of an image processing method according to an embodiment of the present invention.
Detailed Description
The embodiments of the invention are described in detail below.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an image processing apparatus according to an embodiment of the invention. As shown in fig. 1, the image processing apparatus 100 includes:
the voice collector 101 is configured to collect a voice signal input by a user.
Optionally, the image processing apparatus 100 further includes a noise filter, and the noise filter performs noise reduction processing on the voice signal after the voice signal is acquired by the voice acquirer 101.
Alternatively, the voice collector may be a voice sensor, a microphone, a pickup, or another audio collection device.
Specifically, the voice collector 101 also receives the ambient sound signal when receiving the voice signal, and the noise filter performs noise reduction on the voice signal based on that ambient sound signal, which constitutes noise relative to the voice signal.
Further, the voice collector 101 may include a microphone array that collects both the voice signal and the ambient sound signal and performs the noise reduction.
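One common way such an ambient signal can drive noise reduction is spectral subtraction; the patent does not name a specific algorithm, so the following sketch is only an illustrative assumption:

```python
import numpy as np

def spectral_subtract(voice, ambient):
    """Subtract the ambient (noise) magnitude spectrum from the voice
    spectrum, keeping the voice phase; a deliberately simplified form
    of spectral subtraction."""
    V = np.fft.rfft(voice)
    N = np.abs(np.fft.rfft(ambient))
    mag = np.maximum(np.abs(V) - N, 0.0)        # floor magnitudes at zero
    return np.fft.irfft(mag * np.exp(1j * np.angle(V)), n=len(voice))

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 1024, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)             # a pure 220 Hz tone
noise = 0.3 * rng.normal(size=t.size)
denoised = spectral_subtract(clean + noise, noise)
err_before = np.mean(((clean + noise) - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
print(err_after < err_before)  # True
```

In practice the ambient capture only approximates the noise mixed into the voice channel, so the cancellation is less exact than in this toy case.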
Optionally, in a possible embodiment, the image processing apparatus further includes a first memory. After the voice collector collects the voice signals, the image processing device stores the voice signals into a first memory.
The instruction converter 102 is configured to convert the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, where the target area is a processing area of the image to be processed.
Alternatively, the instruction converter 102 may obtain the voice signal from the first memory before converting the voice signal into the image processing instruction and the target area according to the voice recognition technology, the natural language processing technology, and the image recognition technology.
Wherein the instruction converter 102 comprises:
a first speech recognizer 1021 for converting the speech signal into text information by a speech recognition technique;
a speech to text converter 1022 for converting the text information into the image processing instructions by natural language processing techniques and the target speech instruction conversion model;
the first image identifier 1023 is configured to perform region division on the image to be processed according to granularity of a semantic region in the image processing instruction and an image recognition technology, and obtain the target region.
Further, the instruction converter 102 further includes:
an acquiring module 1026, configured to acquire granularity of the semantic region in the image processing instruction.
For example, if the image processing apparatus 100 determines from the voice signal that the target region is a face region, the semantic region is the face region of the image to be processed, and the apparatus obtains the face regions of the image at face granularity; when the target area is the background, the apparatus divides the image to be processed into a background area and a non-background area; when the target area is a red area, the apparatus divides the image to be processed into areas of different colors.
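The color-granularity case can be sketched with a simple per-pixel mask; the function name, tolerance value, and RGB representation are illustrative assumptions:

```python
import numpy as np

def divide_by_color(image, target_rgb, tol=30):
    """Split an image into 'near target color' vs 'other' regions.

    `image` is an (H, W, 3) uint8 array; a pixel belongs to the target
    region when its summed channel distance to `target_rgb` is <= tol.
    """
    diff = np.abs(image.astype(int) - np.asarray(target_rgb)).sum(axis=-1)
    return diff <= tol   # boolean mask: True where pixel is near target color

img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:2, :2] = (255, 0, 0)                 # a 2x2 red patch
red_mask = divide_by_color(img, (255, 0, 0))
print(int(red_mask.sum()))  # 4
```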
Specifically, the speech recognition technology used in the present invention includes, but is not limited to, using models such as artificial neural network (Artificial Neural Network, ANN), hidden markov model (Hidden Markov Model, HMM), etc., and the above-mentioned first speech recognition unit may process the above-mentioned speech signal according to the above-mentioned speech recognition technology; the natural language processing technology includes, but is not limited to, statistical machine learning, ANN and other methods, and the semantic understanding unit may extract semantic information according to the natural language processing technology; the image recognition technology includes, but is not limited to, algorithms such as a method based on edge detection, a threshold segmentation method, a region growing and watershed algorithm, a gray integral projection curve analysis, template matching, a deformable template, hough transformation, a Snake operator, an elastic image matching technology based on Gabor wavelet transformation, an active shape model, an active appearance model and the like, and the image recognition unit can segment the image to be processed into different regions according to the image recognition technology.
In one possible embodiment, the first speech recognizer 1021 converts the speech signal into text information by the speech recognition technology, and saves the text information in the first memory. The voice text converter 1022 obtains the text information from the first memory, converts the text information into an image processing instruction through a natural language processing technique and the target voice instruction conversion model, and stores the image processing instruction in the first memory; the first image identifier 1023 performs region division on the image to be processed according to granularity of semantic regions in the image processing instruction and an image recognition technology, acquires the target region, and stores the division result and the target region in the second memory.
In one possible embodiment, the instruction converter 102 may also include:
a second speech recognizer 1025 for directly converting the speech signal into the image processing instruction according to a speech recognition technique, a natural language processing technique, and the target speech instruction conversion model, and storing the image processing instruction in a first memory;
The second image identifier 1026 is configured to divide the image to be processed at the granularity of the semantic region in the image processing instruction, obtain the target region (the region of the image to be processed on which processing is performed), and store the division result and the target region in the second memory.
Optionally, before the voice collector 101 receives the voice signal and the image to be processed, the command converter 102 performs adaptive training on the voice command conversion model to obtain the target voice command conversion model.
Wherein, the self-adaptive training of the voice instruction conversion model is performed offline or online.
Specifically, offline adaptive training of the voice instruction conversion model means that the instruction converter 102 trains the model on its own hardware to obtain the target voice instruction conversion model; online adaptive training means that a cloud server distinct from the instruction converter 102 trains the model to obtain the target voice instruction conversion model. When the instruction converter 102 needs to use the target voice instruction conversion model, it obtains the model from the cloud server.
Optionally, the adaptive training of the voice instruction conversion model is supervised or unsupervised.
Specifically, the adaptive training of the voice command conversion model is supervised specifically as follows:
the instruction converter 102 converts the voice signal into a predicted instruction according to the voice instruction conversion model; it then determines the correlation coefficient between the predicted instruction and its corresponding instruction set, where the instruction set is a set of instructions obtained manually from the voice signal; the instruction converter 102 optimizes the voice instruction conversion model according to this correlation coefficient to obtain the target voice instruction conversion model.
In one possible embodiment, the image processing apparatus 100 further includes:
a trainer 105 for converting the speech signal into predicted instructions according to the speech instruction conversion model; determining a correlation coefficient of the predicted instruction and an instruction set corresponding to the predicted instruction; optimizing the voice command conversion model according to the correlation coefficient of the predicted command and the command set corresponding to the predicted command so as to obtain the target voice command conversion model.
For example, supervised adaptive training of the voice instruction conversion model proceeds as follows: the instruction converter 102 or the trainer 105 receives voice signals containing related commands, such as changing the color of an image or rotating a picture, where each command corresponds to an instruction set. For the input voice signals used in adaptive training the corresponding instruction sets are known, and the instruction converter 102 or the trainer 105 takes these voice signals as input data of the voice instruction conversion model to obtain the output predicted instructions. The instruction converter 102 or the trainer 105 then calculates the correlation coefficient between each predicted instruction and its corresponding instruction set and adaptively updates the parameters (e.g., weights, biases) of the voice instruction conversion model according to this coefficient, thereby improving the model and obtaining the target voice instruction conversion model.
The image processing apparatus 100 further includes:
and a memory 104 for storing the text information or the image processing instruction or the target area.
In a possible embodiment, the memory 104 may be the same memory module as the first and second memory modules, or may be a different memory module.
And an image processor 103 for processing the image to be processed according to the image processing instruction and the target image processing model.
Wherein the image processor 103 comprises:
the instruction fetching module 1031 is configured to obtain M image processing instructions from the storage module within a preset time window, where M is an integer greater than 1;
and the processing module 1032 is used for processing the target area according to the M image processing instructions and the target image processing model.
Optionally, the processing module 1032 is configured to:
deleting image processing instructions with the same functions from the M image processing instructions to obtain N image processing instructions, wherein N is an integer smaller than M;
and processing the target area according to the N image processing instructions and the target image processing model.
Specifically, the above-described preset time window may be understood as a preset time period. After the instruction fetching module 1031 acquires M image processing instructions from the storage module 104 within the preset time period, the processing module 1032 performs a pairwise comparison on the M image processing instructions and deletes instructions with the same function from among them, so as to obtain N image processing instructions. The processing module 1032 then processes the image to be processed according to the N image processing instructions and the target image processing model.
For example, the processing module 1032 performs a pairwise comparison of the M image processing instructions. When image processing instruction A is identical to image processing instruction B, the processing module 1032 deletes the one of A and B with the larger overhead; when image processing instruction A and image processing instruction B are different, the processing module 1032 obtains the similarity coefficient of A and B. When the similarity coefficient is greater than a similarity threshold, the processing module 1032 determines that image processing instructions A and B are identical in function and deletes the one of A and B with the larger overhead; when the similarity coefficient is smaller than the similarity threshold, the processing module 1032 determines that image processing instructions A and B are functionally different. Image processing instructions A and B are any two of the M image processing instructions.
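The deduplication pass above can be sketched as follows. The similarity measure, the `overhead` field, and all names are hypothetical; the patent only requires that functionally identical instructions be detected and that the costlier one of each pair be dropped.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    args: tuple
    overhead: int  # e.g. an estimated execution cost

def similarity(a, b):
    # Toy similarity coefficient: 1.0 for identical op and args,
    # 0.5 for the same op with different args, 0.0 otherwise.
    if a.op != b.op:
        return 0.0
    return 1.0 if a.args == b.args else 0.5

def dedup(instrs, threshold=0.9):
    kept = []
    for cand in instrs:
        for i, prev in enumerate(kept):
            if similarity(cand, prev) > threshold:  # functionally identical
                # keep the cheaper of the pair, drop the costlier one
                if cand.overhead < prev.overhead:
                    kept[i] = cand
                break
        else:
            kept.append(cand)
    return kept

m = [Instr("blur", (3,), 10), Instr("blur", (3,), 7), Instr("rotate", (90,), 5)]
n = dedup(m)  # the blur pair is functionally identical; the cheaper one survives
```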
Specifically, for the above-described image processor 103, both its input and its output are images. The image processor 103 may process the image to be processed by methods including, but not limited to, artificial neural networks and conventional computer vision methods, and the processing includes, but is not limited to: body beautification (such as leg slimming or breast augmentation), face beautification, object replacement (such as replacing a cat with a dog, a zebra with a horse, or an apple with an orange), background replacement (such as replacing a forest background with a field), de-occlusion (such as reconstructing a covered eye in a face image), style transfer (such as restyling a photo into a Van Gogh-like painting in one second), pose conversion (such as changing a standing pose to a sitting pose, or turning a frontal face into a profile), converting a non-oil painting into an oil painting, changing the color of the image background, and changing the seasonal background in which the object in the image is located.
Optionally, the image processor 103 adaptively trains an image processing model to obtain the target image processing model before the voice collector 101 receives the voice signal.
Wherein the adaptive training of the image processing model is performed offline or online.
Specifically, performing the adaptive training of the image processing model offline means that the image processor 103 performs adaptive training on the image processing model on its own hardware to obtain the target image processing model; performing the adaptive training of the image processing model online means that a cloud server different from the image processor 103 performs adaptive training on the image processing model to obtain the target image processing model. When the image processor 103 needs to use the target image processing model, it obtains the target image processing model from the cloud server.
Optionally, the adaptive training of the image processing model is supervised or unsupervised.
Specifically, the adaptive training of the image processing model is supervised specifically as follows:
the image processor 103 processes the image to be processed according to an image processing model to obtain a predicted image; then determines the correlation coefficient between the predicted image and the target image corresponding to the predicted image, wherein the target image is an image obtained by manually processing the image to be processed according to the voice signal; the image processor 103 then optimizes the image processing model according to the correlation coefficient between the predicted image and the corresponding target image to obtain the target image processing model.
In one possible embodiment, the image processing apparatus 100 further includes:
a trainer 105 for processing the image to be processed according to the image processing model to obtain a predicted image; determining a correlation coefficient between the predicted image and a target image corresponding to the predicted image; and optimizing the image processing model according to the correlation coefficient between the predicted image and the corresponding target image to obtain the target image processing model.
For example, the supervised adaptive training of the image processing model specifically includes the following: the image processor 103 or the trainer 105 receives a voice signal containing a related command, such as changing the color of an image or rotating a picture. Each command corresponds to a target image. For the input voice signals used for adaptive training, the corresponding target images are known, and the image processor 103 or the trainer 105 takes these voice signals as the input data of the image processing model to obtain the output predicted images. The image processor 103 or the trainer 105 calculates the correlation coefficient between a predicted image and its corresponding target image, and adaptively updates the parameters (such as weights and biases) in the image processing model according to the correlation coefficient, so as to improve the performance of the image processing model and thereby obtain the target image processing model.
In a possible embodiment, the instruction converter 102 of the image processing apparatus 100 may be configured to adaptively train the voice instruction conversion model in the instruction converter 102 to obtain the target voice instruction conversion model; the image processor 103 of the image processing apparatus 100 may be configured to adaptively train the image processing model in the image processor 103 to obtain the target image processing model.
In a possible embodiment, the image processing apparatus 100 further includes:
a trainer 105, which is used to adaptively train the voice instruction conversion model in the instruction converter 102 and the image processing model in the image processor 103, respectively, so as to obtain the target voice instruction conversion model and the target image processing model.
The trainer 105 may adjust the structure and parameters in the voice command conversion model or in the image processing model by a supervised method or an unsupervised method to improve the performance of the voice command conversion model or the image processing model, and finally obtain a target voice command conversion model or a target image processing model.
In the present embodiment, the image processing apparatus 100 is presented in the form of a module. "module" herein may refer to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the described functionality. In addition, the above voice collector 101, instruction converter 102, image processor 103, storage module 104, and trainer 105 may be implemented by the artificial neural network chip shown in fig. 5, 6, 7, and 8.
Alternatively, the instruction converter 102 of the image processing apparatus 100 or the processing module 1032 of the image processor 103 is an artificial neural network chip, that is, the instruction converter 102 and the processing module 1032 of the image processor 103 are two independent artificial neural network chips, and the structures thereof are shown in fig. 5 and 6, respectively.
In the present apparatus, the instruction converter 102 and the image processor 103 may execute serially, or may execute in a soft-pipelined manner, that is, while the image processor 103 processes the previous image, the instruction converter 102 may already process the next one, which improves hardware throughput and image processing efficiency.
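The soft pipeline above can be sketched as two overlapped stages. This is an assumed scheduling illustration only, with hypothetical `convert` and `process` stand-ins for the instruction converter 102 and image processor 103; the patent does not prescribe this mechanism.

```python
import queue
import threading

def pipeline(speech_signals, convert, process):
    """Two-stage soft pipeline: a converter thread feeds instructions to the
    consuming stage, so conversion of item k+1 overlaps processing of item k."""
    q, results = queue.Queue(maxsize=1), []

    def converter():
        for s in speech_signals:
            q.put(convert(s))  # stage 1: speech -> instruction
        q.put(None)            # end-of-stream marker

    t = threading.Thread(target=converter)
    t.start()
    while (instr := q.get()) is not None:
        results.append(process(instr))  # stage 2: instruction -> processed image
    t.join()
    return results
```

With `maxsize=1` the converter runs at most one item ahead of the processor, which is the overlap the serial arrangement lacks.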
Referring to fig. 5, fig. 5 is a schematic diagram of a structural framework of an artificial neural network chip. As shown in fig. 5, the chip includes:
a control unit 510, a storage unit 520, and an input/output unit 530.
Wherein the control unit 510 includes:
an instruction cache unit 511 for storing instructions to be executed, including neural network operation instructions and general operation instructions.
In one embodiment, instruction cache unit 511 may be a reorder cache.
The instruction processing module 512 is configured to obtain the neural network operation instruction or the general operation instruction from the instruction cache unit, and process the instruction and provide the instruction to the neural network operation unit 519. Wherein, the instruction processing module 512 includes:
A fetch module 513 for obtaining an instruction from the instruction cache unit;
a decoding module 514, configured to decode the acquired instruction;
an instruction queue module 515 for sequentially storing the decoded instructions.
The scalar register module 516 is configured to store operation codes and operands corresponding to the above-mentioned instructions, including a neural network operation code and operand corresponding to a neural network operation instruction, and a general operation code and operand corresponding to a general operation instruction.
The processing dependency relationship module 517 is configured to determine the instruction and the operation code and operand corresponding to the instruction sent from the instruction processing module 512, determine whether the instruction and the previous instruction access the same data, if yes, store the instruction in the storage queue unit 518, and provide the instruction in the storage queue unit to the neural network operation unit 519 after the previous instruction is executed; otherwise, the instruction is directly supplied to the above-described neural network operation unit 519.
Store queue unit 518 is configured to store two consecutive instructions that access the same memory space when the instructions access the memory unit.
Specifically, in order to ensure the correctness of the execution result, if the current instruction is detected to have a dependency on the data of a previous instruction, the current instruction must wait in the store queue unit 518 until the dependency is eliminated, after which it may be provided to the neural network operation unit.
The neural network operation unit 519 is configured to process the instruction transmitted from the instruction processing module or the storage queue unit.
The storage unit 520 includes a neuron buffer unit 521 and a weight buffer unit 522, and the neural network data model is stored in the neuron buffer unit 521 and the weight buffer unit 522.
An input-output unit 530 for inputting a voice signal and outputting an image processing instruction.
In one embodiment, the storage unit 520 may be a scratch pad memory and the input-output unit 530 may be an IO direct memory access module.
Specifically, the process by which the chip 500, that is, the instruction converter 102, converts a voice signal into an image processing instruction includes the following steps:
in step 501, the instruction fetching module 513 fetches an operation instruction for speech recognition from the instruction buffer unit 511, and sends the operation instruction to the decoding module 514.
Step 502, the decode module 514 decodes the operation instruction and sends the decoded instruction to the instruction queue 515.
Step 503, acquiring a neural network operation code and a neural network operation operand corresponding to the instruction from the scalar register module 516.
Step 504, the instruction is sent to the processing dependency relationship module 517. The processing dependency relationship module 517 examines the operation code and operand corresponding to the instruction and determines whether the instruction has a data dependency on any previously issued instruction that has not finished executing; if not, it sends the instruction directly to the neural network operation unit 519; if so, the instruction must wait in the store queue unit 518 until it no longer has a data dependency on any unfinished previous instruction, and is then sent to the neural network operation unit 519.
In step 505, the neural network operation unit 519 determines the address and size of the required data according to the operation code and operand corresponding to the instruction, and fetches the required data from the storage unit 520, including voice instruction conversion model data and the like.
Step 506, the neural network operation unit 519 executes the neural network operation corresponding to the instruction, so as to complete the corresponding processing, obtain an image processing instruction, and write the image processing instruction back to the storage unit 520.
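The dispatch logic of steps 501-506 can be modeled schematically as below. The data structures are assumptions for illustration: an instruction either issues immediately or, if it touches data that an earlier, still-unfinished instruction also touches, waits in the store queue until the dependency clears.

```python
def dispatch(instructions, in_flight):
    """instructions: list of (name, set_of_data_addresses) pairs;
    in_flight: addresses touched by not-yet-completed instructions.
    Returns the instructions issued directly and those parked in the
    store queue because of a data dependency."""
    issued, store_queue = [], []
    for name, addrs in instructions:
        if addrs & in_flight:                 # data dependency detected
            store_queue.append((name, addrs))  # wait until dependency clears
        else:
            issued.append(name)
            in_flight |= addrs                 # now occupies these addresses
    return issued, store_queue
```

In the chip, clearing the dependency (the previous instruction completing) is what releases a queued instruction to the neural network operation unit; the sketch only shows the detection side.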
It should be noted that the storage unit 520 is an on-chip cache unit of the chip shown in fig. 5.
Referring to fig. 6, fig. 6 is a schematic diagram of a structural framework of another artificial neural network chip. As shown in fig. 6, the chip includes:
a control unit 610, a storage unit 620, and an input/output unit 630.
Wherein the control unit 610 includes:
an instruction cache unit 611 for storing instructions to be executed, the instructions including a neural network operation instruction and a general operation instruction.
In one embodiment, instruction cache unit 611 may be a reorder cache.
The instruction processing module 612 is configured to obtain a neural network operation instruction or a general operation instruction from the instruction cache unit, and process the instruction and provide the instruction to the neural network operation unit 619. The instruction processing module 612 includes:
The instruction fetch module 613 is configured to obtain an instruction from the instruction cache unit;
a decoding module 614, configured to decode the acquired instruction;
an instruction queue module 615 for sequentially storing decoded instructions.
Scalar register module 616 is configured to store operation codes and operands corresponding to the above-mentioned instructions, including neural network operation codes and operands corresponding to the neural network operation instructions, and general operation codes and operands corresponding to the general operation instructions.
The processing dependency relationship module 617 is configured to determine the instruction and the operation code and operand corresponding to the instruction sent from the instruction processing module 612, determine whether the instruction and the previous instruction access the same data, if yes, store the instruction in the storage queue unit 618, and provide the instruction in the storage queue unit to the neural network operation unit 619 after the previous instruction is executed; otherwise, the instruction is directly supplied to the above-described neural network operation unit 619.
Store queue unit 618 is configured to store two consecutive instructions that access the same memory space when the instructions access the memory unit.
Specifically, in order to ensure the correctness of the execution result, if the current instruction is detected to have a dependency on the data of a previous instruction, the current instruction must wait in the store queue unit 618 until the dependency is eliminated, after which it may be provided to the neural network operation unit.
The neural network operation unit 619 is configured to process the instruction transmitted from the instruction processing module or the storage queue unit.
The storage unit 620 includes a neuron caching unit 621 and a weight caching unit 622, and the neural network data model is stored in the neuron caching unit 621 and the weight caching unit 622 described above.
An input-output unit 630 for inputting an image processing instruction and an image to be processed, and outputting the processed image.
In one embodiment, the storage unit 620 may be a scratch pad memory and the input output unit 630 may be an IO direct memory access module.
The specific steps by which the chip, that is, the processing module 1032 of the image processor 103, performs image processing include:
In step 601, the instruction fetching module 613 fetches an image processing instruction generated by the instruction converter from the instruction cache unit 611 and sends it to the decoding module 614.
Step 602, the decoding module 614 decodes the instruction and sends the decoded instruction to the instruction queue 615.
Step 603, acquiring a neural network operation code and a neural network operation operand corresponding to the instruction from the scalar register module 616.
Step 604, the instruction is sent to the processing dependency relationship module 617. The processing dependency relationship module 617 examines the operation code and operand corresponding to the instruction and determines whether the instruction has a data dependency on any previously issued instruction that has not finished executing; if not, it sends the instruction directly to the neural network operation unit 619; if so, the instruction must wait in the store queue unit 618 until it no longer has a data dependency on any unfinished previous instruction, and then the microinstruction corresponding to the instruction is sent to the neural network operation unit 619.
In step 605, the neural network operation unit 619 determines the address and size of the required data according to the operation code and operand corresponding to the instruction, and fetches the required data from the storage unit 620, including the image to be processed, the image processing model data, and so on.
Step 606, the neural network operation unit 619 performs the neural network operation corresponding to the instruction, so as to complete the corresponding processing, and write the processing result back to the storage unit 620.
It should be noted that the storage unit 620 is an on-chip cache unit of the chip shown in fig. 6.
The instruction converter 102 and the processing module 1032 of the image processor 103 may both be artificial neural network chips, may both be general-purpose processing chips, or one may be an artificial neural network chip and the other a general-purpose processing chip.
Alternatively, the image processing device may be a data processing device, a robot, a computer, a tablet computer, an intelligent terminal, a mobile phone, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, or a wearable device.
It can be seen that, in the scheme of the embodiment of the present invention, the voice collector collects the voice signal input by the user; the instruction converter converts the voice signal into an image processing instruction and a target area of the image to be processed according to the target voice instruction conversion model; and the image processor processes the target area of the image to be processed according to the image processing instruction and the target image processing model. Compared with existing image processing technology, the present invention performs image processing through voice, which saves the user the time of learning image processing software beforehand and improves the user experience.
In a possible embodiment, an image processing chip comprises the image processing device shown in fig. 1 described above.
The chip comprises a main chip and a cooperation chip;
the above-mentioned cooperation chip includes the device according to the first aspect of the embodiment of the present invention, where the above-mentioned main chip is configured to provide a start signal for the above-mentioned cooperation chip, and control transmission of an image to be processed and an image processing instruction to the above-mentioned cooperation chip.
The cooperation chip includes the chips shown in fig. 5 and 6.
Alternatively, the image processing chip may be used in a camera, a mobile phone, a computer, a notebook, a tablet computer or other image processing devices.
In one possible embodiment, the embodiment of the invention provides a chip packaging structure, which comprises the image processing chip.
In one possible embodiment, the embodiment of the invention provides a board card, which comprises the chip packaging structure.
In one possible embodiment, the embodiment of the invention provides an electronic device, which comprises the board card.
In one possible embodiment, the embodiment of the invention provides another electronic device, which comprises the board card, an interactive interface, a control unit and a voice collector.
As shown in fig. 7, the voice collector is used for receiving voice and transmitting the voice and the image to be processed as input data to a chip inside the board card.
Alternatively, the image processing chip may be an artificial neural network processing chip.
Preferably, the voice collector is a microphone or a multi-array microphone.
The chip inside the board card includes the chips shown in fig. 5 and fig. 6, and is used to obtain the corresponding output data (i.e., the processed image) and transmit the output data to the interactive interface.
The interactive interface receives the output data of the chip (which can be regarded as an artificial neural network processor) and converts the output data into feedback information in a proper form to be displayed to a user.
Wherein the control unit receives a user's operation or command and controls the operation of the entire image processing apparatus.
Optionally, the electronic device may be a data processing apparatus, a robot, a computer, a tablet computer, an intelligent terminal, a mobile phone, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, or a wearable device.
Referring to fig. 8, fig. 8 is a flowchart of an image processing method according to an embodiment of the present invention. As shown in fig. 8, the method includes:
S801, the image processing device collects voice signals input by a user.
S802, the image processing device converts the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, wherein the target area is a processing area of an image to be processed.
In a possible embodiment, the converting the speech signal into the image processing command and the target area according to the target speech command conversion model includes:
converting the voice signal into text information through a voice recognition technology;
converting the text information into the image processing instruction through a natural language processing technology and the target voice instruction conversion model;
and dividing the region of the image to be processed according to the granularity of the semantic region in the image processing instruction and the image recognition technology, and obtaining the target region.
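The region division described above can be illustrated as follows. The label map and region names are entirely hypothetical: a real system would obtain per-pixel semantic labels from an image recognition model, whereas here they are given directly, and the target area is simply the set of coordinates whose label matches the semantic region named in the image processing instruction.

```python
def target_area(label_map, region):
    """Return the coordinates of pixels whose semantic label matches the
    region named in the image processing instruction."""
    return {(r, c)
            for r, row in enumerate(label_map)
            for c, lab in enumerate(row)
            if lab == region}

# 2x2 toy "image" whose pixels carry assumed semantic labels.
labels = [["bg", "face"],
          ["bg", "face"]]
area = target_area(labels, "face")  # region named by the instruction
```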
In a possible embodiment, the converting the speech signal into the image processing command and the target area according to the target speech command conversion model includes:
converting the voice signal into the image processing instruction through a voice recognition technology, a semantic understanding technology and the voice instruction conversion model;
And dividing the region of the image to be processed according to the granularity of the semantic region in the image processing instruction and the image recognition technology, and obtaining the target region.
In a possible embodiment, the method further comprises:
storing the text information or the image processing instruction or the target area.
S803, the image processing device processes the target area according to the image processing instruction and a target image processing model.
In a possible embodiment, the processing the target area according to the image processing instruction and the target image processing model includes:
obtaining M image processing instructions from the storage module in a preset time window, wherein M is an integer greater than 1;
and processing the target area according to the M image processing instructions and the target image processing model.
In a possible embodiment, the processing the target area according to the M pieces of image processing instructions and the target image processing model includes:
deleting image processing instructions with the same functions from the M image processing instructions to obtain N image processing instructions, wherein N is an integer smaller than M;
And processing the target area according to the N image processing instructions and the target image processing model.
In a possible embodiment, before the receiving the speech signal and the image to be processed, the method further comprises:
and carrying out self-adaptive training on the voice command conversion model to obtain a target voice command conversion model.
In a possible embodiment, the adaptive training of the voice instruction conversion model is performed offline or online.
In a possible embodiment, the adaptive training of the speech instruction conversion model is supervised or unsupervised.
In a possible embodiment, the adaptively training the voice command conversion model to obtain a target voice command conversion model includes:
converting the voice signal into a prediction instruction according to the voice instruction conversion model;
determining a correlation coefficient of the predicted instruction and an instruction set corresponding to the predicted instruction;
optimizing the voice command conversion model according to the correlation coefficient of the predicted command and the command set corresponding to the predicted command so as to obtain the target voice command conversion model.
In a possible embodiment, before the receiving the speech signal and the image to be processed, the method further comprises:
And carrying out self-adaptive training on the image processing model to obtain a target image processing model.
In a possible embodiment, the adaptive training of the image processing model is performed offline or online.
In a possible embodiment, the adaptive training of the image processing model is supervised or unsupervised.
In a possible embodiment, the adaptively training the image processing model to obtain the target image processing model includes:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient of the predicted image and a target image corresponding to the predicted image;
and optimizing the image processing model according to the correlation coefficient of the predicted image and the corresponding target image so as to obtain the target image processing model.
Note that, the specific implementation of each step of the method shown in fig. 8 may refer to the specific implementation of the image processing apparatus, which is not described herein.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented in hardware.
The embodiments of the present invention have been described above in detail, and specific examples are used herein to explain the principles and implementations of the invention; the above description of the embodiments is provided only to facilitate understanding of the method and core concept of the invention. Meanwhile, those skilled in the art may make modifications to the specific implementations and application scope in accordance with the idea of the present invention. In view of the above, the content of this specification should not be construed as limiting the present invention.
Claims (38)
1. An image processing apparatus, comprising:
a voice collector used for collecting a voice signal input by a user;
an instruction converter used for converting the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, wherein the target area is a processing area of an image to be processed; and
an image processor used for processing the target area according to the image processing instruction and a target image processing model;
wherein the instruction converter comprises a first image identifier or a second image identifier, and the first image identifier or the second image identifier is used for:
dividing the image to be processed into regions according to the granularity of semantic areas in the image processing instruction and an image recognition technology, so as to obtain the target area;
the image processor includes:
an instruction fetching module used for acquiring M image processing instructions from a memory of the image processing apparatus within a preset time window; and
a processing module used for processing the target area according to the M image processing instructions and the target image processing model.
2. The image processing apparatus according to claim 1, wherein the instruction converter includes:
a first voice recognizer used for converting the voice signal into text information through a voice recognition technology; and
a voice-text converter used for converting the text information into the image processing instruction through a natural language processing technology and the target voice instruction conversion model.
3. The image processing apparatus according to claim 1, wherein the instruction converter includes:
a second voice recognizer used for converting the voice signal into the image processing instruction through a voice recognition technology, a semantic understanding technology, and the target voice instruction conversion model.
4. An image processing apparatus according to claim 1 or 3, characterized in that the image processing apparatus further comprises:
a memory used for storing the image processing instruction or the target area.
5. The image processing apparatus according to claim 2, wherein the image processing apparatus further comprises:
a memory used for storing the text information, the image processing instruction, or the target area.
6. The image processing apparatus according to claim 1, wherein the processing module is configured to:
deleting image processing instructions having the same function from the M image processing instructions to obtain N image processing instructions, wherein M is an integer greater than 1 and N is an integer smaller than M; and
and processing the target area according to the N image processing instructions and the target image processing model.
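Purely as an illustrative sketch outside the claim language, the de-duplication step of claim 6 — reducing M fetched instructions to N by removing instructions with the same function — could be modelled as follows; the dictionary shape and the `function` key are assumptions for illustration, not part of the patent:

```python
def deduplicate_instructions(instructions):
    """Reduce the M image processing instructions fetched in one time
    window to N unique ones by dropping later instructions whose
    function duplicates an earlier one (N <= M)."""
    seen_functions = set()
    unique = []
    for inst in instructions:
        func = inst["function"]  # assumed key naming the operation
        if func not in seen_functions:
            seen_functions.add(func)
            unique.append(inst)
    return unique

# Example: M = 4 instructions acquired within the preset time window
m_instructions = [
    {"function": "blur", "region": "face"},
    {"function": "brighten", "region": "sky"},
    {"function": "blur", "region": "face"},    # same function, dropped
    {"function": "crop", "region": "center"},
]
n_instructions = deduplicate_instructions(m_instructions)  # N = 3 remain
```

Keeping the first occurrence of each function preserves the order in which instructions were fetched within the time window.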
7. The image processing apparatus according to claim 1, wherein the instruction converter is configured to:
performing adaptive training on a voice instruction conversion model to obtain the target voice instruction conversion model.
8. The image processing apparatus of claim 7, wherein the instruction converter performs the adaptive training on the voice instruction conversion model offline or online.
9. The image processing apparatus according to claim 7 or 8, wherein the adaptive training of the voice instruction conversion model by the instruction converter is supervised or unsupervised.
10. The image processing apparatus of claim 7, wherein the instruction converter is further configured to:
converting the voice signal into a predicted instruction according to the voice instruction conversion model;
determining a correlation coefficient between the predicted instruction and an instruction set corresponding to the predicted instruction; and
optimizing the voice instruction conversion model according to the correlation coefficient between the predicted instruction and the corresponding instruction set, so as to obtain the target voice instruction conversion model.
11. The image processing apparatus according to claim 7, wherein the image processing apparatus further comprises:
a trainer used for: converting the voice signal into a predicted instruction according to the voice instruction conversion model; determining a correlation coefficient between the predicted instruction and an instruction set corresponding to the predicted instruction; and optimizing the voice instruction conversion model according to the correlation coefficient, so as to obtain the target voice instruction conversion model.
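As a hedged sketch of the adaptive-training loop recited in claims 10 and 11 (predict an instruction, score it against the corresponding instruction set, optimize the model): the toy model, the 0/1 correlation measure, and the update rule below are all illustrative assumptions, not the patented method:

```python
def correlation(predicted, target_set):
    """Toy stand-in for the claimed 'correlation coefficient': 1.0 when the
    predicted instruction appears in the target instruction set, else 0.0."""
    return 1.0 if predicted in target_set else 0.0

class ToyModel:
    """Minimal stand-in for a voice instruction conversion model: maps each
    voice signal to the instruction it has been reinforced with most often."""
    def __init__(self):
        self.counts = {}  # (signal, instruction) -> reinforcement count

    def predict(self, signal):
        candidates = {i: c for (s, i), c in self.counts.items() if s == signal}
        return max(candidates, key=candidates.get) if candidates else None

    def update(self, signal, instruction):
        key = (signal, instruction)
        self.counts[key] = self.counts.get(key, 0) + 1

def adaptive_train(model, samples, epochs=3):
    """Sketch of the claimed loop: convert the signal into a predicted
    instruction, measure its correlation with the corresponding instruction
    set, and optimize the model when the correlation is low."""
    for _ in range(epochs):
        for signal, instruction_set in samples:
            predicted = model.predict(signal)
            if correlation(predicted, instruction_set) < 1.0:
                model.update(signal, instruction_set[0])  # nudge toward target
    return model  # plays the role of the target conversion model

samples = [("blur the face", ["blur_face"]), ("brighten the sky", ["brighten_sky"])]
trained = adaptive_train(ToyModel(), samples)
```

After training, the model reproduces the target instruction for each known signal; a real conversion model would of course use speech recognition and a learned mapping rather than a lookup.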
12. The image processing apparatus of claim 1, wherein the image processor is configured to:
and carrying out self-adaptive training on the image processing model to obtain the target image processing model.
13. The image processing apparatus of claim 12, wherein the image processor performs the adaptive training on the image processing model offline or online.
14. The image processing apparatus according to claim 12 or 13, wherein the adaptive training of the image processing model by the image processor is supervised or unsupervised.
15. The image processing apparatus of claim 12, wherein the image processor is further configured to:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient between the predicted image and a target image corresponding to the predicted image; and
optimizing the image processing model according to the correlation coefficient between the predicted image and the corresponding target image, so as to obtain the target image processing model.
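Claim 15's optimization cycle — compare a predicted image with its target image and optimize the model accordingly — can be illustrated numerically; the one-parameter pixel model, the mean-error measure standing in for the correlation coefficient, and the update step are all assumptions for illustration only:

```python
def optimize_image_model(weight, training_pairs, lr=0.5, steps=50):
    """Toy image processing model: predicted pixel = weight * input pixel.
    Each step compares the predicted image with its target image and
    moves the weight to reduce the mean error (the stand-in here for
    optimizing by a correlation coefficient)."""
    for _ in range(steps):
        for image, target in training_pairs:
            predicted = [weight * px for px in image]
            mean_error = sum(t - p for t, p in zip(target, predicted)) / len(image)
            weight += lr * mean_error  # optimize toward the target images
    return weight

# Target images are exactly 2x brighter than the inputs, so the trained
# weight should converge to 2.0:
pairs = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 1.0], [6.0, 2.0])]
trained_weight = optimize_image_model(1.0, pairs)
```

A real target image processing model would be a neural network over pixel arrays; the single scalar weight simply makes the predicted-versus-target optimization visible.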
16. The image processing apparatus of claim 11, wherein the trainer is further configured to:
processing the image to be processed according to an image processing model to obtain a predicted image;
determining a correlation coefficient between the predicted image and a target image corresponding to the predicted image; and
optimizing the image processing model according to the correlation coefficient between the predicted image and the corresponding target image, so as to obtain the target image processing model.
17. The image processing apparatus according to claim 1, wherein, before the collecting of the voice signal input by the user and the acquiring of the image to be processed, the image processing apparatus is further configured to:
and carrying out self-adaptive training on the voice command conversion model to obtain the target voice command conversion model.
18. The image processing apparatus of claim 17, wherein the adaptive training of the voice instruction conversion model is performed offline or online.
19. The image processing apparatus according to claim 17 or 18, wherein the adaptive training of the voice instruction conversion model is supervised or unsupervised.
20. The image processing apparatus according to claim 17, wherein the image processing apparatus is specifically configured to:
converting the voice signal into a predicted instruction according to the voice instruction conversion model;
determining a correlation coefficient between the predicted instruction and an instruction set corresponding to the predicted instruction; and
optimizing the voice instruction conversion model according to the correlation coefficient between the predicted instruction and the corresponding instruction set, so as to obtain the target voice instruction conversion model.
21. An image processing method, comprising:
collecting a voice signal input by a user;
converting the voice signal into an image processing instruction and a target area according to a target voice instruction conversion model, wherein the target area is a processing area of an image to be processed, and the image to be processed is divided into regions according to the granularity of semantic regions in the image processing instruction and an image recognition technology to obtain the target area;
processing the target area according to the image processing instruction and a target image processing model;
wherein the processing of the target area according to the image processing instruction and the target image processing model comprises:
obtaining M image processing instructions from a memory in a preset time window, wherein M is an integer greater than 1;
and processing the target area according to the M image processing instructions and the target image processing model.
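The method steps of claim 21 can be sketched end-to-end as below; every component is a hypothetical stub standing in for the claimed voice instruction conversion, semantic region division, and target image processing model:

```python
def process_voice_command(voice_signal, image, convert, divide, apply_model):
    """End-to-end sketch of claim 21: convert a voice signal into an image
    processing instruction, divide the image to find the target area, then
    process only that area with the image processing model."""
    instruction = convert(voice_signal)        # voice instruction conversion
    target_area = divide(image, instruction)   # semantic-granularity division
    image[target_area] = apply_model(image[target_area], instruction)
    return image

# Hypothetical stubs; the "image" is a dict keyed by region name purely
# for illustration:
convert = lambda signal: {"op": "blur", "region": "face"}
divide = lambda img, instr: instr["region"]
apply_model = lambda pixels, instr: f"{instr['op']}({pixels})"

result = process_voice_command(
    "blur my face", {"face": "face_px", "sky": "sky_px"},
    convert, divide, apply_model)
```

Only the target area is rewritten; the rest of the image passes through untouched, which mirrors the claim's restriction of processing to the target area.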
22. The method of claim 21, wherein the converting of the voice signal into the image processing instruction and the target area according to the target voice instruction conversion model comprises:
converting the voice signal into text information through a voice recognition technology;
and converting the text information into the image processing instruction through a natural language processing technology and the target voice instruction conversion model.
23. The method of claim 21, wherein the converting of the voice signal into the image processing instruction and the target area according to the target voice instruction conversion model comprises:
converting the voice signal into the image processing instruction through a voice recognition technology, a semantic understanding technology, and the target voice instruction conversion model.
24. The method according to claim 21 or 23, further comprising:
storing the image processing instruction or the target area.
25. The method of claim 22, wherein the method further comprises:
storing the text information or the image processing instruction or the target area.
26. The method of claim 21, wherein the processing of the target area according to the M image processing instructions and the target image processing model comprises:
deleting image processing instructions having the same function from the M image processing instructions to obtain N image processing instructions, wherein N is an integer smaller than M; and
and processing the target area according to the N image processing instructions and the target image processing model.
27. The method of claim 21, wherein, before the collecting of the voice signal input by the user and the acquiring of the image to be processed, the method further comprises:
performing adaptive training on a voice instruction conversion model to obtain the target voice instruction conversion model.
28. The method of claim 27, wherein the adaptive training of the voice instruction conversion model is performed offline or online.
29. The method of claim 27 or 28, wherein the adaptive training of the voice instruction conversion model is supervised or unsupervised.
30. The method of claim 27, wherein the adaptive training of the voice instruction conversion model to obtain the target voice instruction conversion model comprises:
converting the voice signal into a predicted instruction according to the voice instruction conversion model;
determining a correlation coefficient between the predicted instruction and an instruction set corresponding to the predicted instruction; and
optimizing the voice instruction conversion model according to the correlation coefficient between the predicted instruction and the corresponding instruction set, so as to obtain the target voice instruction conversion model.
31. The method of claim 21, wherein, before the collecting of the voice signal input by the user and the acquiring of the image to be processed, the method further comprises:
performing adaptive training on an image processing model to obtain the target image processing model.
32. The method of claim 31, wherein the adaptive training of the image processing model is performed offline or online.
33. The method of claim 31 or 32, wherein the adaptive training of the image processing model is supervised or unsupervised.
34. The method of claim 31, wherein the adaptive training of the image processing model to obtain the target image processing model comprises:
processing the image to be processed according to the image processing model to obtain a predicted image;
determining a correlation coefficient between the predicted image and a target image corresponding to the predicted image; and
optimizing the image processing model according to the correlation coefficient between the predicted image and the corresponding target image, so as to obtain the target image processing model.
35. An image processing chip, characterized in that the image processing chip comprises an image processing apparatus according to any one of claims 1-20.
36. A chip package structure, characterized in that the chip package structure comprises the chip of claim 35.
37. A board comprising the chip package structure of claim 36.
38. An electronic device comprising the board card of claim 37.
Priority Applications (14)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710913272.4A CN109584864B (en) | 2017-09-29 | 2017-09-29 | Image processing apparatus and method |
KR1020197032702A KR102379954B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
US16/615,255 US11532307B2 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
KR1020197032701A KR102380494B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
EP19215862.4A EP3667488B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
KR1020197028486A KR102317958B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
PCT/CN2018/108696 WO2019062931A1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
JP2019556201A JP6810283B2 (en) | 2017-09-29 | 2018-09-29 | Image processing equipment and method |
EP19215861.6A EP3667487B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
EP18861574.4A EP3627499B1 (en) | 2017-09-29 | 2018-09-29 | Image processing apparatus and method |
JP2019211746A JP6893968B2 (en) | 2017-09-29 | 2019-11-22 | Image processing equipment and method |
JP2019211745A JP6810232B2 (en) | 2017-09-29 | 2019-11-22 | Image processing equipment and method |
US16/718,981 US11437032B2 (en) | 2017-09-29 | 2019-12-18 | Image processing apparatus and method |
US16/719,035 US11450319B2 (en) | 2017-09-29 | 2019-12-18 | Image processing apparatus and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710913272.4A CN109584864B (en) | 2017-09-29 | 2017-09-29 | Image processing apparatus and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109584864A CN109584864A (en) | 2019-04-05 |
CN109584864B true CN109584864B (en) | 2023-11-24 |
Family
ID=65919261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710913272.4A Active CN109584864B (en) | 2017-09-29 | 2017-09-29 | Image processing apparatus and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109584864B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112216275B (en) * | 2019-07-10 | 2024-07-19 | 阿里巴巴集团控股有限公司 | Voice information processing method and device and electronic equipment |
CN111324202A (en) * | 2020-02-19 | 2020-06-23 | 中国第一汽车股份有限公司 | Interaction method, device, equipment and storage medium |
CN111464862A (en) * | 2020-04-24 | 2020-07-28 | 张咏 | Video screenshot method based on voice recognition and image processing |
CN113467735A (en) * | 2021-06-16 | 2021-10-01 | 荣耀终端有限公司 | Image adjusting method, electronic device and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4726065A (en) * | 1984-01-26 | 1988-02-16 | Horst Froessl | Image manipulation by speech signals |
CA2247795A1 (en) * | 1997-09-26 | 1999-03-26 | Adobe Systems Incorporated | Associating text derived from audio with an image |
JP2000029585A (en) * | 1998-07-08 | 2000-01-28 | Canon Inc | Voice command recognizing image processor |
JP2004007502A (en) * | 2002-03-29 | 2004-01-08 | Fuji Photo Film Co Ltd | Image processing system and image processing apparatus and portable information communication equipment |
JP2006181874A (en) * | 2004-12-27 | 2006-07-13 | Fuji Xerox Co Ltd | Image forming apparatus and method of processing image |
CN201114377Y (en) * | 2007-08-13 | 2008-09-10 | 天津三星电子有限公司 | Portable digital camera with voice recognition function |
CN103796053A (en) * | 2012-10-26 | 2014-05-14 | 三星电子株式会社 | Image processing apparatus and control method thereof, and image processing system |
CN104375884A (en) * | 2013-08-15 | 2015-02-25 | 联想(北京)有限公司 | Information processing method and electronic equipment |
CN104883587A (en) * | 2012-11-09 | 2015-09-02 | 三星电子株式会社 | Display Apparatus, Voice Acquiring Apparatus And Voice Recognition Method Thereof |
CN105912717A (en) * | 2016-04-29 | 2016-08-31 | 广东小天才科技有限公司 | Image-based information searching method and device |
CN105979035A (en) * | 2016-06-28 | 2016-09-28 | 广东欧珀移动通信有限公司 | AR image processing method and device as well as intelligent terminal |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101917182B1 (en) * | 2012-04-30 | 2019-01-24 | 삼성전자주식회사 | Image processing apparatus, voice acquiring apparatus, voice recognition method thereof and voice recognition system |
US9412366B2 (en) * | 2012-09-18 | 2016-08-09 | Adobe Systems Incorporated | Natural language image spatial and tonal localization |
KR102019719B1 (en) * | 2013-01-17 | 2019-09-09 | 삼성전자 주식회사 | Image processing apparatus and control method thereof, image processing system |
KR102053820B1 (en) * | 2013-07-02 | 2019-12-09 | 삼성전자주식회사 | Server and control method thereof, and image processing apparatus and control method thereof |
KR102155482B1 (en) * | 2013-10-15 | 2020-09-14 | 삼성전자 주식회사 | Display apparatus and control method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109584864A (en) | 2019-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3667487B1 (en) | Image processing apparatus and method | |
US11437032B2 (en) | Image processing apparatus and method | |
US11450319B2 (en) | Image processing apparatus and method | |
EP3859488B1 (en) | Signal processing device, signal processing method and related product | |
US11544059B2 (en) | Signal processing device, signal processing method and related products | |
CN109584864B (en) | Image processing apparatus and method | |
US11138903B2 (en) | Method, apparatus, device and system for sign language translation | |
CN110968235B (en) | Signal processing device and related product | |
CN109785843B (en) | Image processing apparatus and method | |
CN109584862B (en) | Image processing apparatus and method | |
KR20220027241A (en) | Motion recognition method, device and storage medium | |
CN110968285A (en) | Signal processing device and related product | |
CN110969246A (en) | Signal processing device and related product | |
CN112102830B (en) | Coarse granularity instruction identification method and device | |
CN113362816A (en) | Augmented reality interaction method, device and system, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |