CN112309389A - Information interaction method and device

Information interaction method and device

Info

Publication number
CN112309389A
CN112309389A
Authority
CN
China
Prior art keywords
user
target text
information
intention
audio
Prior art date
Legal status
Pending
Application number
CN202010135362.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010135362.7A
Publication of CN112309389A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 5/00 Electrically-operated educational appliances
    • G09B 5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B 5/065 Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech

Abstract

Embodiments of the present disclosure disclose an information interaction method and device. One embodiment of the method comprises: in response to receiving a click-to-read request initiated by a user, acquiring an image to be recognized and voice information input by the user, wherein the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object comprises a target number of characters; recognizing the image to be recognized to obtain a target text comprising the target number of characters; recognizing the voice information to obtain intention information representing the user's intention; and acquiring and outputting the audio corresponding to the target text based on the intention information. This embodiment performs click-to-read based on the user's gesture and voice, recognizes the user's real intention during click-to-read, and outputs audio matching that intention, thereby improving the accuracy of information interaction.

Description

Information interaction method and device
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to an information interaction method and device.
Background
Click-to-read is a technology designed to assist students in learning. Click-to-read devices, such as reading machines and reading pens, already exist in the prior art.
Specifically, when a user encounters an unfamiliar character or word while studying, the user can point at it with a gesture. The click-to-read device then recognizes the gesture to determine the object the user is pointing at, and acquires and plays the pronunciation of that object.
Disclosure of Invention
Embodiments of the present disclosure provide an information interaction method and device.
In a first aspect, an embodiment of the present disclosure provides an information interaction method, the method comprising: in response to receiving a click-to-read request initiated by a user, acquiring an image to be recognized and voice information input by the user, wherein the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object comprises a target number of characters; recognizing the image to be recognized to obtain a target text comprising the target number of characters; recognizing the voice information to obtain intention information representing the user's intention; and acquiring and outputting the audio corresponding to the target text based on the intention information.
In some embodiments, acquiring and outputting the audio corresponding to the target text based on the intention information includes: in response to the intention information indicating that the user's intention is to identify characters, acquiring and outputting the audio corresponding to the characters included in the target text.
In some embodiments, acquiring and outputting the audio corresponding to the target text based on the intention information includes: in response to the intention information indicating that the user's intention is to identify a vocabulary, determining whether the target text includes a vocabulary; and in response to the target text including a vocabulary, acquiring and outputting the audio corresponding to the vocabulary included in the target text.
In some embodiments, acquiring and outputting the audio corresponding to the target text based on the intention information further includes: in response to the target text not including a vocabulary, acquiring and outputting the audio corresponding to the characters included in the target text.
In some embodiments, after acquiring and outputting the audio corresponding to the target text, the method further includes: acquiring the read-after audio input by the user for the output audio; and matching the read-after audio with the output audio corresponding to the target text to obtain a matching result and outputting the matching result.
In a second aspect, an embodiment of the present disclosure provides an information interaction apparatus, comprising: a first acquisition unit configured to acquire, in response to receiving a click-to-read request initiated by a user, an image to be recognized and voice information input by the user, wherein the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object comprises a target number of characters; a first recognition unit configured to recognize the image to be recognized and obtain a target text comprising the target number of characters; a second recognition unit configured to recognize the voice information and obtain intention information representing the user's intention; and an output unit configured to acquire and output the audio corresponding to the target text based on the intention information.
In some embodiments, the output unit is further configured to: in response to the intention information indicating that the user's intention is to identify characters, acquire and output the audio corresponding to the characters included in the target text.
In some embodiments, the output unit includes: a determination module configured to determine, in response to the intention information indicating that the user's intention is to identify a vocabulary, whether the target text includes a vocabulary; and a first output module configured to acquire and output, in response to the target text including a vocabulary, the audio corresponding to the vocabulary included in the target text.
In some embodiments, the output unit further comprises: a second output module configured to acquire and output, in response to the target text not including a vocabulary, the audio corresponding to the characters included in the target text.
In some embodiments, the apparatus further comprises: a second acquisition unit configured to acquire the read-after audio input by the user for the output audio; and a matching unit configured to match the read-after audio with the output audio corresponding to the target text, obtain a matching result, and output the matching result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the information interaction method described above.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method of any embodiment of the information interaction method described above.
According to the information interaction method and device provided by embodiments of the present disclosure, an image to be recognized and voice information input by a user are acquired in response to receiving a click-to-read request initiated by the user, where the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user and the click-to-read object comprises a target number of characters. The image to be recognized is recognized to obtain a target text comprising the target number of characters; the voice information is then recognized to obtain intention information representing the user's intention; finally, the audio corresponding to the target text is acquired and output based on the intention information. Click-to-read can thus be performed based on the user's gesture and voice, the user's real intention can be recognized during click-to-read, and audio matching that real intention is output, improving the accuracy of information interaction.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an information interaction method according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of an information interaction method according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of an information interaction method according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an information-interacting device, according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the information interaction method or information interaction apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various client applications may be installed on the terminal devices 101, 102, 103, such as click-to-read software, educational learning software, search applications, instant messaging tools, mailbox clients, and social platform software.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices having a shooting function and a voice input function, including but not limited to a smartphone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not specifically limited herein.
The server 105 may be a server providing various services, such as a background server providing support for software installed on the terminal devices 101, 102, 103. The background server may analyze and perform other processing on the received data such as the click-to-read request, and feed back a processing result (e.g., audio corresponding to the target text) to the terminal device.
It should be noted that the information interaction method provided by the embodiment of the present disclosure may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105, and accordingly, the information interaction apparatus may be disposed in the terminal devices 101, 102, and 103, or may be disposed in the server 105.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In a case where data used in outputting audio corresponding to the target text is not required to be acquired from a remote location, the system architecture may include only a terminal device or a server without a network.
With continued reference to FIG. 2, a flow 200 of one embodiment of an information interaction method according to the present disclosure is shown. The information interaction method comprises the following steps:
step 201, in response to receiving a click-to-read request initiated by a user, acquiring an image to be recognized and voice information input by the user.
In this embodiment, the execution body of the information interaction method (for example, the terminal device shown in fig. 1) may, in response to receiving a click-to-read request initiated by a user, acquire the image to be recognized and the voice information input by the user through a wired or wireless connection. The click-to-read request may be used to request the audio of the click-to-read object pointed at by the user. The image to be recognized may be an image obtained by shooting the click-to-read object pointed at by the user. The click-to-read object may include a target number of characters. The target number may be a preset number, for example 3; alternatively, the target number may be a number determined based on the text pointed at by the user.
As an example, after initiating a click-to-read request, a user may point at a certain character. The execution body may detect the two characters adjacent to the pointed character. If both adjacent characters are detected, the pointed character and its two neighbors may be determined as the click-to-read object (in this case, the click-to-read object includes three characters, i.e., the preset number). If the execution body detects only one character adjacent to the pointed character, the pointed character and that one neighbor may be determined as the click-to-read object (in this case, the click-to-read object includes two characters, i.e., a number determined based on the text pointed at by the user).
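To make the rule concrete, the following Python sketch assembles the click-to-read object from the pointed character and whatever neighbors were detected. It is an illustration only; the function name and the placeholder characters are assumptions, not part of the patent.

```python
from typing import Optional

def build_click_to_read_object(pointed: str,
                               left: Optional[str],
                               right: Optional[str]) -> str:
    """Assemble the click-to-read object from the pointed character and
    whichever adjacent characters were actually detected (None = not found)."""
    return "".join(c for c in (left, pointed, right) if c is not None)

# Both neighbors detected: three characters, the preset target number.
print(build_click_to_read_object("b", left="a", right="c"))   # "abc"
# Only one neighbor detected: two characters, a number determined by context.
print(build_click_to_read_object("b", left="a", right=None))  # "ab"
```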
In practice, the user may point at the text in various ways; for example, the user may point with a finger or with a pen. The text pointed at by the user may be text on various objects, such as text in a book or text displayed on an electronic device.
Specifically, the user may initiate the click-to-read request in various ways; for example, the user may utter a preset phrase (e.g., "I click to read") or press a preset button. After initiating the click-to-read request, the user may point at the click-to-read object and input voice information. The voice information may indicate the target the user wishes to have read out; for example, the voice information may be "how do I read this character" or "how do I read this word".
In practice, after receiving the click-to-read request initiated by the user, the execution body may start its shooting function and voice receiving function. The execution body may then shoot the click-to-read object pointed at by the user to obtain the image to be recognized, and receive the voice information input by the user.
Step 202, recognizing the image to be recognized to obtain a target text comprising the target number of characters.
In this embodiment, based on the image to be recognized obtained in step 201, the execution body may recognize the image to be recognized and obtain a target text including the target number of characters.
Specifically, the execution body may recognize the image to be recognized by various methods to obtain the target text. For example, the execution body may recognize the image using OCR (Optical Character Recognition). OCR determines the shape of a character by detecting its dark and light patterns and then translates that shape into computer text using character recognition methods. Alternatively, the execution body may recognize the image to be recognized using a deep learning method, which involves collecting training samples, training a deep learning model with those samples, and recognizing the image to be recognized with the trained model. Since deep learning methods are well known, they are not described in detail here.
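As a hedged illustration of the OCR route, the sketch below uses the open-source pytesseract wrapper. The patent does not prescribe any particular library; the file path and language code are assumptions.

```python
import pytesseract  # open-source Python wrapper around the Tesseract OCR engine
from PIL import Image

def recognize_target_text(image_path: str) -> str:
    """Recognize the characters in the image to be recognized and return the
    target text as one string ('chi_sim' selects the simplified-Chinese model)."""
    image = Image.open(image_path)
    raw = pytesseract.image_to_string(image, lang="chi_sim")
    # Collapse whitespace so the result is a single target-text string.
    return "".join(raw.split())

target_text = recognize_target_text("image_to_be_recognized.png")  # assumed path
```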
Step 203, recognizing the voice information to obtain intention information representing the user's intention.
In this embodiment, based on the voice information obtained in step 201, the execution body may recognize the voice information to obtain intention information representing the user's intention. The intention information may be used to characterize whether the target the user wishes to have read is a character or a vocabulary.
Specifically, the execution body may first convert the voice information into text information using an existing speech recognition method, and then perform semantic analysis on the text information using an existing semantic analysis method to obtain the intention information representing the user's intention.
As an example, suppose the voice information input by the user is "how do I read this word". The execution body may first convert this voice information into the text information "how do I read this word", and then perform semantic analysis on that text information to obtain the intention information "vocabulary", which characterizes that the target the user wishes to have read is a vocabulary.
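A minimal sketch of this intent step follows, assuming the speech has already been transcribed by an existing recognizer and reducing the semantic analysis to keyword matching; the cue words are illustrative assumptions, not the patent's method.

```python
def recognize_intention(transcript: str) -> str:
    """Map a transcribed utterance to intention information: 'vocabulary'
    if the user asks about a word, otherwise 'character'."""
    vocabulary_cues = ("word", "vocabulary")  # assumed cue words
    character_cues = ("character",)           # assumed cue words
    if any(cue in transcript for cue in vocabulary_cues):
        return "vocabulary"
    if any(cue in transcript for cue in character_cues):
        return "character"
    return "character"  # default to character-level reading

print(recognize_intention("how do I read this word"))  # "vocabulary"
```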
It should be noted that this step may be executed before step 202 or simultaneously with step 202; the execution order of steps 202 and 203 described in this embodiment should not be construed as limiting this application.
And step 204, acquiring and outputting the audio corresponding to the target text based on the intention information.
In this embodiment, based on the intention information obtained in step 203, the execution body may acquire the audio corresponding to the target text and output it. The audio corresponding to the target text may be the audio corresponding to the characters included in the target text, or the audio corresponding to the vocabulary included in the target text, as determined by the intention information.
In some optional implementations of this embodiment, the execution body may, in response to the intention information indicating that the user's intention is to identify characters, acquire and output the audio corresponding to the characters included in the target text.
As an example, suppose the target text is "not logged in" and the intention information obtained in step 203 is "characters". In response to the intention information indicating that the user's intention is to identify characters, the execution body may acquire and output the audio corresponding to each of the characters included in the target text.
It should be noted that intention information characterizing the user's intention as identifying characters may include a character identifier; after obtaining the intention information, the execution body may then determine whether the intention information characterizes the user's intention as identifying characters. The character identifier may be the word "character" or an identifier preset by a technician, such as "W".
In some optional implementations of this embodiment, the execution body may, in response to the intention information indicating that the user's intention is to identify a vocabulary, determine whether the target text includes a vocabulary, and, in response to the target text including a vocabulary, acquire and output the audio corresponding to the vocabulary included in the target text.
As an example, suppose the target text is "not logged in" and the intention information obtained in step 203 is "vocabulary". In response to the intention information indicating that the user's intention is to identify a vocabulary, the execution body may acquire and output the audio corresponding to the vocabulary "logged in" included in the target text.
It should be noted that intention information characterizing the user's intention as identifying a vocabulary may include a vocabulary identifier; after obtaining the intention information, the execution body may then determine whether the intention information characterizes the user's intention as identifying a vocabulary. The vocabulary identifier may be the word "vocabulary" or an identifier preset by a technician, such as "C".
In practice, a character audio library and a vocabulary audio library may be set up in advance. When the audio to be acquired is the audio corresponding to the characters included in the target text, the execution body may look up the audio for those characters in the character audio library; when the audio to be acquired is the audio corresponding to the vocabulary included in the target text, the execution body may look up the audio for that vocabulary in the vocabulary audio library.
In some optional implementations of this embodiment, after determining that the intention information characterizes the user's intention as identifying a vocabulary, the execution body may further, in response to the target text not including a vocabulary, acquire and output the audio corresponding to the characters included in the target text.
In practice, there are cases where the user's intention is to identify a vocabulary but the target text does not include one, which may be caused by the user misspeaking. For example, the voice information input by the user is "how do I read this word", but the target text includes only the single character "floating". In this case, the execution body may directly acquire and output the audio corresponding to the characters included in the target text.
This implementation accounts for information interaction when the user misspeaks, which helps improve the diversity and robustness of information interaction; moreover, the user's real intention can be recognized automatically even when the user misspeaks, improving the user experience.
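Putting the library lookup and the misstatement fallback together, here is a hedged Python sketch; the dict "libraries", file names, and placeholder characters are stand-in assumptions rather than the patent's data.

```python
# Stand-in audio libraries: text -> audio file path (illustrative entries only).
CHARACTER_AUDIO_LIB = {"a": "a.mp3", "b": "b.mp3", "c": "c.mp3"}
VOCABULARY_AUDIO_LIB = {"ab": "ab.mp3"}

def audio_for_target_text(target_text: str, intention: str) -> list:
    """Return the audio files to output for the target text, honoring the
    user's intention and falling back to characters when no vocabulary fits."""
    if intention == "vocabulary":
        for vocab, audio in VOCABULARY_AUDIO_LIB.items():
            if vocab in target_text:
                return [audio]
        # Misstatement fallback: the user asked for a vocabulary, but the
        # target text contains none, so read its characters instead.
    return [CHARACTER_AUDIO_LIB[ch] for ch in target_text
            if ch in CHARACTER_AUDIO_LIB]

print(audio_for_target_text("abc", "vocabulary"))  # ['ab.mp3']
print(audio_for_target_text("c", "vocabulary"))    # ['c.mp3'] via the fallback
```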
In this embodiment, after acquiring the audio corresponding to the target text, the execution body may output the acquired audio. Specifically, if the execution body is a terminal device, it may output the audio corresponding to the target text directly to the user; if the execution body is a server communicatively connected to a terminal device, it may send the audio corresponding to the target text to the terminal device, which then outputs it to the user.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the information interaction method according to this embodiment. In the application scenario of fig. 3, the terminal device 301 may first, in response to receiving a click-to-read request 303 initiated by the user 302, acquire an image to be recognized 304 and voice information 305 input by the user (for example, "how do I read this word"), where the image to be recognized 304 may be an image obtained by shooting the click-to-read object pointed at by the user 302, and the click-to-read object includes four characters (i.e., the target number). The terminal device 301 may then recognize the image to be recognized 304 to obtain a target text 306 (e.g., "dribble") comprising four characters. Next, the terminal device 301 may recognize the voice information 305 to obtain intention information 307 (e.g., "identify vocabulary") representing the user's intention. Finally, the terminal device 301 may acquire the audio 308 (e.g., the "dribble" audio) corresponding to the target text 306 based on the intention information 307 and send the audio 308 to the user 302.
The method provided by the embodiments of the present disclosure performs click-to-read based on the user's gesture and voice, which helps identify the user's real intention during click-to-read and output audio matching that real intention, improving the accuracy of information interaction.
With further reference to FIG. 4, a flow 400 of yet another embodiment of an information interaction method is shown. The process 400 of the information interaction method includes the following steps:
step 401, in response to receiving a click-to-read request initiated by a user, acquiring an image to be recognized and voice information input by the user.
In this embodiment, the execution body of the information interaction method (for example, the terminal device shown in fig. 1) may, in response to receiving a click-to-read request initiated by a user, acquire the image to be recognized and the voice information input by the user through a wired or wireless connection. The click-to-read request may be used to request the audio of the click-to-read object pointed at by the user. The image to be recognized may be an image obtained by shooting the click-to-read object pointed at by the user. The click-to-read object may include a target number of characters. The target number may be a preset number; alternatively, it may be a number determined based on the text pointed at by the user.
Step 402, recognizing the image to be recognized to obtain a target text comprising the target number of characters.
In this embodiment, based on the image to be recognized obtained in step 401, the execution body may recognize the image to be recognized and obtain a target text including the target number of characters.
And step 403, recognizing the voice information to obtain intention information representing the user's intention.
In this embodiment, based on the voice information obtained in step 401, the execution body may recognize the voice information to obtain intention information representing the user's intention. The intention information may be used to characterize whether the target the user wishes to have read is a character or a vocabulary.
And step 404, acquiring and outputting the audio corresponding to the target text based on the intention information.
In this embodiment, based on the intention information obtained in step 403, the execution body may acquire the audio corresponding to the target text and output it. The audio corresponding to the target text may be the audio corresponding to the characters included in the target text or the audio corresponding to the vocabulary included in the target text, as determined by the intention information.
Steps 401, 402, 403, and 404 may be performed in a manner similar to that of steps 201, 202, 203, and 204 in the foregoing embodiment, respectively, and the above description for steps 201, 202, 203, and 204 also applies to steps 401, 402, 403, and 404, and is not repeated here.
Step 405, acquiring the read-after audio input by the user for the output audio.
In this embodiment, after outputting the audio corresponding to the target text, the execution body may acquire the read-after audio that the user inputs for the output audio.
In practice, after receiving the audio corresponding to the target text, the user may imitate the pronunciation in that audio and input read-after audio. Specifically, the read-after audio may be audio captured by the execution body within a preset time range (for example, 5 seconds) after the audio corresponding to the target text is output.
As an example, the execution body may first output the audio corresponding to the target text, then listen for the user's voice and, in response to detecting the user's voice within the preset time range, treat that voice as the read-after audio.
It should be noted that if the user's voice is not detected within the preset time range (for example, 5 seconds) after the audio corresponding to the target text is output, the execution body may output the audio corresponding to the target text again and listen for the user's voice again.
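The output-then-listen loop above can be sketched as follows; `play` and `listen` are hypothetical stubs standing in for the device's speaker and microphone APIs, and the timeout and retry count are assumptions.

```python
from typing import Optional

def play(audio_path: str) -> None:
    """Stub for the device's audio player (placeholder only)."""
    print(f"playing {audio_path}")

def listen(timeout_s: float) -> Optional[bytes]:
    """Stub for the microphone: return captured audio within timeout_s
    seconds, or None if no user voice was detected."""
    return None  # placeholder

def capture_read_after(reference_audio: str,
                       timeout_s: float = 5.0,
                       max_prompts: int = 3) -> Optional[bytes]:
    """Play the reference audio and wait for read-after audio, re-outputting
    the reference and listening again if nothing is heard in time."""
    for _ in range(max_prompts):
        play(reference_audio)
        follow = listen(timeout_s)
        if follow is not None:
            return follow
    return None
```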
And step 406, matching the read-after audio with the output audio corresponding to the target text to obtain and output a matching result.
In this embodiment, based on the read-after audio obtained in step 405, the execution body may match the read-after audio with the output audio corresponding to the target text to obtain a matching result and output it. The matching result may characterize the similarity between the pronunciation in the read-after audio and the pronunciation in the audio corresponding to the target text, and may include, but is not limited to, at least one of the following: characters, numbers, symbols, and images. For example, the matching result may include a number; the larger the number, the more similar the pronunciation in the read-after audio is to the pronunciation in the audio corresponding to the target text.
Specifically, the execution body may match the read-after audio with the output audio corresponding to the target text by various methods. For example, the execution body may first convert the read-after audio into a read-after text, then obtain the text corresponding to the output audio as a reference text, and compute the similarity between the read-after text and the reference text as the matching result. Alternatively, the execution body may match the read-after audio with the output audio using the existing GOP (Goodness of Pronunciation) algorithm.
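As one concrete realization of the text-based matching route (not the GOP route), the sketch below scores the read-after transcript against the reference text with the standard library's difflib; the transcription step is assumed to have been done already.

```python
from difflib import SequenceMatcher

def match_read_after(read_after_text: str, reference_text: str) -> float:
    """Return a similarity score in [0, 1]; higher means the read-after
    transcript is closer to the text of the output audio."""
    return SequenceMatcher(None, read_after_text, reference_text).ratio()

# An exact follow-read scores 1.0; a partial one scores lower.
print(match_read_after("abc", "abc"))  # 1.0
print(match_read_after("ab", "abc"))   # 0.8
```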
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the process 400 of the information interaction method in this embodiment highlights the steps of acquiring the read-after audio input by the user after outputting the audio corresponding to the target text, and matching the read-after audio against that audio to obtain and output a matching result. The scheme described in this embodiment can therefore provide a read-after function on top of the intention-based click-to-read function, improving the comprehensiveness and diversity of information interaction. Feeding the matching result back to the user also lets the user determine whether the pronunciation in the read-after audio is accurate, enabling more targeted learning and improving the user experience.
With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present disclosure provides an embodiment of an information interaction apparatus, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the information interaction apparatus 500 of this embodiment includes: a first acquisition unit 501, a first recognition unit 502, a second recognition unit 503, and an output unit 504. The first acquisition unit 501 is configured to acquire, in response to receiving a click-to-read request initiated by a user, an image to be recognized and voice information input by the user, where the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object includes a target number of characters; the first recognition unit 502 is configured to recognize the image to be recognized and obtain a target text including the target number of characters; the second recognition unit 503 is configured to recognize the voice information and obtain intention information representing the user's intention; and the output unit 504 is configured to acquire and output the audio corresponding to the target text based on the intention information.
In this embodiment, the first acquisition unit 501 of the information interaction apparatus 500 may, in response to receiving a click-to-read request initiated by a user, acquire the image to be recognized and the voice information input by the user through a wired or wireless connection. The click-to-read request may be used to request the audio of the click-to-read object pointed at by the user. The image to be recognized may be an image obtained by shooting the click-to-read object pointed at by the user. The click-to-read object may include a target number of characters. The target number may be a preset number; alternatively, it may be a number determined based on the text pointed at by the user.
In this embodiment, based on the image to be recognized obtained by the first acquisition unit 501, the first recognition unit 502 may recognize the image to be recognized and obtain a target text including the target number of characters.
In this embodiment, based on the voice information obtained by the first acquisition unit 501, the second recognition unit 503 may recognize the voice information to obtain intention information representing the user's intention. The intention information may be used to characterize whether the target the user wishes to have read is a character or a vocabulary.
In this embodiment, based on the intention information obtained by the second recognition unit 503, the output unit 504 may acquire the audio corresponding to the target text and output it. The audio corresponding to the target text may be the audio corresponding to the characters included in the target text or the audio corresponding to the vocabulary included in the target text, as determined by the intention information.
In some optional implementations of this embodiment, the output unit 504 may be further configured to: in response to the intention information indicating that the user's intention is to identify characters, acquire and output the audio corresponding to the characters included in the target text.
In some optional implementations of this embodiment, the output unit 504 may include: a determination module (not shown) configured to determine, in response to the intention information indicating that the user's intention is to identify a vocabulary, whether the target text includes a vocabulary; and a first output module (not shown) configured to acquire and output, in response to the target text including a vocabulary, the audio corresponding to the vocabulary included in the target text.
In some optional implementations of this embodiment, the output unit 504 may further include: a second output module (not shown) configured to acquire and output, in response to the target text not including a vocabulary, the audio corresponding to the characters included in the target text.
In some optional implementations of this embodiment, the apparatus 500 may further include: a second acquisition unit (not shown) configured to acquire the read-after audio input by the user for the output audio; and a matching unit (not shown) configured to match the read-after audio with the output audio corresponding to the target text, obtain a matching result, and output the matching result.
It will be understood that the elements described in the apparatus 500 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
The apparatus 500 provided by the above embodiment of the present disclosure performs click-to-read based on the user's gesture and voice, which helps identify the user's real intention during click-to-read and output audio matching that real intention, improving the accuracy of information interaction.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the terminal device of fig. 1) 600 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to receiving a click-to-read request initiated by a user, acquire an image to be recognized and voice information input by the user, wherein the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object comprises a target number of characters; recognize the image to be recognized to obtain a target text comprising the target number of characters; recognize the voice information to obtain intention information representing the user's intention; and acquire and output the audio corresponding to the target text based on the intention information.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not limit the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires the image to be recognized and the voice information".
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.

Claims (12)

1. An information interaction method comprises the following steps:
in response to receiving a click-to-read request initiated by a user, acquiring an image to be recognized and voice information input by the user, wherein the image to be recognized is an image obtained by shooting the click-to-read object pointed at by the user, and the click-to-read object comprises a target number of characters;
recognizing the image to be recognized to obtain a target text comprising the target number of characters;
recognizing the voice information to obtain intention information representing the user's intention;
and acquiring and outputting the audio corresponding to the target text based on the intention information.
2. The method of claim 1, wherein the obtaining and outputting audio corresponding to the target text based on the intent information comprises:
in response to the intention information indicating that the user's intention is to identify characters, acquiring and outputting the audio corresponding to the characters included in the target text.
3. The method of claim 1, wherein the obtaining and outputting audio corresponding to the target text based on the intent information comprises:
in response to the intention information indicating that the user's intention is to identify a vocabulary, determining whether the target text includes a vocabulary;
and in response to the target text including a vocabulary, acquiring and outputting the audio corresponding to the vocabulary included in the target text.
4. The method of claim 3, wherein the obtaining and outputting audio corresponding to the target text based on the intent information further comprises:
and in response to the target text not including a vocabulary, acquiring and outputting the audio corresponding to the characters included in the target text.
5. The method according to one of claims 1-4, wherein after said obtaining and outputting audio corresponding to said target text, said method further comprises:
acquiring the read-after audio input by the user for the output audio;
and matching the read-after audio with the output audio corresponding to the target text to obtain a matching result and outputting the matching result.
6. An information interaction device, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is configured to respond to a click-to-read request initiated by a user and acquire an image to be recognized and voice information input by the user, the image to be recognized is an image obtained by shooting a click-to-read object pointed by the user, and the click-to-read object comprises a target number of characters;
the first identification unit is configured to identify the image to be identified and obtain a target text comprising the target number of characters;
a second recognition unit configured to recognize the voice information and obtain intention information for representing the user intention of the user;
and the output unit is configured to acquire and output the audio corresponding to the target text based on the intention information.
7. The apparatus of claim 6, wherein the output unit is further configured to:
in response to the intention information indicating that the user's intention is to identify characters, acquire and output the audio corresponding to the characters included in the target text.
8. The apparatus of claim 6, wherein the output unit comprises:
a determination module configured to determine, in response to the intention information indicating that the user's intention is to identify a vocabulary, whether the target text includes a vocabulary;
and a first output module configured to acquire and output, in response to the target text including a vocabulary, the audio corresponding to the vocabulary included in the target text.
9. The apparatus of claim 8, wherein the output unit further comprises:
and a second output module configured to acquire and output, in response to the target text not including a vocabulary, the audio corresponding to the characters included in the target text.
10. The apparatus according to one of claims 6-9, wherein the apparatus further comprises:
a second acquisition unit configured to acquire the read-after audio input by the user for the output audio;
and a matching unit configured to match the read-after audio with the output audio corresponding to the target text, obtain a matching result, and output the matching result.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202010135362.7A 2020-03-02 2020-03-02 Information interaction method and device Pending CN112309389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010135362.7A CN112309389A (en) 2020-03-02 2020-03-02 Information interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010135362.7A CN112309389A (en) 2020-03-02 2020-03-02 Information interaction method and device

Publications (1)

Publication Number Publication Date
CN112309389A true CN112309389A (en) 2021-02-02

Family

ID=74336650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010135362.7A Pending CN112309389A (en) 2020-03-02 2020-03-02 Information interaction method and device

Country Status (1)

Country Link
CN (1) CN112309389A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268981A (en) * 2021-05-27 2021-08-17 咪咕音乐有限公司 Information processing method and device and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875694A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Speech output method and device
CN108960157A (en) * 2018-07-09 2018-12-07 广东小天才科技有限公司 A kind of man-machine interaction method and intelligent table lamp based on intelligent table lamp
CN109522835A (en) * 2018-11-13 2019-03-26 北京光年无限科技有限公司 Children's book based on intelligent robot is read and exchange method and system
CN109582825A (en) * 2018-12-07 2019-04-05 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN109979433A (en) * 2019-04-02 2019-07-05 北京儒博科技有限公司 Voice is with reading processing method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875694A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Speech output method and device
US20200013386A1 (en) * 2018-07-04 2020-01-09 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting voice
CN108960157A (en) * 2018-07-09 2018-12-07 广东小天才科技有限公司 A kind of man-machine interaction method and intelligent table lamp based on intelligent table lamp
CN109522835A (en) * 2018-11-13 2019-03-26 北京光年无限科技有限公司 Children's book based on intelligent robot is read and exchange method and system
CN109582825A (en) * 2018-12-07 2019-04-05 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN109979433A (en) * 2019-04-02 2019-07-05 北京儒博科技有限公司 Voice is with reading processing method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268981A (en) * 2021-05-27 2021-08-17 咪咕音乐有限公司 Information processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110162670B (en) Method and device for generating expression package
CN109993150B (en) Method and device for identifying age
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN110059623B (en) Method and apparatus for generating information
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN111897950A (en) Method and apparatus for generating information
CN109829431B (en) Method and apparatus for generating information
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN112883966B (en) Image character recognition method, device, medium and electronic equipment
CN111859970B (en) Method, apparatus, device and medium for processing information
CN111026849B (en) Data processing method and device
CN110046571B (en) Method and device for identifying age
CN112309389A (en) Information interaction method and device
CN111462548A (en) Paragraph point reading method, device, equipment and readable medium
EP4276827A1 (en) Speech similarity determination method, device and program product
CN111787264B (en) Question asking method and device for remote teaching, question asking terminal and readable medium
CN110263743B (en) Method and device for recognizing images
CN112115740B (en) Method and apparatus for processing image
CN112309387A (en) Method and apparatus for processing information
CN112308745A (en) Method and apparatus for generating information
CN115310582A (en) Method and apparatus for training neural network models
CN112307162A (en) Method and device for information interaction
CN112306449A (en) Method and apparatus for outputting information
CN111310858A (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination