CN111191640A

CN111191640A - Three-dimensional scene presenting method, device and system

Info

Publication number: CN111191640A
Application number: CN202010186321.0A
Authority: CN
Inventors: 郭艳
Original assignee: Chengdu Huiyi Noga Culture Communication Co Ltd
Current assignee: Chengdu Huiyi Noga Culture Communication Co Ltd
Priority date: 2017-03-30
Filing date: 2017-03-30
Publication date: 2020-05-22
Anticipated expiration: 2037-03-30
Also published as: CN106951881A; CN106951881B; CN111191640B

Abstract

The invention discloses a three-dimensional scene presenting method, a device and a system, wherein the method comprises the following steps: scanning a picture to be processed once through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture; if the picture to be processed is a preset scene picture, displaying a preset keyword associated with the preset scene picture to a user; acquiring voice information provided by a user aiming at a preset keyword, and carrying out voice recognition on the voice information to judge whether the preset keyword is included and acquire the preset keyword included by the voice information; scanning the picture to be processed for the second time through the camera, and displaying the picture of the picture to be processed in the scanning area in the screen; three-dimensional objects associated with predetermined keywords included in the voice information are acquired and loaded onto the image to present a three-dimensional scene to the user.

Description

Three-dimensional scene presenting method, device and system

The invention is divisional application of invention patent application 201710203721.6 applied on 3, 30 and 2017.

Technical Field

The invention relates to the technical field of image processing, in particular to a three-dimensional scene presenting method, device and system.

Background

With the development of image processing and related technologies, more and more people are invested in the research of AR (Augmented Reality) technology. The AR technology is scene synthesis on the basis of reality, adds a part of information to expand data mastered by people, applies virtual information to the real world, mashups real environment and virtual objects to be overlaid to the same space, the same scene and the same picture in real time, relates to various technologies such as image recognition, image matching, three-dimensional modeling, video display and control and can be applied to multiple fields such as teaching, advertising, retail industry, medical health and entertainment games.

For example, in the field of teaching, the development of AR products such as AR books is more rapid. The AR book applies the AR technology to the book as the name implies, and has the greatest characteristic that static pictures and texts are alive. In general, a user scans pictures on a designated page of an AR book through a camera of a mobile terminal such as a mobile phone or a tablet to perform image recognition, and if the image recognition is successful, a simple animation or a three-dimensional model is displayed in an application associated with the AR book in the mobile terminal, so that an overall three-dimensional scene is presented to the user. However, there is hardly any interaction with the user in this process, the display cannot be performed according to the needs of the user, and the actual manual operation opportunity is not provided, so that the user experience is poor, and the attraction is low for a specific user such as a child. Furthermore, the above problems occur in not only AR books but also other AR products, and thus a new three-dimensional scene rendering method is required to optimize the above process.

Disclosure of Invention

To this end, the present invention provides a solution for three-dimensional scene rendering in an attempt to solve or at least alleviate the above-existing problems.

According to an aspect of the present invention, there is provided a three-dimensional scene presenting method adapted to be executed in a mobile terminal, the mobile terminal including a data storage device, the data storage device storing therein a predetermined keyword associated with a predetermined scene picture and a three-dimensional object associated with the predetermined keyword, the method including the steps of: scanning a picture to be processed once through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture; if the picture to be processed is a preset scene picture, displaying a preset keyword associated with the preset scene picture to a user; acquiring voice information provided by a user aiming at a preset keyword, and carrying out voice recognition on the voice information to judge whether the preset keyword is included and acquire the preset keyword included by the voice information; scanning the picture to be processed for the second time through the camera, and displaying the picture of the picture to be processed in the scanning area in the screen; three-dimensional objects associated with predetermined keywords included in the voice information are acquired from the data storage device and loaded onto the image to present the three-dimensional scene to the user.

Optionally, in the three-dimensional scene presenting method according to the present invention, the data storage device stores an image feature set corresponding to a predetermined scene picture, and the step of scanning the picture to be processed once by the camera and recognizing the image of the picture to be processed to determine whether the picture to be processed is the predetermined scene picture includes: starting a camera to scan a picture to be processed; acquiring an image of a picture to be processed in a scanning area; extracting feature points of the image to generate a feature set to be identified; acquiring an image feature set from data storage equipment, and performing feature matching on the feature set to be identified and the image feature set; and if the matching is successful, judging that the picture to be processed is a preset scene picture.

Optionally, in the three-dimensional scene presenting method according to the present invention, the image feature set includes a plurality of image feature points, and the step of performing feature matching between the feature set to be identified and the image feature set includes: carrying out feature matching on the feature set to be identified and the image feature set, and counting the number of feature points successfully matched to serve as the number of matched pairs; acquiring the number of feature points of a feature set to be identified as a first number, and acquiring the number of image feature points in the image feature set as a second number; calculating the ratio of the number of the matching pairs to the smaller value of the first number and the second number as the image matching degree; and if the image matching degree is greater than the first threshold value, judging that the image feature set is successfully matched.

Optionally, in the three-dimensional scene presenting method according to the present invention, the mobile terminal is in communication connection with a network server, the network server stores a predetermined keyword associated with a predetermined scene picture, and the step of performing voice recognition on the voice information to determine whether the predetermined keyword is included and obtain the predetermined keyword included in the voice information includes: sending the voice information to a network server, and indicating the network server to perform voice recognition on the voice information so as to judge whether the voice information comprises a preset keyword or not; and receiving a voice recognition result returned by the network server, and acquiring a preset keyword included in the voice information according to the voice recognition result.

Optionally, in the three-dimensional scene presenting method according to the present invention, the step of performing voice recognition on the voice information by the network server to determine whether the predetermined keyword is included therein includes: receiving voice information sent by a mobile terminal and carrying out voice recognition on the voice information; if the voice information comprises the preset keywords, taking the recognized preset keywords as voice recognition results, and sending the voice recognition results to the corresponding mobile terminals; and if the voice information does not contain the preset keywords, the failure of voice recognition is used as a voice recognition result, and the voice recognition result is sent to the corresponding mobile terminal.

Optionally, in the three-dimensional scene presenting method according to the present invention, the step of scanning the picture to be processed twice by the camera, and displaying the image of the picture to be processed in the scanning area on the screen includes: secondarily scanning the picture to be processed through the camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be scanned is a preset scene picture; and if the picture to be scanned is a preset scene picture, displaying an image of the picture to be processed in the scanning area in the screen.

Alternatively, in the three-dimensional scene presenting method according to the present invention, the step of storing, in the data storage device, position information associated with the three-dimensional object, the position information being used to display the three-dimensional object at a predetermined position in a predetermined scene picture, and acquiring, from the data storage device, the three-dimensional object associated with a predetermined keyword included in the voice information, and loading the three-dimensional object onto the image includes: acquiring a three-dimensional object associated with a preset keyword from a data storage device according to the preset keyword included in the voice information; according to the obtained three-dimensional object, obtaining position information associated with the three-dimensional object from data storage equipment; and loading the three-dimensional object associated with the position information to the corresponding position in the image according to the position information.

Optionally, in the three-dimensional scene presenting method according to the present invention, the method further includes: and playing the voice information while presenting the three-dimensional scene to the user.

According to still another aspect of the present invention, there is provided a three-dimensional scene presenting apparatus adapted to reside in a mobile terminal, the mobile terminal including a data storage device, the data storage device storing therein a predetermined keyword associated with a predetermined scene picture and a three-dimensional object associated with the predetermined keyword, the apparatus including an image recognition module, a first display module, a voice processing module, a second display module, and a loading module. The image recognition module is suitable for scanning a picture to be processed once through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture; the first display module is suitable for displaying a preset keyword associated with a preset scene picture to a user when the picture to be processed is the preset scene picture; the voice processing module is suitable for acquiring voice information provided by a user aiming at the preset keywords, and performing voice recognition on the voice information to judge whether the preset keywords are included and acquire the preset keywords included by the voice information; the second display module is suitable for scanning the picture to be processed for the second time through the camera and displaying the image of the picture to be processed in the scanning area in the screen; the loading module is adapted to retrieve a three-dimensional object associated with a predetermined keyword comprised by the speech information from the data storage device and load it onto the image to present the three-dimensional scene to the user.

Optionally, in the three-dimensional scene rendering apparatus according to the present invention, the data storage device stores therein an image feature set corresponding to a predetermined scene picture, and the image recognition module is further adapted to: starting a camera to scan a picture to be processed; acquiring an image of a picture to be processed in a scanning area; extracting feature points of the image to generate a feature set to be identified; acquiring an image feature set from data storage equipment, and performing feature matching on the feature set to be identified and the image feature set; and when the matching is successful, judging that the picture to be processed is a preset scene picture.

Optionally, in the three-dimensional scene rendering apparatus according to the present invention, the image feature set includes a plurality of image feature points, and the image recognition module is further adapted to: carrying out feature matching on the feature set to be identified and the image feature set, and counting the number of feature points successfully matched to serve as the number of matched pairs; acquiring the number of feature points of a feature set to be identified as a first number, and acquiring the number of image feature points in the image feature set as a second number; calculating the ratio of the number of the matching pairs to the smaller value of the first number and the second number as the image matching degree; and when the image matching degree is greater than a first threshold value, judging that the image feature set is successfully matched.

Optionally, in the three-dimensional scene presenting apparatus according to the present invention, the mobile terminal is in communication connection with a network server, the network server stores a predetermined keyword associated with a predetermined scene picture, and the voice processing module is further adapted to: sending the voice information to a network server, and indicating the network server to perform voice recognition on the voice information so as to judge whether the voice information comprises a preset keyword or not; and receiving a voice recognition result returned by the network server, and acquiring a preset keyword included in the voice information according to the voice recognition result.

Optionally, in the three-dimensional scene rendering apparatus according to the present invention, the second display module is further adapted to: the method comprises the steps of scanning a picture to be processed for the second time through a camera, calling an image recognition module to perform image recognition on the picture to be processed, and judging whether the picture to be scanned is a preset scene picture; and when the picture to be scanned is a preset scene picture, displaying an image of the picture to be processed in the scanning area in the screen.

Optionally, in the three-dimensional scene rendering apparatus according to the present invention, the data storage device stores therein position information associated with the three-dimensional object, the position information being used for displaying the three-dimensional object at a predetermined position in a predetermined scene picture, the loading module is further adapted to: acquiring a three-dimensional object associated with a preset keyword from a data storage device according to the preset keyword included in the voice information; according to the obtained three-dimensional object, obtaining position information associated with the three-dimensional object from data storage equipment; and loading the three-dimensional object associated with the position information to the corresponding position in the image according to the position information.

Optionally, in the three-dimensional scene presenting apparatus according to the present invention, further comprising a playing module adapted to: and playing the voice information while presenting the three-dimensional scene to the user.

According to still another aspect of the present invention, there is provided a mobile terminal including the three-dimensional scene rendering apparatus according to the present invention.

According to yet another aspect of the present invention, there is provided a mobile terminal comprising one or more processors, a memory, a camera, a display screen, and one or more programs stored in the memory, wherein the one or more programs include instructions for performing the three-dimensional scene rendering method according to the present invention and are configured to be executed by the one or more processors to invoke the camera to perform a scanning process to render a three-dimensional scene on the display screen.

According to still another aspect of the present invention, there is also provided a three-dimensional scene rendering system including a plurality of mobile terminals according to the present invention and a network server according to the present invention.

According to the technical scheme of the three-dimensional scene presentation, firstly, a to-be-processed picture is scanned once through a camera and subjected to image recognition, if the to-be-processed picture is a preset scene picture, preset keywords related to the preset scene picture are displayed, voice information provided by a user aiming at the preset keywords is obtained and subjected to voice recognition, the preset keywords included by the voice information are obtained, the to-be-processed picture is scanned twice, the image located in a scanning area is displayed, a three-dimensional object related to the preset keywords included by the voice information is obtained from a data storage device and loaded onto the image, and the three-dimensional scene is presented to the user. In the technical scheme, the preset keywords related to the preset scene picture are displayed on the screen for the user to select, and the user can select one or more preset keywords at the moment and send the voice information consisting of the preset keywords to the mobile terminal, so that the actual manual operation opportunity is provided, and the interactive experience of the user is enhanced. Moreover, when voice information recognition is performed, considering that the network server generally has higher operation speed and hardware configuration, the process of voice recognition can be executed by calling the network server in communication connection with the mobile terminal, so that on one hand, the efficiency of voice recognition is improved, the time cost is saved, and on the other hand, the operation load of the mobile terminal is reduced. After the picture to be processed is scanned for the second time through the camera, the picture to be processed is subjected to image recognition again, so that the scanned picture is still the preset scene picture, and other pictures which are not the preset scene picture are prevented from being displayed in a screen due to errors of the picture scanned for the second time. And finally, when a three-dimensional scene is presented, the corresponding three-dimensional object is loaded to the corresponding position in the image according to the position information associated with the three-dimensional object to be displayed, and meanwhile, the visual picture is expanded to the audio-visual multi-directional perception for the user to play the voice information provided by the user aiming at the preset keyword, so that the user experience is greatly improved, particularly, the attraction to specific users such as children can be improved, and further, the rapid development of the AR technology is facilitated.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a schematic diagram of a three-dimensional scene rendering system 100 according to one embodiment of the invention;

FIG. 2 illustrates a block diagram of a mobile terminal 200 according to one embodiment of the present invention;

FIG. 3 illustrates a flow diagram of a three-dimensional scene rendering method 300 according to one embodiment of the invention;

FIG. 4 shows a schematic diagram of a three-dimensional scene rendering apparatus 400 according to an embodiment of the invention;

FIG. 5 shows a schematic diagram of a three-dimensional scene rendering apparatus 500 according to yet another embodiment of the invention; and

fig. 6 shows a schematic diagram of a network server 600 according to one embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 shows a schematic view of a three-dimensional scene rendering system 100 according to an embodiment of the invention. It should be noted that the three-dimensional scene presenting system 100 in fig. 1 is only exemplary, in a specific practical situation, there may be different numbers of mobile terminals and network servers in the three-dimensional scene presenting system 100, and the present invention does not limit the number of mobile terminals and network servers included in the three-dimensional scene presenting system 100. As shown in fig. 1, the three-dimensional scene rendering system 100 includes a mobile terminal 200 and a web server 600. The network server 600 is in communication connection with the mobile terminal 200, and the mobile terminal 200 may be a smart phone, a tablet computer, or the like, but is not limited thereto. The mobile terminal 200 includes therein a data storage device (not shown in the drawings) in which a predetermined keyword associated with a predetermined scene picture and a three-dimensional object associated with the predetermined keyword are stored.

In the three-dimensional scene presenting system 100, a user firstly scans a to-be-processed picture through a camera of the mobile terminal 200, the mobile terminal 200 performs image recognition on the scanned image to determine whether the scanned image is a predetermined scene picture, and if the to-be-processed picture is the predetermined scene picture, a predetermined keyword associated with the predetermined scene picture is displayed to the user on a screen. At this time, the user may select a part or all of the predetermined keywords from the screen, send the voice information composed of the predetermined keywords to the mobile terminal 200, and perform voice recognition on the voice information after the voice information is recorded in the mobile terminal 200. In this embodiment, the mobile terminal 200 transmits the voice message to the web server 600, the web server 600 performs a voice recognition process to determine whether the voice message includes a predetermined keyword, and if the voice message includes the predetermined keyword, the recognized predetermined keyword is used as a voice recognition result and is transmitted to the mobile terminal 200. It should be noted that the processing procedure for voice recognition can also be executed in the mobile terminal 200, and is not limited herein. After the mobile terminal 200 receives the voice recognition result, the user may scan the picture to be processed again through the camera of the mobile terminal 200, and if the picture to be processed is the predetermined scene picture, an image of the picture to be processed in the scanning area may be displayed in the screen. And finally, acquiring the three-dimensional object associated with the preset keyword included by the voice information from the data storage device, and loading the three-dimensional object to the image so as to present a three-dimensional scene to the user, and simultaneously playing the voice information provided by the user aiming at the selected preset keyword, thereby bringing audio-visual multi-directional experience to the user.

Fig. 2 shows a block diagram of a mobile terminal 200 according to an embodiment of the present invention. The mobile terminal 200 may include a memory interface 202, one or more data processors, image processors and/or central processing units 204, and a peripheral interface 206.

The memory interface 202, the one or more processors 204, and/or the peripherals interface 206 can be discrete components or can be integrated in one or more integrated circuits. In the mobile terminal 200, the various elements may be coupled by one or more communication buses or signal lines. Sensors, devices, and subsystems can be coupled to peripheral interface 206 to facilitate a variety of functions.

For example, a motion sensor 210, a light sensor 212, and a distance sensor 214 may be coupled to the peripheral interface 206 to facilitate directional, lighting, and ranging functions. Other sensors 216 may also be coupled to the peripheral interface 206, such as a positioning system (e.g., a GPS receiver), a temperature sensor, a biometric sensor, or other sensing device, to facilitate related functions.

Camera subsystem 220 and optical sensor 222, which may be, for example, a charge-coupled device (CCD) or complementary metal-oxide-semiconductor (centimeter OS) optical sensor, may be used to facilitate implementation of camera functions such as recording photographs and video clips. Communication functions may be facilitated by one or more wireless communication subsystems 224, which may include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. The particular design and implementation of wireless communication subsystem 224 may depend on one or more communication networks supported by mobile terminal 200. For example, the mobile terminal 200 may include a network designed to support LTE, 3G, GSM networks, GPRS networks, EDGE networks, Wi-Fi or WiMax networks, and Bluetooth^TM A communication subsystem 224 of the network.

The audio subsystem 226 may be coupled with a speaker 228 and a microphone 230 to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and telephony functions. The I/O subsystem 240 may include a touchscreen controller 242 and/or one or more other input controllers 244. The touch screen controller 242 may be coupled to a touch screen 246. For example, the touch screen 246 and touch screen controller 242 may detect contact and movement or pauses made therewith using any of a variety of touch sensing technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies. One or more other input controllers 244 can be coupled to other input/control devices 248, such as one or more buttons, rocker switches, thumbwheels, infrared ports, USB ports, and/or pointing devices such as styluses. The one or more buttons (not shown) may include up/down buttons for controlling the volume of the speaker 228 and/or the microphone 230.

The memory interface 202 may be coupled with a memory 250. The memory 250 may include high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR). The memory 250 may store an operating system 272, such as an operating system like Android, iOS, or Windows Phone. The operating system 272 may include instructions for handling basic system services and for performing hardware dependent tasks. The memory 250 may also store applications 274. While the mobile device is running, the operating system 272 is loaded from memory 250 and executed by the processor 204. Applications 274, when running, are also loaded from memory 250 and executed by processor 204. The applications 274 run on top of the operating system and utilize interfaces provided by the operating system and underlying hardware to implement various user-desired functions, such as instant messaging, web browsing, picture management, and the like. The application 274 may be provided separately from the operating system or may be native to the operating system. In addition, a driver module may also be added to the operating system when the application 274 is installed in the mobile terminal 200. Among the various applications 274 described above, one of them is a three-dimensional scene rendering device 400 according to the present invention. The application 274 further includes a data storage device 290 according to the present invention, and the data storage device 290 stores therein a predetermined keyword associated with a predetermined scene picture, and a three-dimensional object associated with the predetermined keyword.

FIG. 3 shows a flow diagram of a three-dimensional scene rendering method 300 according to one embodiment of the invention. The three-dimensional scene rendering method 300 is suitable for execution in a mobile terminal 200, such as the mobile terminal 200 shown in fig. 2.

As shown in fig. 3, the method 300 begins at step S310. In step S310, the picture to be processed is scanned once by the camera and subjected to image recognition to determine whether the picture to be processed is a predetermined scene picture. According to an embodiment of the present invention, the data storage device 290 stores therein an image feature set corresponding to a predetermined scene picture. According to this embodiment, whether or not the picture to be processed is a predetermined scene picture can be determined in the following manner. Firstly, a user starts a camera through the mobile terminal 200 to scan a picture to be processed, the mobile terminal 200 acquires an image of the picture to be processed in a scanning area after the scanning is finished, and characteristic points of the image are extracted to generate a characteristic set to be identified. Regarding the process of extracting the image feature points to generate the feature set to be identified, it can be performed in the following manner. Firstly, generating a corresponding image scale space according to the image, then detecting a local extreme point in the image scale space, and then accurately positioning the local extreme point by removing a low-contrast point and an edge response point to finally obtain a characteristic point capable of reflecting the image characteristic. In this embodiment, a total of 189 feature points capable of reflecting image features are provided. When describing the feature points, the main direction of each extreme point is calculated, histogram gradient direction statistics is carried out on the region with the extreme point as the center, and a feature descriptor is generated. And generating a feature set to be identified by the feature points capable of reflecting the image features. The above algorithm for extracting feature points from an image may be selected from well-established algorithms related to feature point extraction in the prior art, which are not described herein in detail, and all of these algorithms are easily conceivable by those skilled in the art and are within the scope of the present invention.

After the feature set to be identified corresponding to the picture to be processed is generated, the image feature set is obtained from the data storage device 290, the feature set to be identified and the image feature set are subjected to feature matching, and if the matching is successful, the picture to be processed is judged to be a preset scene picture. According to one embodiment of the present invention, the set of image features stored in the data storage device 290 includes a plurality of image feature points. In this embodiment, the data storage device 290 stores 50 image matching sets, each having a name a 1-a 50 in sequence and corresponding to different predetermined scene pictures P1-P50, in other words, there is a one-to-one correspondence relationship between the image matching sets and the predetermined scene pictures, so that the predetermined keywords associated with the predetermined scene pictures can be regarded as being associated with the image feature sets corresponding to the predetermined scene pictures, that is, each image matching set has the predetermined keywords associated therewith. Table 1 shows an example of storing data related to an image matching set according to an embodiment of the present invention, and for convenience of description, table 1 does not include a feature descriptor corresponding to each image feature point, a name of a predetermined scene picture, and a three-dimensional object associated with each predetermined keyword, which is specifically shown as follows:

image matching set	Number of image feature points	Predetermined keywords
			A1	276	Tiger, fox, little rabbit, radish, forest
A2	179	Small fish, small shrimp, aquatic plant and pond
			A3	225	Benchi, panda, chick, monkey, parrot, circus
……	……	……
			A25	78	Small monk, two monks, big monks, bucket, well and temple
A26	357	Tadpole, carp, tortoise, frog, lotus leaf and brook
			……	……	……
A49	196	Piggy, wolf, grass house, wood house, brick house, grassland
			A50	208	Desk, chair, blackboard, chalk, window and classroom

TABLE 1

As shown in table 1, the image matching sets a 1-a 50 respectively have corresponding image feature points and associated predetermined keywords, for example, the image matching set a1 includes 276 image feature points, and the associated predetermined keywords are 5, which are respectively tiger, fox, little white rabbit, radish and forest. Generally, when performing feature matching, feature matching is performed on the feature set to be identified and the image matching sets a 1-a 50 in sequence, in this process, if a certain image matching set is successfully matched, it is determined that the picture to be processed is a predetermined scene picture corresponding to the successfully matched image matching set, and if no successfully matched image matching set exists, it is determined that the picture to be processed is not any predetermined scene picture. According to this embodiment, the set of features to be identified may be feature matched with the set of image matches by the following method. Firstly, feature matching is carried out on a feature set to be identified and the image feature set, and the number of feature points which are successfully matched is counted to serve as the number of matching pairs. The algorithm for matching features may be selected from well-established algorithms related to feature matching in the prior art, which are not described herein in detail, and all of these algorithms are easily conceivable by those skilled in the art and are within the scope of the present invention. Taking the image matching set a1 as an example, after the feature set to be identified is feature-matched with the image matching set a1, the number of feature points with successful matching is counted to be 30, and then the number of matching pairs is counted to be 30. Next, the number of feature points of the feature set to be identified is obtained as a first number, the number of image feature points in the image feature set a1 is obtained as a second number, the first number and the second number are obtained as 189 and 225, respectively, and a ratio of the number of matching pairs to a smaller value of the first number and the second number is calculated as an image matching degree, and the obtained image matching degree is 30/189-0.159. And finally, comparing the image matching degree with a first threshold value, and if the image matching degree is greater than the first threshold value, judging that the image feature set is successfully matched. In this embodiment, preferably, the first threshold is 0.75, and since the image matching degree is greater than the first threshold and the image matching degree is much smaller than the first threshold, it is determined that the image feature set a1 fails to match, and at this time, feature matching is continuously performed on the feature set to be identified and the remaining image matching sets in sequence until an image feature set that matches successfully appears. According to the embodiment, the successfully matched image feature set is finally obtained as the image feature set A3, which indicates that the picture to be processed is the predetermined scene picture P3 corresponding to the image feature set A3.

Subsequently, step S320 is performed, and if the to-be-processed picture is a predetermined scene picture, a predetermined keyword associated with the predetermined scene picture is displayed to the user. According to an embodiment of the present invention, the picture to be processed is the predetermined scene picture P3, and 6 predetermined keywords associated with the predetermined scene picture P3 are displayed to the user, respectively, a galloping, a panda, a chick, a monkey, a parrot, and a circus.

Next, in step S330, voice information provided by the user for the predetermined keyword is acquired, and voice recognition is performed on the voice information to determine whether the predetermined keyword is included therein and acquire the predetermined keyword included in the voice information. According to an embodiment of the present invention, after the 6 predetermined keywords are displayed on the screen of the mobile terminal 200, the user may select 1 or more of the 6 predetermined keywords to organize the language to form a content such as a short sentence or a small story, and express the content by way of narration so that the mobile terminal 200 records the content to obtain the corresponding voice information. For example, before the user speaks the content, the user may click a button such as "ok" or "start recording" on the screen to trigger a recording event, and at this time, the mobile terminal 200 will respond to the recording operation of the user to obtain the voice information provided by the user for the predetermined keyword. In this embodiment, the recorded voice message is "chicken dancing on stage, monkey is performing a drill fire circle, parrotkine chirped and audience calls, very busy".

After the voice information is acquired, voice recognition is started for the voice information. According to an embodiment of the present invention, the mobile terminal 200 is communicatively connected to the web server 600, and the web server 600 stores therein a predetermined keyword associated with a predetermined scene picture, so that the web server 600 may be used to perform a voice recognition process in this embodiment. First, the mobile terminal 200 transmits voice information to the web server 600, and instructs the web server 600 to perform voice recognition on the voice information to determine whether a predetermined keyword is included therein. After receiving the voice information sent by the mobile terminal 200, the network server 600 performs voice recognition on the voice information, if the voice information includes a preset keyword, the recognized preset keyword is used as a voice recognition result, and if the voice information does not include the preset keyword, the voice recognition failure is used as a voice recognition result, and then the voice recognition result is sent to the corresponding mobile terminal. The algorithm for performing speech recognition may be selected from well-established algorithms related to speech recognition in the prior art, which are not described herein in detail, and all of which are easily conceivable by those skilled in the art and are within the scope of the present invention. For the speech information "chicken dancing on stage, monkey drilling fire circle, parrot chamazz and audience calling, very hot", the web server 600 recognizes it, the speech information includes 3 predetermined keywords associated with the predetermined scene picture P3, which are chicken, monkey and parrot, respectively, so that the 3 recognized predetermined keywords are used as the speech recognition result, and the speech recognition result is sent to the mobile terminal 200. At this time, the mobile terminal 200 receives the voice recognition result returned by the network server 600, and obtains the predetermined keywords included in the voice information according to the voice recognition result, and finally obtains the 3 predetermined keywords of the chicken, the monkey, and the parrot.

According to still another embodiment of the present invention, although the predetermined keywords associated with the predetermined scene picture P3 are displayed to the user in step S320, the user does not provide corresponding voice information for the predetermined keywords, so that the web server 600 determines that the predetermined keywords are not included in the voice information after performing voice recognition on the voice information, and thus transmits a voice recognition failure as a voice recognition result to the mobile terminal 200. After receiving the voice recognition result returned by the network server, the mobile terminal 200 may prompt the user to send out the voice information containing the predetermined keyword again so as to present the corresponding three-dimensional scene in the following.

After the predetermined keyword included in the voice information is acquired, step S340 is started to perform secondary scanning on the picture to be processed through the camera, and an image of the picture to be processed in the scanning area is displayed in the screen. According to one embodiment of the invention, after the image to be processed is scanned by the camera for the second time to obtain the image in the scanning area, the image is firstly subjected to image recognition processing to judge whether the image to be scanned is a preset scene image. The image recognition processing is performed on the scanned image, so that the phenomenon that a preset scene picture which does not correspond to the subsequently loaded three-dimensional object is displayed in a screen due to the fact that the picture to be processed of the secondary scanning and the picture to be processed of the primary scanning of the user are inconsistent is avoided. For the specific steps of image recognition, reference may be made to the processing procedure of image recognition on the picture to be processed in step S310, which is not described herein again. Further, if the picture to be scanned is the preset scene picture, the image of the picture to be processed in the scanning area is displayed in the screen, and if the picture to be scanned is not the preset scene picture, the user can be prompted to scan the picture by mistake and please scan the picture again. In this embodiment, if the picture to be processed scanned twice by the camera is the predetermined scene picture P3, an image of the predetermined scene picture P3 located in the scanning area is displayed on the screen.

Finally, step S350 is entered, and the three-dimensional object associated with the predetermined keyword included in the voice information is acquired from the data storage 290 and loaded onto the image to present the three-dimensional scene to the user. In the data storage device 290, according to an embodiment of the present invention, position information associated with a three-dimensional object is stored, the position information is used for displaying the three-dimensional object at a predetermined position in a predetermined scene picture, and table 2 shows a storage example of the position information associated with the three-dimensional object according to an embodiment of the present invention, and for convenience of description, only the name of the predetermined scene picture, a predetermined keyword, the three-dimensional object, and the position information are shown in table 2, specifically as follows:

TABLE 2

As is apparent from table 2, the data storage device 290 collectively stores 279 three-dimensional objects D1 to D279 associated with each predetermined keyword, respectively, and positional information S1 to S279 sequentially corresponding to the three-dimensional objects D1 to D279. According to the contents in table 2, a three-dimensional object associated with a predetermined keyword included in the voice information can be loaded on the image in the following manner. First, according to the predetermined keyword included in the voice information, the three-dimensional object associated with the predetermined keyword is obtained from the data storage device 290, and in step S330, the predetermined keyword included in the voice information is a chick, a monkey, and a parrot, so that the three-dimensional objects D12, D13, and D14 respectively associated with the chick, the monkey, and the parrot can be obtained. Subsequently, position information associated with the three-dimensional object is acquired from the data storage device 290 based on the acquired three-dimensional object, and then position information S12 associated with the three-dimensional object D12 of the predetermined keyword "chicken", position information S13 associated with the three-dimensional object D13 of the predetermined keyword "monkey", and position information S14 associated with the three-dimensional object D14 of the predetermined keyword "parrot" are acquired. And finally, loading the three-dimensional objects D12, D13 and D14 associated with the position information S12, S13 and S14 to corresponding positions in the image, and presenting the corresponding three-dimensional scene to the user. In order to better provide the user with audiovisual enjoyment, according to another embodiment of the present invention, the three-dimensional scene is presented while voice information previously provided by the user for the predetermined keyword is played, thereby further improving the user experience.

Fig. 4 shows a schematic diagram of a three-dimensional scene rendering apparatus 400 according to an embodiment of the invention. As shown in fig. 4, the three-dimensional scene rendering apparatus 400 resides in a mobile terminal 200, and the mobile terminal 200 includes a data storage device 290 and is communicatively connected to a network server 600. The three-dimensional scene representation apparatus 400 includes an image recognition module 410, a first display module 420, a voice processing module 430, a second display module 440, and a loading module 450.

The image recognition module 410 is adapted to scan the picture to be processed once through the camera and perform image recognition on the picture to be processed to determine whether the picture to be processed is a predetermined scene picture. The data storage device 290 stores therein an image feature set corresponding to a predetermined scene picture, and the image recognition module 410 is further adapted to start a camera to scan a picture to be processed; acquiring an image of a picture to be processed in a scanning area; extracting feature points of the image to generate a feature set to be identified; acquiring an image feature set from the data storage device 290, and performing feature matching on the feature set to be identified and the image feature set; and when the matching is successful, judging that the picture to be processed is a preset scene picture. The image feature set includes a plurality of image feature points, and the image identification module 410 is further adapted to perform feature matching on the feature set to be identified and the image feature set, and count the number of feature points successfully matched as the number of matching pairs; acquiring the number of feature points of a feature set to be identified as a first number, and acquiring the number of image feature points in the image feature set as a second number; calculating the ratio of the number of the matching pairs to the smaller value of the first number and the second number as the image matching degree; and when the image matching degree is greater than a first threshold value, judging that the image feature set is successfully matched.

The first display module 420 is connected to the image recognition module 410 and adapted to display a predetermined keyword associated with a predetermined scene picture to a user when the picture to be processed is the predetermined scene picture.

The voice processing module 430 is connected to the first display module 420, and is adapted to acquire voice information provided by a user for a predetermined keyword, perform voice recognition on the voice information to determine whether the predetermined keyword is included therein and acquire the predetermined keyword included in the voice information. The mobile terminal 200 is in communication connection with the web server 600, predetermined keywords associated with a predetermined scene picture are stored in the web server 600, and the voice processing module 430 is further adapted to send voice information to the web server 600, instruct the web server 600 to perform voice recognition on the voice information, so as to determine whether the predetermined keywords are included in the voice information; and receiving a voice recognition result returned by the network server 600, and acquiring a preset keyword included in the voice information according to the voice recognition result.

The second display module 440 is respectively connected to the image recognition module 410 and the voice processing module 430, and is adapted to scan the picture to be processed through the camera for the second time, and display an image of the picture to be processed in the scanning area on the screen. The second display module 440 is further adapted to scan the picture to be processed through the camera for the second time, and invoke the image recognition module to perform image recognition on the picture to be processed, so as to determine whether the picture to be scanned is a predetermined scene picture; and when the picture to be scanned is a preset scene picture, displaying an image of the picture to be processed in the scanning area in the screen.

The loading module 450 is connected to the second display module 440 and is adapted to retrieve a three-dimensional object associated with a predetermined keyword included in the voice information from the data storage 290 and load it onto an image to present a three-dimensional scene to a user. The data storage device 290 stores therein position information associated with the three-dimensional object, the position information being used to display the three-dimensional object at a predetermined position in a predetermined scene picture, the loading module 450 is further adapted to obtain the three-dimensional object associated with a predetermined keyword from the data storage device 290 according to the predetermined keyword included in the voice information; acquiring position information associated with the three-dimensional object from the data storage device 290 according to the acquired three-dimensional object; and loading the three-dimensional object associated with the position information to the corresponding position in the image according to the position information.

Fig. 5 shows a schematic view of a three-dimensional scene rendering apparatus 500 according to yet another embodiment of the invention. As shown in fig. 5, the three-dimensional scene rendering apparatus 500 resides in the mobile terminal 200, and the mobile terminal 200 includes a data storage device 290 and is communicatively connected to a network server 600. The image recognition module 510, the first display module 520, the voice processing module 530, the second display module 540, and the loading module 550 of the three-dimensional scene presenting apparatus 500 correspond to the image recognition module 410, the first display module 420, the voice processing module 430, the second display module 440, and the loading module 450 of the three-dimensional scene presenting apparatus 400 in fig. 4 one-to-one, respectively, and are consistent, and a playing module 560 connected to the loading module 550 is added, and the playing module 560 is adapted to play voice information while presenting a three-dimensional scene to a user.

Fig. 6 shows a schematic diagram of a network server 600 according to one embodiment of the invention. As shown in fig. 6, the web server 600 is in communication connection with the mobile terminal 200, and the web server 600 stores predetermined keywords associated with a predetermined scene picture therein, and includes a receiving module 610, a voice recognition module 620, an obtaining module 630, and a sending module 640.

The receiving module 610 is adapted to receive voice information transmitted by the mobile terminal 200.

The voice recognition module 620 is connected to the receiving module 610 and is adapted to perform voice recognition on the received voice message.

The obtaining module 630 is connected to the voice recognition module 620, and is adapted to use the recognized preset keyword as a voice recognition result when the voice information includes the preset keyword, and use a voice recognition failure as a voice recognition result when the voice information does not include the preset keyword;

the sending module 640 is connected to the obtaining module 630 and is adapted to send the voice recognition result to the corresponding mobile terminal 200.

Specific steps and embodiments of the three-dimensional scene rendering are disclosed in detail in the description based on fig. 3, and are not described herein again.

The existing three-dimensional scene presenting technology hardly has any interactive behavior with users in the whole processing process, cannot display according to the requirements of the users, does not provide practical manual operation opportunities, has poor user experience, and is particularly low in attraction for specific users such as children. According to the technical scheme of the three-dimensional scene presentation, firstly, a to-be-processed picture is scanned once through a camera and subjected to image recognition, if the to-be-processed picture is a preset scene picture, preset keywords related to the preset scene picture are displayed, voice information provided by a user aiming at the preset keywords is obtained and subjected to voice recognition, the preset keywords included by the voice information are obtained, the to-be-processed picture is scanned twice, the image located in a scanning area is displayed, a three-dimensional object related to the preset keywords included by the voice information is obtained from a data storage device, and the three-dimensional object is loaded on the image, so that the three-dimensional scene is presented to the user. In the technical scheme, the preset keywords related to the preset scene picture are displayed on the screen for the user to select, and the user can select one or more preset keywords at the moment and send the voice information consisting of the preset keywords to the mobile terminal, so that the actual manual operation opportunity is provided, and the interactive experience of the user is enhanced. Moreover, when voice information recognition is performed, considering that the network server generally has higher operation speed and hardware configuration, the process of voice recognition can be executed by calling the network server in communication connection with the mobile terminal, so that on one hand, the efficiency of voice recognition is improved, the time cost is saved, and on the other hand, the operation load of the mobile terminal is reduced. After the picture to be processed is scanned for the second time through the camera, the picture to be processed is subjected to image recognition again, so that the scanned picture is still the preset scene picture, and other pictures which are not the preset scene picture are prevented from being displayed in a screen due to errors of the picture scanned for the second time. And finally, when a three-dimensional scene is presented, the corresponding three-dimensional object is loaded to the corresponding position in the image according to the position information associated with the three-dimensional object to be displayed, and meanwhile, the visual picture is expanded to the audio-visual multi-directional perception for the user to play the voice information provided by the user aiming at the preset keyword, so that the user experience is greatly improved, particularly, the attraction to specific users such as children can be improved, and further, the rapid development of the AR technology is facilitated.

A7. The method according to any one of a1-6, wherein the data storage device stores therein position information associated with the three-dimensional object, the position information being used for displaying the three-dimensional object at a predetermined position in the predetermined scene picture, and the step of retrieving from the data storage device the three-dimensional object associated with a predetermined keyword included in the speech information and loading the three-dimensional object onto the image includes:

acquiring a three-dimensional object associated with a preset keyword from the data storage equipment according to the preset keyword included in the voice information;

according to the obtained three-dimensional object, obtaining position information associated with the three-dimensional object from the data storage device;

and loading the three-dimensional object associated with the position information to a corresponding position in the image according to the position information.

A8. The method of any one of a1-8, further comprising:

and playing the voice information while presenting the three-dimensional scene to the user.

B10. The apparatus of B9, the data storage device having stored therein a set of image features corresponding to a picture of a predetermined scene, the image recognition module further adapted to:

starting a camera to scan a picture to be processed;

acquiring an image of a picture to be processed in a scanning area;

extracting feature points of the image to generate a feature set to be identified;

acquiring the image feature set from data storage equipment, and performing feature matching on the feature set to be identified and the image feature set;

and when the matching is successful, judging that the picture to be processed is a preset scene picture.

B11. The apparatus of claim B10, the set of image features comprising a plurality of image feature points, the image recognition module further adapted to:

carrying out feature matching on the feature set to be identified and the image feature set, and counting the number of feature points successfully matched to serve as the number of matched pairs;

acquiring the number of feature points of a feature set to be identified as a first number, and acquiring the number of image feature points in the image feature set as a second number;

calculating the ratio of the number of the matching pairs to the smaller value of the first number and the second number as the image matching degree;

and when the image matching degree is greater than a first threshold value, judging that the image feature set is successfully matched.

B12. The apparatus according to any of B9-11, wherein the mobile terminal is communicatively connected to a web server, and the web server stores therein predetermined keywords associated with predetermined scene pictures, and the speech processing module is further adapted to:

sending the voice information to a network server, and indicating the network server to perform voice recognition on the voice information so as to judge whether the voice information comprises a preset keyword or not;

and receiving a voice recognition result returned by the network server, and acquiring a preset keyword included in the voice information according to the voice recognition result.

B13. The apparatus of any one of B9-12, the second display module further adapted to:

the picture to be processed is scanned secondarily by the camera, and the image identification module is called to process the picture to be processed

Carrying out image identification on the picture to judge whether the picture to be processed is a preset scene picture;

and when the picture to be processed is a preset scene picture, displaying an image of the picture to be processed in the scanning area in a screen.

B14. The apparatus according to any of the preceding claims B9-13, wherein the data storage device has stored therein position information associated with the three-dimensional object, the position information being used to display the three-dimensional object at a predetermined position in the predetermined scene picture, the loading module being further adapted to:

B15. The method of any of B9-14, further comprising a playback module adapted to:

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. Modules or units or groups in embodiments may be combined into one module or unit or group and may furthermore be divided into sub-modules or sub-units or sub-groups. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the three-dimensional scene rendering method of the present invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer-readable media includes both computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims

1. A three-dimensional scene rendering method adapted to be executed in a mobile terminal comprising a data storage device in which predetermined keywords associated with a predetermined scene picture and three-dimensional objects associated with the predetermined keywords are stored, the method comprising:

scanning a picture to be processed once through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture;

if the picture to be processed is a preset scene picture, displaying a preset keyword associated with the preset scene picture to a user;

acquiring voice information provided by a user aiming at the preset keyword, and carrying out voice recognition on the voice information to judge whether the preset keyword is included and acquire the preset keyword included by the voice information;

scanning the picture to be processed for the second time through the camera, and displaying the picture of the picture to be processed in the scanning area in the screen;

and acquiring a three-dimensional object associated with a preset keyword included in the voice information from a data storage device, and loading the three-dimensional object on the image so as to present a three-dimensional scene to a user.

2. The method as claimed in claim 1, wherein the data storage device stores a set of image characteristics corresponding to a predetermined scene picture, and the step of scanning and recognizing the picture to be processed by the camera once to determine whether the picture to be processed is the predetermined scene picture comprises:

starting a camera to scan a picture to be processed;

acquiring an image of a picture to be processed in a scanning area;

and if the matching is successful, judging that the picture to be processed is a preset scene picture.

3. The method of claim 2, wherein the set of image features comprises a plurality of image feature points, and the step of feature matching the set of features to be identified with the set of image features comprises:

and if the image matching degree is greater than a first threshold value, judging that the image feature set is successfully matched.

4. The method according to any one of claims 1-3, wherein the mobile terminal is communicatively connected to a web server, the web server stores predetermined keywords associated with predetermined scene pictures, and the step of performing voice recognition on the voice message to determine whether the predetermined keywords are included and obtain the predetermined keywords included in the voice message comprises:

5. The method of claim 4, wherein the step of the network server performing voice recognition on the voice message to determine whether the predetermined keyword is included comprises:

receiving voice information sent by a mobile terminal and carrying out voice recognition on the voice information;

if the voice information comprises preset keywords, taking the recognized preset keywords as voice recognition results, and sending the voice recognition results to corresponding mobile terminals;

and if the voice information does not comprise the preset keyword, the failure of voice recognition is used as a voice recognition result, and the voice recognition result is sent to the corresponding mobile terminal.

6. The method according to any one of claims 1-5, wherein the step of scanning the picture to be processed twice through the camera and displaying the image of the picture to be processed in the scanning area in the screen comprises:

secondarily scanning a picture to be processed through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture;

and if the picture to be processed is the preset scene picture, displaying the image of the picture to be processed in the scanning area in the screen.

7. A three-dimensional scene rendering apparatus adapted to reside in a mobile terminal, the mobile terminal including a data storage device having stored therein a predetermined keyword associated with a predetermined scene picture, and a three-dimensional object associated with the predetermined keyword, the apparatus comprising:

the image recognition module is suitable for scanning a picture to be processed once through a camera and carrying out image recognition on the picture to be processed so as to judge whether the picture to be processed is a preset scene picture;

the first display module is suitable for displaying a preset keyword associated with a preset scene picture to a user when the picture to be processed is the preset scene picture;

the voice processing module is suitable for acquiring voice information provided by a user aiming at the preset keyword, and performing voice recognition on the voice information to judge whether the preset keyword is included and acquire the preset keyword included by the voice information;

the second display module is suitable for scanning the picture to be processed for the second time through the camera and displaying the image of the picture to be processed in the scanning area in the screen;

and the loading module is suitable for acquiring the three-dimensional object associated with the preset keyword included in the voice information from the data storage device and loading the three-dimensional object onto the image so as to present a three-dimensional scene to a user.

8. A mobile terminal comprising the three-dimensional scene rendering apparatus according to claim 7.

9. A mobile terminal, comprising:

one or more processors;

a memory;

a camera;

a display screen; and

one or more programs stored in the memory, wherein the one or more programs include instructions for performing any of the methods of claims 1-6 and are configured for execution by the one or more processors to invoke the camera to perform a scanning process to render a three-dimensional scene on the display screen.

10. A three-dimensional scene rendering system comprising:

a plurality of mobile terminals according to claim 8 or 9; and

a network server as claimed in claim 4 or 5.