CN113311936B - AR-based voice commenting method, device, equipment and storage medium - Google Patents

AR-based voice commenting method, device, equipment and storage medium

Info

Publication number
CN113311936B
CN113311936B (application CN202010125581.7A)
Authority
CN
China
Prior art keywords
comment
user
client
identifier
voice
Prior art date
Legal status
Active
Application number
CN202010125581.7A
Other languages
Chinese (zh)
Other versions
CN113311936A
Inventor
范涛
唐健明
曾琦娟
李廷龙
张瑞
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Chengdu ICT Co Ltd
Priority to CN202010125581.7A
Publication of CN113311936A
Application granted
Publication of CN113311936B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00: Commerce
    • G06Q 30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0282: Rating or review of business operators or products
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10: Services
    • G06Q 50/20: Education
    • G06Q 50/205: Education administration or guidance
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/006: Mixed reality
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/08: Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0807: Network architectures or network communication protocols for network security for authentication of entities using tickets, e.g. Kerberos

Abstract

Embodiments of the invention provide an AR-based voice commenting method, device, equipment and storage medium. The method, applied to a client, comprises: recognizing received first voice information entered by a user to obtain a first recognition result, the first recognition result comprising a first keyword; when the first keyword is a preset comment keyword, sending authentication request information to a server, the authentication request information comprising a user identifier, a client identifier and a scene identifier; receiving an authentication token sent by the server; triggering a comment mode according to the first keyword and the authentication token, and receiving second voice information entered by the user; and sending the second voice information, the user identifier, the client identifier and comment scene data to the server, the comment scene data comprising a picture of the current real-object scene and one frame of the virtual image at the moment the user begins to comment. The method enables voice comments on AR content, meets users' commenting needs, and improves the user experience of AR applications.

Description

AR-based voice commenting method, device, equipment and storage medium
Technical Field
The invention relates to the field of digital services, in particular to an AR-based voice commenting method, device, equipment and storage medium.
Background
Augmented Reality (AR) is a technology that fuses virtual information with the real world: virtual models are projected into the real environment by computation, the real surroundings and virtual objects are superimposed in the same picture or space in real time, and users can interact with the virtual information. When AR is applied in real scenes, users often need to comment on or annotate the AR content being played; the prior art meets this need mainly through text input.
However, the prior art has two drawbacks. First, commenting or annotating by typing is only practical on mobile phones or other handheld devices, not on AR glasses or head-mounted devices. Second, manual text entry is slow, so it cannot satisfy users' needs well when commenting on AR content, which degrades the user experience of AR applications.
Disclosure of Invention
Embodiments of the invention provide an AR-based voice commenting method, device, equipment and storage medium, so as to meet users' commenting needs and improve the user experience of AR applications.
In a first aspect, an AR-based voice commenting method is provided, applied to a client, and comprising: recognizing received first voice information entered by a user to obtain a first recognition result, the first recognition result comprising a first keyword; when the first keyword is a preset comment keyword, sending authentication request information to a server, the authentication request information comprising a user identifier, a client identifier and a scene identifier and being used by the server to generate an authentication token from the user identifier, the client identifier and the scene identifier; receiving the authentication token sent by the server; triggering a comment mode according to the first keyword and the authentication token, and receiving second voice information entered by the user; and sending the second voice information, the user identifier, the client identifier and comment scene data to the server, where the comment scene data comprises a picture of the current real-object scene and one frame of the virtual image at the moment the user begins to comment, and the second voice information is the user's comment on the scene corresponding to the comment scene data.
In some implementations of the first aspect, before identifying the received first voice information entered by the user, further comprising: sending configuration request information to a server; and receiving initialization parameters sent by the server according to the configuration request information, wherein the initialization parameters comprise preset comment keywords.
In some implementations of the first aspect, the initialization parameters further include a preset end keyword, and the method further includes: recognizing received third voice information entered by the user to obtain a second recognition result, the second recognition result comprising a second keyword; and exiting the comment mode when the second keyword is the preset end keyword.
In some implementations of the first aspect, the initialization parameter further includes a preset cleaning keyword, and the method further includes: and when the second voice information contains the preset cleaning keyword, cleaning the second voice information.
In some implementations of the first aspect, recognizing the received first voice information entered by the user to obtain a first recognition result includes: obtaining a voice feature sequence according to the first voice information; and identifying the voice characteristic sequence to obtain a first identification result.
In some implementations of the first aspect, sending the second voice information, the user identifier, the client identifier, and the comment scene data to the server includes: performing noise reduction, echo cancellation and coding on the second voice information to obtain comment voice data; and sending the comment voice data, the user identification, the client identification and the comment scene data to a server.
In a second aspect, an AR-based voice commenting method is provided, which is used for a server, and includes: receiving authentication request information sent by a client, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier; authenticating the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier to generate an authentication token; sending an authentication token to the client, wherein the authentication token is used for triggering the comment mode by the client; and receiving second voice information, a user identifier, a client identifier and comment scene data sent by the client, wherein the comment scene data comprise a current real object scene picture and one frame of a virtual image when the user starts to comment, and the second voice information is comment information of a scene corresponding to the comment scene data by the user.
In some implementations of the second aspect, before receiving the authentication request information sent by the client, the method further includes: receiving configuration request information sent by a client; and sending initialization parameters to the client according to the configuration request information, wherein the initialization parameters comprise preset comment keywords.
In some implementations of the second aspect, the initialization parameters further include a preset end keyword and a preset cleaning keyword.
In a third aspect, an AR-based voice commenting apparatus is provided for a client, the apparatus including: the recognition module is used for recognizing the received first voice information input by the user to obtain a first recognition result, and the first recognition result comprises a first keyword; the sending module is used for sending authentication request information to the server when the first keyword is a preset comment keyword, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier, and the authentication request information is used for the server to generate an authentication token according to the user identifier, the client identifier and the scene identifier; the receiving module is used for receiving the authentication token sent by the server; the triggering module is used for triggering the comment mode according to the first keyword and the authentication token and receiving second voice information input by the user; the sending module is further used for sending second voice information, a user identification, a client identification and comment scene data to the server, wherein the comment scene data comprise a current real object scene picture and one frame of a virtual image when the user starts to comment, and the second voice information is comment information of the user on a scene corresponding to the comment scene data.
In some implementations of the third aspect, the apparatus further includes a transceiver module configured to send configuration request information to the server before the received first voice information entered by the user is recognized, and to receive initialization parameters sent by the server according to the configuration request information, the initialization parameters including the preset comment keywords.
In some implementation manners of the third aspect, the initialization parameter further includes a preset ending keyword, and the recognition module is further configured to recognize the received third voice information entered by the user to obtain a second recognition result, where the second recognition result includes the second keyword; and when the second keyword is a preset ending keyword, exiting the comment mode.
In some implementation manners of the third aspect, the initialization parameter further includes a preset cleaning keyword, and the recognition module is further configured to clean the second voice message when the second voice message includes the preset cleaning keyword.
In some implementations of the third aspect, the recognition module is specifically configured to obtain a speech feature sequence according to the first speech information; and identifying the voice characteristic sequence to obtain a first identification result.
In some implementation manners of the third aspect, the sending module is specifically configured to perform noise reduction, echo cancellation, and encoding on the second voice information to obtain comment voice data; and sending the comment voice data, the user identification, the client identification and the comment scene data to a server.
In a fourth aspect, there is provided an AR-based voice commenting apparatus for a server, the apparatus including: the receiving module is used for receiving authentication request information sent by a client, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier; the authentication module is used for authenticating the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier to generate an authentication token; the sending module is used for sending an authentication token to the client, and the authentication token is used for triggering the comment mode by the client; the receiving module is further used for receiving second voice information, a user identifier, a client identifier and comment scene data sent by the client, wherein the comment scene data comprise a current real object scene picture and a frame of a virtual image when the user starts to comment, and the second voice information is comment information of the user on a scene corresponding to the comment scene data.
In some implementations of the fourth aspect, the apparatus further includes a configuration module configured, before the authentication request information sent by the client is received, to receive configuration request information sent by the client, and to send initialization parameters to the client according to the configuration request information, the initialization parameters including the preset comment keywords.
In some implementations of the fourth aspect, the initialization parameters further include a preset end keyword and a preset cleaning keyword.
In a fifth aspect, an AR-based voice commenting device is provided, including a processor and a memory storing computer program instructions; when executing the computer program instructions, the processor implements the AR-based voice commenting method of the first or second aspect, or of some implementations of the first or second aspect.
In a sixth aspect, a computer-readable storage medium is provided, having computer program instructions stored thereon which, when executed by a processor, implement the AR-based voice commenting method of the first or second aspect, or of some implementations of the first or second aspect.
The invention relates to the field of digital services, and in particular to an AR-based voice commenting method, device, equipment and computer-readable storage medium. Received first voice information entered by a user is recognized to obtain a first recognition result comprising a first keyword. When the first keyword is a preset comment keyword, authentication request information is sent to a server, which the server uses to generate an authentication token from a user identifier, a client identifier and a scene identifier. The authentication token sent by the server is received; a comment mode is triggered according to the first keyword and the authentication token, and second voice information entered by the user is received. The second voice information, the user identifier, the client identifier and comment scene data are then sent to the server, where the comment scene data comprises a picture of the current real-object scene and one frame of the virtual image at the moment the user begins to comment, and the second voice information is the user's comment on the corresponding scene. Because the comment mode is triggered by specific preset comment keywords, the user can comment on AR content by voice alone, without typing; the voice input does not interrupt playback of the current AR content, so the user's commenting needs are met and the user experience of AR applications is improved.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the embodiments are briefly described below; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an AR-based voice commenting method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an AR-based voice commenting apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of another AR-based voice commenting apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an AR-based voice commenting device according to an embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the invention are described in detail below. To make the objects, technical solutions and advantages of the invention clearer, the invention is further described with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the invention. It will be apparent to those skilled in the art that the invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the invention through examples.
It is noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that comprises that element.
The term "and/or" herein merely describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone.
AR technology can be applied in many scenarios, such as classroom education. Traditional classrooms generally rely on one-way lecturing and lack supporting courseware or auxiliary tools, so students absorb material inefficiently in a dull learning process. Introducing AR technology allows related virtual scenes to be triggered and displayed in combination with textbook content, presenting complex and abstract concepts vividly. This helps students strengthen their grasp of the material, provides a means of interaction, mobilizes students' enthusiasm, and improves learning efficiency.
While watching AR content, users often need to comment on or annotate it in real time in order to save additional information or share it; in classroom education, for example, students can record notes on the AR content to consolidate their learning.
In the prior art, such comment or annotation needs are generally met by text input, similar to the "bullet comments" that viewers type over playing content on traditional video websites. But text input is only practical on handheld devices such as mobile phones, not on AR glasses or head-mounted devices; moreover, manual text entry is slow when commenting on AR content, so it cannot satisfy users' needs well and degrades the user experience of AR applications.
In order to solve the problems that the existing text commenting mode is not suitable for AR glasses and the commenting efficiency is low, the embodiment of the invention provides an AR-based voice commenting method, device, equipment and medium. The technical solutions of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an AR-based voice commenting method according to an embodiment of the present invention, and as shown in fig. 1, the AR-based voice commenting method may include the following steps:
s101, the client identifies the received first voice information input by the user to obtain a first identification result, and the first identification result comprises a first keyword.
Specifically, the user can enter first voice information through a microphone. The client receives the user's first voice information by monitoring the microphone, extracts a time-varying voice feature sequence from the waveform data of the first voice information, and then recognizes the feature sequence to identify the first keyword in the first voice information.
The first voice information entered by the user can be an instruction such as "I want to comment", "start commenting", "note", "remark" or "record my idea"; the client recognizes the voice information to obtain a first recognition result containing a word such as "comment", "note", "remark", "record" or "idea".
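The keyword check described above can be sketched as a simple match of the recognized text against the preset keyword list. The function name and keyword values below are illustrative assumptions; the patent does not prescribe a concrete API or fix the keyword set.

```python
from typing import Optional

# Hypothetical preset comment keywords, as would be received from the
# server in the initialization parameters (values are illustrative).
COMMENT_KEYWORDS = ["comment", "note", "remark", "record", "idea"]

def extract_keyword(recognized_text: str, keywords) -> Optional[str]:
    """Return the first preset keyword found in the recognition result, or None.
    Sketches the keyword check of step S101; real systems would match
    against acoustic models rather than plain substrings."""
    text = recognized_text.lower()
    for kw in keywords:
        if kw in text:
            return kw
    return None

# A phrase such as "I want to comment" would trigger the comment flow.
assert extract_keyword("I want to comment", COMMENT_KEYWORDS) == "comment"
```

A real client would run this check on the output of the speech recognizer and only then proceed to the authentication request of S102.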
Before identifying the received first voice information input by the user, the client acquires configuration from the server, and specifically includes: the client sends configuration request information to the server, and receives initialization parameters sent by the server according to the configuration request information, wherein the initialization parameters comprise preset comment keywords.
Here, the comment keyword may be preset as a word related to the comment instruction, such as a related word or phrase, for example, "comment", "note", "remark", "record", "idea", and the like.
Specifically, in the AR application, the invocation of the client may be realized by a Software Development Kit (SDK).
S102, when the first keyword is a preset comment keyword, the client sends authentication request information to the server, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier.
First, the client obtains the preset comment keyword from the initialization parameters, compares the first keyword against the acoustic model of the preset keyword, and judges whether the first keyword matches the preset comment keyword.
And then, when the first keyword is consistent with the preset comment keyword, the client sends authentication request information to the server by calling the server-side interface, and requests the server to perform authentication and comment function authorization on the client.
S103, the server authenticates the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier, and generates an authentication token.
The server receives the authentication request information sent by the client, checks the current user, client and scene according to the user identifier, client identifier and scene identifier in the request, judges whether the current user and client may make voice comments in the current scene, and generates an authentication token (token) once the check passes.
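One plausible way for the server to derive such a token from the three identifiers is an HMAC over their concatenation. The secret key, the whitelist check, and all names below are illustrative assumptions; the patent only requires that a token be generated after the check passes, not any particular scheme.

```python
import hashlib
import hmac

# Hypothetical server-held signing key and authorization table
# (illustrative; not prescribed by the patent).
SERVER_SECRET = b"demo-secret"
AUTHORIZED = {("user-1", "client-1", "scene-1")}

def generate_token(user_id, client_id, scene_id):
    """Check the (user, client, scene) triple and return an HMAC token,
    or None if the check fails (sketch of step S103)."""
    if (user_id, client_id, scene_id) not in AUTHORIZED:
        return None  # authentication failed: no comment authorization
    msg = f"{user_id}|{client_id}|{scene_id}".encode()
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).hexdigest()
```

A token derived this way lets the server later verify that a comment upload carries the same identifier triple it authorized, without storing per-session state.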
Before receiving the authentication request information sent by the client, the server further includes: receiving configuration request information sent by a client; and sending initialization parameters to the client according to the configuration request information, wherein the initialization parameters comprise preset comment keywords.
S104, the server sends the authentication token to the client.
The server sends the token to the client, authorizing the client's comment function.
S105, the client triggers a comment mode according to the first keyword and the authentication token, and receives second voice information entered by the user.
On receiving the token sent by the server, the client has passed the server's verification and the comment function is authorized. The client then triggers the comment mode according to the first keyword and the token, can prompt the user that the comment mode has been entered successfully, and receives the second voice information entered by the user.
The second voice information entered by the user may be information entered by the user through voice to comment or remark on the current AR scene or AR content.
As a specific example, in an AR application for classroom education, a student wears AR glasses to view a chemical molecular structure in an immersive teaching scene. The student says "take notes"; the client obtains authorization for the comment function from the server according to the recognized keyword "note", then triggers the recording (comment) mode according to the token returned by the server and the keyword "note", and prompts the student that "recording mode has been entered successfully". After hearing the prompt tone, the student can start recording voice notes on the chemical molecular structure, and the client stores the voice information the student enters.
S106, the client sends the second voice information, the user identifier, the client identifier and the comment scene data to the server.
The client caches the received second voice information and preprocesses it (noise reduction, echo cancellation and encoding) to obtain comment voice data; it then captures a picture of the current real-object scene and one frame of the virtual image at the moment the user begins to comment as the comment scene data, the second voice information being the user's comment on the corresponding scene; finally, it sends the comment voice data, the user identifier, the client identifier and the comment scene data to the server.
Optionally, in an embodiment, before the client sends the comment voice data, the user identifier, the client identifier, and the comment scene data to the server, the client may encapsulate the comment voice data, the user identifier, the client identifier, and the comment scene data, and send an obtained data encapsulation packet to the server through a Hypertext Transfer Protocol (HTTP).
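The encapsulation step might look like the following, with the comment voice data base64-encoded into a JSON body for the HTTP POST. The field names and wire format are assumptions for illustration; the patent does not fix them.

```python
import base64
import json

def build_comment_packet(voice_bytes, user_id, client_id, scene_data):
    """Package comment voice data and identifiers into one JSON body
    (sketch of the encapsulation in S106; field names are illustrative)."""
    return json.dumps({
        "user_id": user_id,
        "client_id": client_id,
        # Preprocessed, encoded comment audio, made text-safe for JSON:
        "voice": base64.b64encode(voice_bytes).decode("ascii"),
        # Real-object scene picture plus one virtual-image frame,
        # e.g. as URLs or base64 strings:
        "scene": scene_data,
    })

packet = build_comment_packet(b"\x00\x01", "user-1", "client-1", {"frame": "f0"})
```

The resulting string would be the request body of an HTTP POST to the server; the scene identifier and authentication token could be added as extra fields in the same dictionary, matching the optional embodiment above.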
Optionally, in one embodiment, the client may package and send the scene identifier and the authentication token to the server together with the comment voice data, the user identifier, the client identifier, and the comment scene data.
S107, the server stores the second voice information, the user identification, the client identification and the comment scene data.
When the server receives the second voice information, user identifier, client identifier and comment scene data sent by the client, it stores them: the second voice information and comment scene data are saved to a file system or distributed system to obtain a data storage path, and the storage path, user identifier and client identifier are then saved together in a database.
Specifically, the server may store the received comment voice data, the user identifier, the client identifier, the scene identifier, the authentication token, and the comment scene data sent by the client, store the comment voice data and the comment scene data in a file system or a distributed system to obtain a data storage path, and then store the data storage path, the user identifier, the client identifier, the scene identifier, and the authentication token together in the database.
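The store-blob-then-index pattern described here can be sketched as follows: the audio goes to the file system, and only the resulting path plus the identifiers go into the database. The table schema, file naming, and use of SQLite are illustrative assumptions.

```python
import os
import sqlite3
import tempfile

# Hypothetical storage locations (illustrative; a production server would
# use a distributed file store and a shared database).
storage_dir = tempfile.mkdtemp()
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (user_id TEXT, client_id TEXT, path TEXT)")

def store_comment(user_id, client_id, voice_bytes):
    """Persist the comment audio to the file system and index the
    storage path by (user, client) in the database (sketch of S107)."""
    path = os.path.join(storage_dir, f"{user_id}-{client_id}.pcm")
    with open(path, "wb") as f:
        f.write(voice_bytes)  # blob goes to the file/distributed system
    # Only the path and identifiers are stored in the database:
    db.execute("INSERT INTO comments VALUES (?, ?, ?)", (user_id, client_id, path))
    return path

stored_path = store_comment("user-1", "client-1", b"\x00\x01")
```

Keeping large blobs out of the database and indexing them by path is what makes the visual management mentioned below cheap: listing a user's comments is a single table query.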
The server can also provide visual management for the stored comment voice data.
Optionally, in an embodiment, the server may further parse the received data encapsulation packet sent by the client, and store the obtained parsed data.
In some embodiments, the initialization parameters may further include a preset cleaning keyword. The user may enter second voice information through the microphone; the client receives it by monitoring the microphone, extracts a time-varying voice feature sequence from its waveform data, recognizes the feature sequence to identify keywords in the second voice information, and cleans the second voice information when it contains the preset cleaning keyword.
The cleaning keyword may be preset to any word or phrase that indicates the comment should be discarded, such as "clear" or "invalid". For example, the second voice information entered by the user may be a piece of comment voice such as "clear the comment, the recorded information is invalid"; the client recognizes this voice information and obtains words such as "clear" and "invalid".
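A minimal sketch of the cleaning-keyword check, assuming the recognizer has already produced text and using illustrative keywords (the real list would come from the initialization parameters):

```python
CLEAN_KEYWORDS = {"clear", "invalid"}  # illustrative; supplied by init parameters

def should_clean(recognized_text: str, clean_keywords=CLEAN_KEYWORDS) -> bool:
    """Return True when the recognized comment text contains any preset
    cleaning keyword, signalling that the recorded comment be discarded."""
    words = recognized_text.lower().split()
    return any(kw in words for kw in clean_keywords)
```

So `should_clean("clear the comment, the recorded information is invalid")` triggers cleaning, while ordinary comment text does not.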
In some embodiments, the initialization parameters may further include a preset end keyword. The user may enter third voice information through a microphone; the client receives the user's third voice information by monitoring the microphone, extracts a time-varying voice feature sequence from the voice waveform data of the third voice information, recognizes the sequence to identify a second keyword in the third voice information, and exits the comment mode when the second keyword is the preset end keyword or when the user produces no voice within a preset time (e.g., 5 seconds).
Here, the end keyword may be preset to any word or phrase that indicates the comment is finished, such as "complete" or "end". The third voice information entered by the user may be an instruction such as "I want to end the comment" or "I have completed the comment"; the client recognizes this voice information and obtains a second recognition result containing words such as "end" or "complete".
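The two exit conditions — an end keyword in the recognized text, or silence beyond the preset time — can be sketched as follows (keywords, timeout value, and the timestamp convention are assumptions for illustration):

```python
END_KEYWORDS = {"end", "complete"}  # illustrative; supplied by init parameters
SILENCE_TIMEOUT_S = 5.0             # "no voice within a preset time (e.g., 5 seconds)"

def should_exit(recognized_text: str, last_voice_ts: float, now: float,
                end_keywords=END_KEYWORDS, timeout=SILENCE_TIMEOUT_S) -> bool:
    """Exit the comment mode when the second keyword is a preset end
    keyword, or when the user has been silent longer than `timeout`."""
    if recognized_text:
        words = recognized_text.lower().split()
        if any(kw in words for kw in end_keywords):
            return True
    return (now - last_voice_ts) > timeout
```

Passing `now` explicitly keeps the check deterministic; a real client would use a monotonic clock updated on each microphone frame.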
In some embodiments, the initialization parameters may further include: the maximum supported voice duration, the maximum number of voice comments, the voice data format, the image format, the data transmission timeout, and the like.
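Gathering the initialization parameters mentioned above into one structure might look like this — all field names and default values are illustrative, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class InitParams:
    """Initialization parameters the server returns for a configuration request."""
    comment_keywords: list = field(default_factory=lambda: ["comment"])
    end_keywords: list = field(default_factory=lambda: ["end", "complete"])
    cleaning_keywords: list = field(default_factory=lambda: ["clear", "invalid"])
    max_voice_seconds: int = 60        # maximum supported voice duration
    max_voice_comments: int = 100      # maximum number of voice comments
    voice_format: str = "opus"         # voice data format
    image_format: str = "jpeg"         # image format
    transfer_timeout_s: float = 10.0   # data transmission timeout
```

The client would receive such a structure in response to its configuration request and keep it for the lifetime of the session.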
According to the AR-based voice commenting method described above, the first voice information entered by the user is recognized to obtain a first recognition result containing a first keyword. When the first keyword is a preset comment keyword, the comment function is authorized by sending authentication request information and the comment mode is triggered, enabling the user to comment on the current AR scene by voice in real time. The voice information does not interrupt the playback of the current AR content, which improves commenting efficiency, meets the user's commenting needs, and improves the user's experience of the AR application.
Fig. 2 is a schematic structural diagram of an AR-based voice commenting apparatus according to an embodiment of the present invention, which is used for a client. As shown in fig. 2, the AR-based voice commenting apparatus 200 may include a recognition module 201, a sending module 202, a receiving module 203, and a triggering module 204.
The recognition module 201 is configured to recognize received first voice information input by a user to obtain a first recognition result, where the first recognition result includes a first keyword.
The sending module 202 is configured to send authentication request information to the server when the first keyword is a preset comment keyword, where the authentication request information includes a user identifier, a client identifier, and a scene identifier, and the authentication request information is used by the server to generate an authentication token according to the user identifier, the client identifier, and the scene identifier.
The receiving module 203 is configured to receive the authentication token sent by the server.
And the triggering module 204 is used for triggering the comment mode according to the first keyword and the authentication token and receiving second voice information input by the user.
The sending module 202 is further configured to send the second voice information, the user identifier, the client identifier, and comment scene data to the server, where the comment scene data includes the current real-object scene picture and a frame of the virtual image at the moment the user starts to comment, and the second voice information is the user's comment on the scene corresponding to the comment scene data.
In some embodiments, the apparatus further includes a transceiver module configured to, before the received first voice information entered by the user is recognized, send configuration request information to the server and receive initialization parameters sent by the server according to the configuration request information, where the initialization parameters include the preset comment keyword.
In some embodiments, the initialization parameter further includes a preset ending keyword, and the recognition module 201 is further configured to recognize the received third voice information input by the user to obtain a second recognition result, where the second recognition result includes the second keyword; and when the second keyword is a preset ending keyword, exiting the comment mode.
In some embodiments, the initialization parameters further include a preset cleaning keyword, and the recognition module 201 is further configured to clean the second voice information when it includes the preset cleaning keyword.
In some embodiments, the recognition module 201 is specifically configured to obtain a speech feature sequence according to the first speech information; and identifying the voice characteristic sequence to obtain a first identification result.
In some embodiments, the sending module 202 is specifically configured to perform noise reduction, echo cancellation, and encoding on the second voice information to obtain comment voice data; and sending the comment voice data, the user identification, the client identification and the comment scene data to a server.
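Once the sending module has produced the comment voice data (noise reduction, echo cancellation, and encoding are elided here), the payload sent to the server can be sketched as a single packet — JSON, base64, and every field name are assumptions, since the patent does not fix a wire format:

```python
import base64
import json

def build_comment_payload(comment_voice: bytes, user_id: str, client_id: str,
                          scene_id: str, token: str, scene_data: dict) -> str:
    """Package the already-encoded comment voice data with the identifiers
    and comment scene data into one packet for the server."""
    return json.dumps({
        "voice": base64.b64encode(comment_voice).decode("ascii"),
        "user_id": user_id,
        "client_id": client_id,
        "scene_id": scene_id,
        "token": token,
        # current real-object scene picture plus one frame of the virtual image
        "scene_data": scene_data,
    })

packet = build_comment_payload(b"\x01\x02", "user-1", "client-1",
                               "scene-1", "tok-abc",
                               {"real_picture": "cap.jpg", "virtual_frame": 7})
```

This matches the optional embodiment in which the scene identifier and authentication token are packaged together with the comment voice data and identifiers.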
The AR-based voice commenting apparatus described above, used for a client, recognizes received first voice information entered by the user to obtain a first recognition result containing a first keyword; sends authentication request information to the server when the first keyword is a preset comment keyword; receives the authentication token sent by the server; triggers the comment mode according to the first keyword and the authentication token and receives second voice information entered by the user; and sends the second voice information, the user identifier, the client identifier, and the comment scene data to the server. The comment mode can thus be triggered by a specific preset comment keyword: the user does not need to type, and a voice comment on the AR content is made simply by speaking. The voice information does not interrupt the playback of the current AR content, which meets the user's commenting needs and improves the user's experience of the AR application.
It can be understood that the AR-based voice commenting apparatus 200 according to the embodiment of the present invention may correspond to a client executing the embodiment shown in fig. 1, and specific details of operations and/or functions of each module/unit of the AR-based voice commenting apparatus 200 may refer to descriptions of corresponding parts in the AR-based voice commenting method in the embodiment shown in fig. 1, and are not described herein again for brevity.
Fig. 3 is a schematic structural diagram of another AR-based voice commenting apparatus according to an embodiment of the present invention, which is used for a server, and as shown in fig. 3, the AR-based voice commenting apparatus 300 may include: a receiving module 301, an authentication module 302, and a transmitting module 303.
The receiving module 301 is configured to receive authentication request information sent by a client, where the authentication request information includes a user identifier, a client identifier, and a scene identifier.
And the authentication module 302 is configured to authenticate the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier, and generate an authentication token.
A sending module 303, configured to send an authentication token to the client, where the authentication token is used for the client to trigger the comment mode.
The receiving module 301 is further configured to receive the second voice information, the user identifier, the client identifier, and the comment scene data sent by the client, where the comment scene data includes the current real-object scene picture and a frame of the virtual image at the moment the user starts to comment, and the second voice information is the user's comment on the scene corresponding to the comment scene data.
In some embodiments, the apparatus further includes a configuration module configured to, before the authentication request information sent by the client is received, receive configuration request information sent by the client and send initialization parameters to the client according to the configuration request information, where the initialization parameters include the preset comment keyword.
In some embodiments, the initialization parameters further include a preset end keyword and a preset cleaning keyword.
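The authentication module's token generation (module 302 above) derives the token from the user, client, and scene identifiers; one hedged way to sketch that is an HMAC over the three identifiers — the HMAC construction, the secret, and the separator are assumptions, as the patent does not specify how the token is computed:

```python
import hashlib
import hmac

SERVER_SECRET = b"demo-secret"  # illustrative; real key management is out of scope

def issue_token(user_id: str, client_id: str, scene_id: str,
                secret: bytes = SERVER_SECRET) -> str:
    """Generate an authentication token bound to the user identifier,
    client identifier, and scene identifier, after authentication passes."""
    msg = "|".join((user_id, client_id, scene_id)).encode("utf-8")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

token = issue_token("user-1", "client-1", "scene-1")
```

Because the token is a deterministic function of the three identifiers and the server secret, the server can later verify that a comment packet's token matches its identifiers without storing per-session state.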
The AR-based voice commenting apparatus described above, used for a server, receives authentication request information sent by the client, where the authentication request information includes a user identifier, a client identifier, and a scene identifier; authenticates the user, the client, and the current scene according to those identifiers to generate an authentication token; sends the authentication token to the client, where it is used to trigger the comment mode; and receives the second voice information, the user identifier, the client identifier, and the comment scene data sent by the client. Authorization of the client's comment function can thus be completed, so that the user can enter the voice comment mode and comment on the current AR scene, which meets the user's commenting needs and improves the user's experience of the AR application.
It can be understood that the AR-based voice commenting apparatus 300 according to the embodiment of the present invention may correspond to a server executing the embodiment shown in fig. 1, and specific details of the operation and/or function of each module/unit of the AR-based voice commenting apparatus 300 may refer to the description of corresponding parts in the AR-based voice commenting method in the embodiment shown in fig. 1, and are not repeated herein for brevity.
Fig. 4 is a schematic diagram of a hardware structure of an AR-based voice commenting device according to an embodiment of the present invention.
As shown in fig. 4, the AR-based voice commenting device 400 in the present embodiment includes an input device 401, an input interface 402, a central processor 403, a memory 404, an output interface 405, and an output device 406. The input interface 402, the central processor 403, the memory 404, and the output interface 405 are connected to one another through a bus 410, and the input device 401 and the output device 406 are connected to the bus 410 through the input interface 402 and the output interface 405, respectively, and thereby to the other components of the AR-based voice commenting device 400.
Specifically, the input device 401 receives input information from the outside and transmits the input information to the central processor 403 through the input interface 402; the central processor 403 processes the input information based on computer-executable instructions stored in the memory 404 to generate output information, stores the output information temporarily or permanently in the memory 404, and then transmits the output information to the output device 406 through the output interface 405; the output device 406 outputs the output information to the outside of the AR-based voice commenting device 400 for use by the user.
That is, the AR-based voice commenting apparatus shown in fig. 4 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer executable instructions, may implement the AR-based voice commenting method described in connection with the example shown in fig. 1.
In one embodiment, the AR-based voice commenting device 400 illustrated in fig. 4 includes: a memory 404 for storing a program; and a central processor 403 configured to execute the program stored in the memory to perform the method of the embodiment shown in fig. 1 according to the embodiment of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium has computer program instructions stored thereon; which when executed by a processor implement the method of the embodiment shown in fig. 1 provided by an embodiment of the present invention.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, read-only memories (ROMs), flash memories, erasable ROMs (EROMs), floppy disks, CD-ROMs, optical disks, hard disks, fiber-optic media, radio-frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments noted in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
As described above, only the specific embodiments of the present invention are provided, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present invention is not limited thereto, and any equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present invention.

Claims (13)

1. An AR-based voice commenting method, which is used for a client, and comprises the following steps:
identifying received first voice information input by a user to obtain a first identification result, wherein the first identification result comprises a first keyword;
when the first keyword is a preset comment keyword, sending authentication request information to a server, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier, and the authentication request information is used for the server to generate an authentication token according to the user identifier, the client identifier and the scene identifier;
receiving the authentication token sent by the server;
triggering a comment mode according to the first keyword and the authentication token, and receiving second voice information input by the user;
and sending the second voice information, the user identifier, the client identifier and comment scene data to the server, wherein the comment scene data comprises a current real object scene picture and a frame of a virtual image when the user starts to comment, and the second voice information is comment information of the user on a scene corresponding to the comment scene data.
2. The method of claim 1, wherein prior to the identifying the received first voice information entered by the user, the method further comprises:
sending configuration request information to the server;
and receiving initialization parameters sent by the server according to the configuration request information, wherein the initialization parameters comprise the preset comment keywords.
3. The method of claim 2, wherein the initialization parameters further include a preset end keyword, the method further comprising:
identifying the received third voice information input by the user to obtain a second identification result, wherein the second identification result comprises a second keyword;
and when the second keyword is a preset ending keyword, exiting the commenting mode.
4. The method of claim 2, wherein the initialization parameters further include a preset cleaning keyword, the method further comprising:
and when the second voice information contains the preset cleaning keywords, cleaning the second voice information.
5. The method according to claim 1, wherein the recognizing the received first voice information entered by the user to obtain a first recognition result comprises:
obtaining a voice feature sequence according to the first voice information;
and identifying the voice feature sequence to obtain the first identification result.
6. The method of claim 1, wherein the sending the second voice information, the user identifier, the client identifier, and comment context data to the server comprises:
performing noise reduction, echo cancellation and coding on the second voice information to obtain comment voice data;
and sending the comment voice data, the user identification, the client identification and the comment scene data to the server.
7. An AR-based voice commenting method, wherein the method is used for a server, and the method comprises the following steps:
receiving authentication request information sent by a client, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier;
authenticating the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier to generate an authentication token;
sending the authentication token to the client, wherein the authentication token is used for triggering a comment mode by the client;
and receiving second voice information, the user identifier, the client identifier and comment scene data sent by the client, wherein the comment scene data comprises a current real object scene picture and a frame of a virtual image when the user starts to comment, and the second voice information is comment information of the user on a scene corresponding to the comment scene data.
8. The method according to claim 7, wherein before the receiving the authentication request information sent by the client, the method further comprises:
receiving configuration request information sent by the client;
and sending initialization parameters to the client according to the configuration request information, wherein the initialization parameters comprise preset comment keywords.
9. The method of claim 8, wherein the initializing parameters further comprises: and presetting an end keyword and a cleaning keyword.
10. An AR-based voice commenting device, wherein the device is used for a client, the device comprising:
the recognition module is used for recognizing received first voice information input by a user to obtain a first recognition result, and the first recognition result comprises a first keyword;
the sending module is used for sending authentication request information to a server when the first keyword is a preset comment keyword, wherein the authentication request information comprises a user identifier, a client identifier and a scene identifier, and the authentication request information is used for the server to generate an authentication token according to the user identifier, the client identifier and the scene identifier;
a receiving module, configured to receive the authentication token sent by the server;
the triggering module is used for triggering a comment mode according to the first keyword and the authentication token and receiving second voice information input by the user;
the sending module is further configured to send the second voice information, the user identifier, the client identifier and comment scene data to the server, where the comment scene data includes a current real object scene picture and a frame of a virtual image when the user starts to comment, and the second voice information is comment information of a scene corresponding to the comment scene data by the user.
11. An AR-based voice commenting device, wherein the device is used for a server, the device comprising:
the system comprises a receiving module, a judging module and a sending module, wherein the receiving module is used for receiving authentication request information sent by a client, and the authentication request information comprises a user identifier, a client identifier and a scene identifier;
the authentication module is used for authenticating the user, the client and the current scene according to the user identifier, the client identifier and the scene identifier to generate an authentication token;
the sending module is used for sending the authentication token to the client, and the authentication token is used for triggering the comment mode by the client;
the receiving module is further configured to receive second voice information, the user identifier, the client identifier and comment scene data sent by the client, where the comment scene data includes a current real object scene picture and a frame of a virtual image when the user starts to comment, and the second voice information is comment information of a scene corresponding to the comment scene data by the user.
12. An AR-based voice commenting device, comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements the AR-based voice commenting method according to any one of claims 1 to 9.
13. A computer-readable storage medium, wherein computer program instructions are stored thereon, which when executed by a processor, implement the AR-based voice commenting method according to any one of claims 1 to 9.
CN202010125581.7A 2020-02-27 2020-02-27 AR-based voice commenting method, device, equipment and storage medium Active CN113311936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125581.7A CN113311936B (en) 2020-02-27 2020-02-27 AR-based voice commenting method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113311936A CN113311936A (en) 2021-08-27
CN113311936B true CN113311936B (en) 2022-12-02

Family

ID=77370465


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186555A (en) * 2011-12-28 2013-07-03 腾讯科技(深圳)有限公司 Evaluation information generation method and system
CN103440603A (en) * 2013-08-30 2013-12-11 苏州跨界软件科技有限公司 Order system based on augmented reality
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
JP2017037212A (en) * 2015-08-11 2017-02-16 セイコーエプソン株式会社 Voice recognizer, control method and computer program
CN107038361A (en) * 2016-10-13 2017-08-11 阿里巴巴集团控股有限公司 Service implementation method and device based on virtual reality scenario
CN109087639A (en) * 2018-08-02 2018-12-25 泰康保险集团股份有限公司 Method for voice recognition, device, electronic equipment and computer-readable medium
CN109191180A (en) * 2018-08-06 2019-01-11 百度在线网络技术(北京)有限公司 The acquisition methods and device of evaluation
CN110139164A (en) * 2019-06-17 2019-08-16 北京小桨搏浪科技有限公司 A kind of voice remark playback method, device, terminal device and storage medium
CN110472099A (en) * 2018-05-10 2019-11-19 腾讯科技(深圳)有限公司 Interdynamic video generation method and device, storage medium
CN110686354A (en) * 2019-10-12 2020-01-14 宁波奥克斯电气股份有限公司 Voice air conditioner control method, voice air conditioner control device and air conditioner

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3489943A4 (en) * 2016-07-19 2019-07-24 FUJIFILM Corporation Image display system, head-mounted-display control device, and method and program for actuating same
US10685386B2 (en) * 2016-11-30 2020-06-16 Bank Of America Corporation Virtual assessments using augmented reality user devices


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"增强现实技术的汉语教学研究";曾丹 等;《语文学刊》;20171025;第37卷(第5期);第163-166页 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant