CN116300092B - Control method, device and equipment of intelligent glasses and storage medium - Google Patents


Info

Publication number: CN116300092B
Authority: CN (China)
Prior art keywords: image, voice, target, user, candidate
Legal status: Active
Application number: CN202310248297.2A
Other languages: Chinese (zh)
Other versions: CN116300092A (en)
Inventor: 刘俊启 (Liu Junqi)
Current Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.
Original Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.
Application filed by Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority to CN202310248297.2A
Publication of CN116300092A
Application granted
Publication of CN116300092B

Classifications

    • G06V 20/20: Scenes; scene-specific elements in augmented reality scenes (G Physics; G06 Computing; G06V Image or video recognition or understanding)
    • G02B 27/017: Head-up displays; head mounted (G Physics; G02 Optics; G02B Optical elements, systems or apparatus)
    • G02B 2027/0138: Head-up displays characterised by optical features comprising image capture systems, e.g. camera
    • G02B 2027/014: Head-up displays characterised by optical features comprising information/image processing systems
    • G02B 2027/0178: Head mounted; eyeglass type

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Optics & Photonics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a control method, apparatus, device and storage medium for smart glasses, and relates to the technical field of image processing, in particular to the technical fields of artificial intelligence, voice technology, intelligent search and the like. The specific implementation scheme is as follows: collecting images in real time through a camera of the smart glasses and storing the images into an image sequence; acquiring the movement state of the smart glasses in response to collected voice information; selecting a target image of interest to the user from the image sequence based on the voice information and the movement state; obtaining marks of candidate objects in the target image; displaying the mark of each candidate object in the target image; and performing voice interaction with the user based on the marks of the candidate objects to obtain the target object of interest to the user among the candidate objects. According to the embodiments of the present disclosure, fine-grained recognition of image content can be achieved based on the user's voice information, which reduces the complexity and operation cost of searching on the smart glasses and improves the user experience.

Description

Control method, device and equipment of intelligent glasses and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to the technical fields of artificial intelligence, speech technology, intelligent search, and the like.
Background
In recent years, with the development of artificial intelligence, deep learning and other technologies, more and more intelligent devices have appeared in people's lives. Smart glasses, like a smart phone, have an independent operating system. The emergence of smart glasses has changed our lives: they can be used for functions such as follow shooting and navigation. However, smart glasses cannot be operated through a touch screen the way a smart phone can, so how to control smart glasses is a concern in the industry.
Disclosure of Invention
The disclosure provides a control method, apparatus, device and storage medium for smart glasses.
According to an aspect of the present disclosure, there is provided a control method of smart glasses, including:
collecting images in real time through a camera of the smart glasses, and storing the images into an image sequence;
acquiring a movement state of the smart glasses in response to collected voice information;
selecting a target image of interest to the user from the image sequence based on the voice information and the movement state;
obtaining marks of candidate objects in the target image;
displaying the mark of each candidate object in the target image; and
performing voice interaction with the user based on the marks of the candidate objects to obtain a target object of interest to the user among the candidate objects.
According to another aspect of the present disclosure, there is provided a control device for smart glasses, including:
an image acquisition module, configured to acquire images in real time through a camera of the smart glasses and store the images into an image sequence;
a state processing module, configured to acquire a movement state of the smart glasses in response to collected voice information;
a first determining module, configured to select a target image of interest to the user from the image sequence based on the voice information and the movement state;
an acquisition module, configured to acquire marks of candidate objects in the target image;
a display module, configured to display the mark of each candidate object in the target image; and
a second determining module, configured to perform voice interaction with the user based on the marks of the candidate objects to obtain a target object of interest to the user among the candidate objects.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a wearable device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method according to any of the embodiments of the present disclosure when the computer program is executed.
In the embodiments of the present disclosure, the user's intent can be analyzed based on voice interaction and the movement state of the smart glasses, and the target image of interest to the user can be screened out from the image sequence and displayed for interaction. The image content of interest is then accurately located based on the marks, so that the user can search image content on the smart glasses with lower complexity and operation cost, which improves the user experience.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of controlling smart glasses according to an embodiment of the present disclosure;
FIG. 2a is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 2b is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 3 is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 4a is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 4b is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 5 is a schematic illustration of a target image according to another embodiment of the present disclosure;
FIG. 6a is a schematic diagram of a display target image according to an embodiment of the present disclosure;
FIG. 6b is a schematic diagram of a display target image according to another embodiment of the present disclosure;
fig. 7 is a schematic structural view of a control device of smart glasses according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a control method of smart glasses according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Smart glasses, as wearable intelligent devices, bring richer content to people's lives. However, unlike a smart phone, smart glasses can hardly be interacted with through a touch screen. This is because images are acquired in real time while the user wears the smart glasses, and the image content changes as the captured scene changes. If the user is interested in a certain picture, it is essentially impossible to select the image of interest through a touch screen, let alone to perform corresponding operations on it.
In view of this, the embodiments of the present disclosure provide a control method for smart glasses, which can initially locate an image of interest to the user and, taking the interaction characteristics of smart glasses into account, provides a solution for conveniently and accurately interacting with the user on the smart glasses to obtain the local image content of interest. As shown in fig. 1, the flow of the control method of smart glasses according to an embodiment of the disclosure includes the following:
S101, acquiring images in real time through a camera of the smart glasses, and storing the images into an image sequence.
In some embodiments, the collected images may be stored in a classified manner based on changes in camera position and pose: multiple frames captured at the same position and shooting pose are stored together, so that the images within the same view range form one image set. For example, as shown in fig. 2a, the image sequence may comprise a plurality of image sets. The image within the view range of fig. 2a, which includes a tree, two patches of grass, a kitten, a rabbit and two pedestrians, can be saved to an image set 1. If the user turns his head to look at a tree to the side, the images captured at this new position and shooting pose are saved into another image set 2; the images in this view range include two trees, four bushes and a monkey. When the images are stored, the images in each image set are ordered by acquisition time point, and the image sets themselves are likewise ordered by acquisition time point, thereby forming the image sequence. In the embodiments of the present disclosure, the images are classified and stored based on position and shooting pose; since the sensors built into the smart glasses can obtain position and pose conveniently, accurately and with low power consumption, this approach classifies and stores images efficiently and quickly, and is of practical significance on smart glasses.
In other embodiments, the image sequence may also be the images acquired over a period of time, sorted by acquisition time. For example, multiple frames may be stored into an image sequence corresponding to a preset time window based on each image's acquisition time point. If the preset window is 10 s, the images acquired within the last 10 s are kept in the image sequence, and the sequence is continuously updated as new images arrive, so that only the images from the most recent 10 s are stored; for example, each time an image frame is acquired, the image with the earliest acquisition time in the sequence is deleted and the newest image is stored. In addition, in order to preserve the key images within a period of time, the image content can be analyzed, and an image is stored into the sequence only when its content changes significantly relative to the previously stored frame. In this way, key frames can be kept within limited storage space to provide data for later user interaction.
It should be noted that the embodiments of the present disclosure do not limit how the image sequence is constructed; any approach in which the images collected by the smart glasses are stored into an image sequence for later interaction is applicable to the embodiments of the present disclosure.
When an image is stored into the image sequence, the image tag and the image are stored in one-to-one correspondence, as shown in fig. 2b, where "image 1" is the image tag and 03.05.10.06.01 is the acquisition time, representing 10:06:01 on March 5.
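The disclosure does not prescribe a concrete data structure for the image sequence. As a rough sketch of the rolling-window variant described above, assuming a 10 s window and invented class and field names, one might write:

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class TaggedFrame:
    tag: str          # image tag, e.g. "image 1"
    timestamp: float  # acquisition time, seconds since the epoch
    pixels: object    # the frame itself; a real system would hold an image buffer here

class ImageSequence:
    """Rolling buffer keeping only the frames acquired in the last window_s seconds."""

    def __init__(self, window_s=10.0):
        self.window_s = window_s
        self.frames = deque()
        self._counter = 0

    def add(self, pixels, now=None):
        now = time.time() if now is None else now
        self._counter += 1
        frame = TaggedFrame(tag="image %d" % self._counter, timestamp=now, pixels=pixels)
        self.frames.append(frame)
        # Drop frames that fell out of the time window, oldest first.
        while self.frames and now - self.frames[0].timestamp > self.window_s:
            self.frames.popleft()
        return frame
```

The keyframe variant would additionally compare each incoming frame against the last stored one and call add only on a significant content change.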
S102, acquiring the movement state of the smart glasses in response to the collected voice information.
The voice information is that of the user wearing the smart glasses. It may include a wake-up word that triggers the control function provided by the embodiments of the present disclosure.
In other embodiments, the voice information may also be audio that expresses the user's interest in the surroundings, for example, "I like that cat over there". In short, the voice information indicates that the user has a need to search image content.
The movement state can be obtained from a gyroscope built into the smart glasses; a gyroscope is small and consumes little power, making it suitable for obtaining the movement state of the smart glasses.
S103, selecting a target image of interest to the user from the image sequence based on the voice information and the movement state.
The user's voice information can reflect the user's intent. The movement state of the smart glasses can also reflect the user's intent to a certain extent, in particular whether the user's point of interest has changed. Therefore, the image of interest to the user can be judged comprehensively based on the voice information and the movement state of the smart glasses.
The smart glasses collect images within a certain area, and when a user wearing the smart glasses becomes interested in the surroundings, the interest is often in one or more local details. Thus, to facilitate accurately locating the user's point of interest, in the disclosed embodiment, in S104, the mark of each candidate object in the target image is acquired.
A candidate object is a local region of the target image; as shown in fig. 2a, the tree is one candidate object and the kitten is another.
In implementation, the smart glasses can perform object detection on the target image to obtain each candidate object in it, and mark each candidate object according to the detection result.
Alternatively, the smart glasses can send the image to a cloud server, which performs object detection on the image to obtain the mark of each candidate object.
To obtain the marks of candidate objects quickly, and thereby facilitate determining the target object of interest, object detection may be performed on the images in the image sequence in real time, and the candidate objects of each image and their corresponding marks stored. In this way, once the target image is determined, the marks of its candidate objects can be read directly from the stored content.
After the marks of the candidate objects are obtained, in order to facilitate accurately locating the content actually of interest to the user, in S105 the marks of the candidate objects are displayed in the target image.
That is, in the target image, each candidate object is marked, and the user can distinguish different candidate objects through the marks.
S106, performing voice interaction with the user based on the marks of the candidate objects, so as to obtain the target object of interest to the user among the candidate objects.
In the embodiments of the present disclosure, because the smart glasses acquire images in real time and the image content can change in real time, the images that may later be needed are kept in the image sequence. On this basis, the user's intent can be analyzed from the voice interaction and the movement state of the smart glasses, and the target image of interest can be screened out from the image sequence and displayed for interaction. To accurately locate the local image content of interest, the marks of the candidate objects are displayed synchronously with the target image, so that the user can easily distinguish different candidate objects, and the object of actual interest can then be determined through voice interaction around the marks. In summary, the processing flow of the embodiments of the present disclosure can accurately locate the target image according to the usage characteristics of smart glasses, and accurately locate the image content of interest based on the marks, so that the user can search image content on the smart glasses with lower complexity and operation cost, which improves the user experience.
Since the smart glasses may face various use cases, several important aspects of the control method provided by the embodiments of the present disclosure are described below:
1) Turning the voice interaction function on and off
When a user wears smart glasses, voice interaction is not necessarily needed at all times, and keeping the voice interaction function always on would cause high power consumption. In view of this, in the embodiments of the present disclosure, the voice interaction function of the provided control method can be turned on and off as appropriate.
For example, in some possible embodiments, the movement state of the smart glasses may be monitored, and the voice interaction function automatically turned on when the movement speed of the smart glasses falls below a speed threshold. That is, if the user slows down while moving, the user may be interested in the surroundings; so when the smart glasses are moving slowly, the voice interaction function can be started, and the target image then determined based on the voice information and the movement state of the smart glasses. This both reduces the power consumption of the smart glasses and meets the user's need to search image content.
Correspondingly, when the movement speed of the smart glasses rises above a preset speed, the voice interaction function can be turned off automatically. Alternatively, when the movement speed is higher than the preset speed, the user can be asked whether to close the image content search function, and the voice interaction function is closed once the user confirms.
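As an illustration of this on/off logic, a minimal sketch follows; the class name and the threshold value are assumptions, since the disclosure only speaks of "a speed threshold":

```python
class VoiceInteractionSwitch:
    """Turns the voice interaction function on or off based on the movement speed."""

    def __init__(self, speed_threshold=0.5):  # m/s; assumed value, not from the disclosure
        self.speed_threshold = speed_threshold
        self.enabled = False

    def on_speed_update(self, speed):
        if speed < self.speed_threshold:
            self.enabled = True   # user slowed down: likely interested in the surroundings
        else:
            self.enabled = False  # moving fast: turn off to save power
                                  # (or first ask the user to confirm, as described above)
```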
2) Determining the target image of interest to the user
In some embodiments, if a user wearing smart glasses is interested in something in the surrounding environment, the user typically stops first. This can be determined by detecting the movement state of the smart glasses, which in this case generally indicates a stationary state. At that point, the last frame acquired in real time by the camera of the smart glasses could simply be taken as the target image. However, considering that the surroundings include both static things and things that may move, in this embodiment of the disclosure the last acquired frame is selected as the target image only when image content analysis shows that the content within the acquisition range is substantially unchanged. Specifically, this may be implemented as follows:
Step A1: when the movement state indicates that the smart glasses remained stationary while the voice information was collected, determining the amount of change of the image content in the image sequence.
Step A2: when the amount of change of the image content is smaller than a preset threshold, acquiring the last frame in the image sequence as the target image.
Thus, the amount of change of the image content in the image sequence is determined over the duration of the user's voice input. The amount of change may be the positional offset of the objects in the image. If, during the user's voice input, the offset of each object's position is smaller than the preset threshold, the last frame in the image sequence, i.e. the image closest in time to the end of the user's voice input, is acquired and determined as the target image, which ensures that the image output to the user matches the current actual situation.
In the embodiments of the present disclosure, whether the user is interested in an image is first judged from the movement state of the smart glasses, and the target image is then determined from the user's voice information; in this way the image of interest can be determined more efficiently and accurately.
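A minimal sketch of steps A1 and A2 follows, assuming the amount of change is measured as the shift of detected object boxes; the disclosure names positional offset but fixes no formula, and the threshold value here is invented:

```python
def content_change(prev_boxes, curr_boxes):
    """Mean shift, in pixels, of object box centers between two frames.
    Matching objects by list index is a simplification; a real system would track them."""
    shifts = []
    for (x1, y1, w1, h1), (x2, y2, w2, h2) in zip(prev_boxes, curr_boxes):
        dx = (x1 + w1 / 2.0) - (x2 + w2 / 2.0)
        dy = (y1 + h1 / 2.0) - (y2 + h2 / 2.0)
        shifts.append((dx * dx + dy * dy) ** 0.5)
    return sum(shifts) / len(shifts) if shifts else 0.0

def pick_target_image(frames, boxes_per_frame, threshold=20.0):
    """Return the last frame if the content barely changed during the voice input,
    otherwise None (the caller then falls back to voice-based selection)."""
    changes = [content_change(a, b) for a, b in zip(boxes_per_frame, boxes_per_frame[1:])]
    if not changes or max(changes) < threshold:
        return frames[-1]
    return None
```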
In some embodiments, when the amount of change of the image content is greater than or equal to the preset threshold, the image content within the acquisition range has changed. Since a user generally utters voice information after spotting an object of interest, in order to ensure that the image of interest can still be shown, in the embodiments of the disclosure the image corresponding to the voice information is selected from the image sequence based on the voice information, yielding the target image.
In the embodiments of the present disclosure, when the user expresses interest in an object in the surroundings but the acquired image content has changed, so that the object of interest may no longer be in the newly acquired image, the image captured at the time the user expressed interest can be determined from the stored image sequence. The user's intent can thus be understood accurately, and the candidate object of interest displayed as far as possible.
For example, suppose the user's input voice is "What is that cat?". Because the cat is dynamic, it may run off while the user is speaking, which can cause the amount of change of the image content to be greater than or equal to the preset threshold. In this case, the images in the image sequence may be stored in the classified manner set forth above. As shown in fig. 2b, the image set 1 includes n images, and it can be seen that the kitten has already run away by the time the user stops the voice input. Here, semantic recognition can be performed on the user's voice information, the recognition result being that the user is interested in the cat. An image associated with the cat can therefore be selected from the image sequence and determined to be the target image. The semantic recognition may use natural language processing (NLP) or another semantic recognition technology, which the present disclosure does not limit.
The image sequence may be uploaded to the cloud for storage, or stored locally. When stored in the cloud, the correspondence between the serial number of the smart glasses and the serial number of the image sequence they collected can be stored; the present disclosure does not limit this.
In some embodiments, while speaking the voice information the user may continue to move or swing the head, so the focus of the user's attention is changing, and the user's intent may change with it. The content of the voice information may also change as the focus changes. It is therefore appropriate to give weight to the voice information that carries an explicit interactive intent when determining the target image of interest: when the movement state indicates that the position or pose of the smart glasses has changed, the image corresponding to the voice information is selected from the image sequence based on the voice information, yielding the target image.
In the embodiments of the present disclosure, when the position or pose of the smart glasses has changed, the focus of the user's attention may have changed as well. Determining the target image based on the voice information then allows focusing on the explicit intent expressed in the user's voice, and thereby determining the image of interest.
For example, as the user continues to move, the head may swing to look around, in which case the state of the gyroscope in the smart glasses changes. In view of this, in one possible implementation, when the movement speed of the smart glasses is detected to be below the preset threshold, images are shot at a preset time interval, and shooting stops when the user stops speaking. Semantic recognition is then performed on the user's voice information to obtain the semantic recognition result. The image sequence corresponding to the voice information can thus be selected; the images in it are examined with an image quality evaluation method, a clear image is determined as the candidate image, and the candidate image is taken as the target image if it corresponds to the voice information.
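The disclosure does not name a particular image quality evaluation method. One common sharpness heuristic is the variance of the Laplacian, sketched here with OpenCV; the threshold is an assumed value:

```python
import cv2
import numpy as np

def is_sharp(image_bgr, threshold=100.0):
    """Variance of the Laplacian as a blur score; higher means sharper."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() > threshold

def pick_clear_candidate(images):
    """Return the first sufficiently sharp image from the selected sequence, or None."""
    for img in images:
        if is_sharp(img):
            return img
    return None
```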
In some embodiments, as set forth above, the focus may change while the user speaks the voice information, and the voice information itself makes it possible to understand accurately whether the user's intent has changed. For example, when the voice information includes a single voice instruction, indicating that the user has not explicitly expressed interest in anything else, the start time point of that voice instruction can be determined, and the image whose acquisition time point is closest to that start time point is selected from the image sequence as the target image.
In the embodiments of the present disclosure, there is a time difference between the start and the end of the voice information. During this time difference, a change in the position or pose of the smart glasses may open a gap between the newly acquired image content and the image content of interest. Therefore, when the user's voice information includes only one voice instruction, the image whose shooting time is closest to the instruction's start time point is chosen, which ensures that the image containing the object mentioned in the user's voice is acquired accurately.
In some embodiments, take the user's voice information "What does the kitten like to eat?" as an example. From the first moment at which the user starts the voice input, the target moment closest to that first moment is screened from the shooting moments in the image sequence, and the image corresponding to the target moment is taken as the candidate image. The candidate image is taken as the target image if it is clear (i.e. meets the image quality requirements) and contains the user's semantic information (i.e. the cat). If the candidate image is unclear or does not contain the semantic information, an image satisfying the requirements is rescreened from the image sequence based on the user's semantic information.
In some embodiments, the user may express interest in different things within a short time; for example, the user says "Look, that kitten is really lovely", and a moment later sees a beautiful flower and says "That flower is really beautiful". The voice information then includes a plurality of voice instructions. In that case, the start time point of the last voice instruction is determined, and the image whose acquisition time point is closest to that start time point is selected from the image sequence as the target image.
In the embodiments of the present disclosure, since the user's intent can change, selecting the last voice instruction accurately captures the object the user is ultimately interested in, so that the screened target image meets the user's actual need.
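A minimal sketch of the closest-timestamp selection, assuming the image sequence stores (acquisition time, image) pairs and the start times of the voice instructions are known:

```python
def select_target_image(image_sequence, instruction_starts):
    """image_sequence: list of (acquisition_time, image) pairs, oldest first.
    instruction_starts: start time of each voice instruction in the voice information."""
    t0 = instruction_starts[-1]  # only the last instruction reflects the final intent
    acquisition_time, image = min(image_sequence, key=lambda pair: abs(pair[0] - t0))
    return image
```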
In some embodiments, consider a scenario where the user is out with companions and the captured voice is: "Where shall we eat at noon today? The XX restaurant tastes great, but not that restaurant today. Look, there is a kitten." In this voice, the earlier sentences are conversation between the user and the companions, and only the last sentence is voice information directed at the smart glasses. Based on the first moment at which the last sentence was input, the target moment closest to that first moment is screened from the shooting moments in the image sequence by matching against the acquisition time points of the images, and the image corresponding to the target moment is taken as the candidate image. If the candidate image satisfies the image quality requirement, it is taken as the target image; otherwise, rescreening is performed from the image sequence based on the user's semantic information.
Of course, in other embodiments, in order to improve the user experience on the smart glasses, that is, to accurately identify the target object of interest, an image quality evaluation is performed on each selected target image, to ensure that the output target image meets the image quality requirement.
The image quality requirement may be that the illumination and brightness of the target image are such that the smart glasses can identify the candidate objects in it.
In the embodiments of the present disclosure, the image quality of the target image is required to meet the image quality requirement, so that the smart glasses can accurately identify the target object of interest and the user can view a clear image and interact with it.
When it is determined that the image quality cannot meet the requirement, prompt information can be output to prompt the user to re-acquire the image or to search the content of another image.
In some embodiments, if the target image was captured at night and the objects in it cannot be seen clearly, the user may be prompted that "the ambient brightness is too low". In this case, the user may choose to adjust the parameter information of the smart glasses so that the target image satisfies the image quality requirement.
In the embodiments of the present disclosure, the prompt lets the user know the image quality situation, which facilitates better interaction with the user and completion of the image content search.
3) Determining the marks of candidate objects
In some embodiments, for each candidate object, the mark of the candidate object includes at least one of: the category of the candidate object, the position information of the candidate object in the target image, text information included in the candidate object, and the extension information of the candidate object.
During image capture, each image can be detected in real time to obtain the image information in it. The detection may use an object detection algorithm; as shown in fig. 3, the categories of the candidate objects in the image are tree, grass, person, cat and billboard, each dotted box is the position information of a candidate object, and the text information included in a candidate object may be the advertisement content on the billboard. When there are many categories in the target image, the extension information of a candidate object can be displayed in a popup window. For a candidate object of the tree category, the tree species may serve as its extension information; for a candidate object of the kitten category, the kitten's color, breed, preferences, feeding knowledge and the like may serve as extension information. Even links to related goods can serve as extension information.
In the embodiments of the present disclosure, the category of the candidate object, its position information in the target image, the text information it includes and its extension information are displayed in the image in the form of marks, so that the user can learn about the candidate objects in more detail and identify the object of interest on his own, which improves the user experience.
4) Determining the target object
In some embodiments, determining the target object based on a neural network model may be implemented as: inputting a voice instruction from the voice information into a target neural network model to obtain the candidate object matched with the voice instruction, and taking that candidate object as the target object.
In one embodiment, a set of voice instructions may be collected and split into a training set and a test set, with the target object matched with each voice instruction in the training set serving as the reference object. A voice instruction is input into the initial neural network model to obtain a first object; the initial neural network model is trained based on the difference information between the first object and the reference object, and the target neural network model is obtained once training converges. The target neural network model can then be tested with the test set.
In the embodiments of the present disclosure, matching the marks of candidate objects with the voice information is realized through a neural network model; since a neural network model has a strong self-learning capability, this approach is general-purpose. However, a neural network model places high demands on hardware and computing power, while smart glasses are small and have limited computing power, so determining the target object with a neural network model places high demands on the hardware of the smart glasses.
In view of this, in order to save the hardware and computing resources of the smart glasses, performing voice interaction with the user based on the marks of the candidate objects, so as to obtain the target object of interest among the candidate objects, may be implemented as follows:
Step B1: acquiring the target keyword in the voice instruction to be matched with the target object.
In practice, when the user's voice information explicitly expresses the object of interest, the voice instruction refers to that instruction in the user's voice. When the user's voice information includes one voice instruction, that instruction is taken as the voice instruction to be matched with the target object. When the user's voice information includes a plurality of voice instructions, the last one may be selected as the voice instruction to be matched.
When the user's voice information merely wakes up the image content search function without explicitly expressing an object of interest, the glasses can interact with the user after displaying the target image with the candidate objects marked, obtain a voice instruction, and determine the target keyword in it.
The target keyword can be obtained through word segmentation. The forward maximum matching method (FMM) can be used to segment the voice instruction: assuming the longest word in the segmentation dictionary has i Chinese characters (i >= 1), the first i characters of the current string in the voice instruction are taken as the candidate keyword for searching the dictionary. If the dictionary contains a word identical to the i-character candidate keyword, the match succeeds and the candidate keyword is cut out as one word. If no i-character word can be found in the dictionary, the match fails; the last character of the candidate keyword is removed, and matching is performed again on the remaining characters, and so on, until the whole voice instruction has been matched, that is, until a word is cut or the length of the remaining string is zero. On top of the segmentation, the type of each keyword can be annotated using part-of-speech tagging, which can accurately label types including nouns, place names, adjectives, verbs, numerals and the like. The candidate keywords whose part of speech is a noun or a numeral are then screened out as target keywords.
The target keyword may also be determined using named entity recognition (NER). It should be noted that any manner of recognizing keywords in a voice instruction is applicable to the embodiments of the present disclosure.
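A toy sketch of the forward maximum matching segmentation described above; the mini-dictionary is invented for illustration, and a real system would use a full lexicon followed by part-of-speech tagging:

```python
def fmm_segment(text, dictionary, max_len):
    """Forward maximum matching: repeatedly take the longest dictionary word starting
    at the current position, falling back to a single character when nothing matches."""
    words, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words

lexicon = {"那只", "小猫", "可爱"}  # toy dictionary
print(fmm_segment("那只小猫真可爱", lexicon, max_len=2))
# -> ['那只', '小猫', '真', '可爱']
```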
Step B2: matching the target keyword with the keywords in the marks of the candidate objects.
Step B3: obtaining the candidate object whose mark matches the target keyword, yielding the target object.
In some embodiments, as shown in fig. 3, the user's voice instruction is "That kitten is lovely", and the target keyword in the instruction is "kitten". Matching the target keyword against the keywords in the marks of the candidate objects yields the candidate object of category kitten in the target image, which is taken as the target object. The target object may be specially rendered, as shown in fig. 3, in the form of a double-layer dashed box; the special rendering may also mark the target object with a glowing or colored frame, so as to ensure that the user notices the target object in time. In short, any manner of highlighting the target object of interest is applicable to the embodiments of the present disclosure.
In another embodiment, the user may also use the number of a candidate object as the keyword. With the target image shown in fig. 3 and the user's voice instruction "I want to know the species of No. 4" as an example, the target keyword in the instruction is 4. Matching the target keyword against the keywords in the marks yields the candidate object numbered 4 in the target image; it is taken as the target object and specially rendered, for example displayed in a raised style, to show that it has been selected.
In the embodiments of the present disclosure, the target object is determined by matching the keywords in the voice instruction with the keywords in the marks. This enables fine-grained identification of image content in a way that is simple, convenient and efficient, with low demands on hardware and computing power.
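A minimal sketch of steps B2 and B3, assuming each mark carries a category and a number as its keywords; the field names are invented, and exact keyword containment is a simplification:

```python
def match_target_objects(target_keyword, candidates):
    """Return every candidate whose mark keywords contain the target keyword;
    more than one match triggers the disambiguation prompt described below."""
    matches = []
    for cand in candidates:
        keywords = {cand["category"], str(cand["number"])}
        if target_keyword in keywords:
            matches.append(cand)
    return matches

candidates = [
    {"number": 4, "category": "kitten"},
    {"number": 5, "category": "kitten"},
    {"number": 1, "category": "tree"},
]
print(match_target_objects("kitten", candidates))  # two matches: ask which one
print(match_target_objects("4", candidates))       # unique match by number
```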
5) Performing subsequent processing on the target object based on the user's voice instruction
In some embodiments, after the target object has been determined, an operation corresponding to an image processing instruction may be performed on the target object in response to the image processing instruction acquired through the voice interaction function.
The image processing instruction can be copy, paste, favorite, matting and the like.
For example, when the user's voice is "copy the picture of the kitten", the smart glasses treat the voice as a copy instruction and process the target image accordingly.
The voice instruction may also request the extension information of the target object, and the form in which the extension information is presented may be determined by voice. For example, if the user's input voice is "I want to see the extension information of No. 1", the extension information of No. 1 may be displayed in a popup window. As shown on the left side of fig. 4a, No. 1 is a target object of the tree category, and when the user's voice is detected, its extension information is displayed in a popup window. When "close the extension information of No. 1" is heard, the extension information is collapsed, as shown on the right side of fig. 4a. Of course, the extension information of several candidate objects may be viewed at once, or stored.
The image processing instruction may also be zoom in, zoom out, or the like. For example, when the user's voice instruction is "zoom in on the cat", the smart glasses enlarge the target object of the category cat in the target image accordingly.
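A rough sketch of dispatching recognized image processing instructions; the operation set mirrors the examples above, the handlers are stubs, and the instruction strings are assumptions:

```python
def handle_image_instruction(instruction, target_object):
    """Map a recognized voice instruction onto an image operation on the target object."""
    operations = {
        "copy":     lambda obj: print("copying %s" % obj),
        "matting":  lambda obj: print("cutting out %s" % obj),
        "zoom in":  lambda obj: print("enlarging %s" % obj),
        "favorite": lambda obj: print("saving %s to favorites" % obj),
    }
    op = operations.get(instruction)
    if op is None:
        # No image processing instruction recognized: prompt the user, as described below.
        print("no operation recognized; prompting the user")
    else:
        op(target_object)

handle_image_instruction("zoom in", "kitten No. 4")
```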
In the embodiments of the present disclosure, the target object can be processed according to the user's need; combined with voice, the target image can be processed quickly, which reduces the complexity of the user's operations and improves the user experience.
In some embodiments, when the target object has been acquired but the user's voice instruction does not include an image processing instruction, a prompt may be issued through the voice interaction function of the smart glasses; if the user responds to the prompt, the response information is obtained; and if the response information includes an image processing instruction, the corresponding operation is performed on the target object.
Taking fig. 4b as an example, if the user's voice instruction is "That kitten is lovely", the target object of category kitten is obtained. The smart glasses can prompt whether to perform image processing on the kitten. If the user responds to the prompt with "cut out the kitten", the smart glasses perform matting on the kitten; the resulting image is shown on the right side of fig. 4b. If the response is "do not process the cat", or the user does not respond to the prompt, the smart glasses abandon the image processing operation on the target object.
In some embodiments, the user may describe a class of candidate objects only roughly, so the screened target object may include a plurality of candidate objects. In that case, a prompt is issued through the voice interaction function of the smart glasses; the user's voice responding to the prompt is collected; and the final target object is selected from the plurality of candidate objects based on that voice.
For example, the user's voice information is "lovely cat", and the target image, shown in fig. 5, includes two kittens. In this case the user can be prompted: "Do you mean the kitten numbered 4 or the kitten numbered 5?"; when the user responds "the No. 4 one", the kitten numbered 4 is determined as the target object.
In the embodiment of the disclosure, the target object of interest of the user can be accurately determined based on the voice interaction mode.
In some embodiments, to ensure the user's safety in real time, displaying the marks of the candidate objects in the target image may be implemented as: displaying the target image with the marks of the candidate objects in a locally designated area of the display area of the smart glasses; and displaying the images acquired in real time by the camera of the smart glasses in the area outside that locally designated area.
In some embodiments, the target image with the marks of the candidate objects may be presented picture-in-picture: as shown in fig. 6a, the marked target image is shown in the upper-left corner of the display area of the smart glasses, while the image acquired in real time is shown on the large screen.
When the display can be split, for example when the smart glasses have two screens, the target image with the marks of the candidate objects can be displayed on the left screen of the display area, and the image acquired in real time on the right screen, as shown in fig. 6b.
In short, when the smart glasses have one screen, the target image may occupy a partial area of the screen, while the other areas display the images acquired in real time so that the user can perceive the surroundings. When the smart glasses have two or more screens, the target image may occupy one screen, with the remaining screens presenting the images acquired in real time.
In implementation, when there is a single camera, the target image can be saved, then processed and displayed by a dedicated thread, while another thread is responsible for rendering and displaying the images acquired in real time.
When there are multiple cameras, the camera that captured the target image can stop capturing, and the images captured in real time by the other cameras are displayed so that the user perceives the surroundings.
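For the single-screen layout of fig. 6a, one way to compose the picture-in-picture frame, assuming frames are numpy arrays; the scale factor and corner placement are assumptions:

```python
import numpy as np

def compose_picture_in_picture(live_frame, target_image, scale=0.25):
    """Overlay a downscaled copy of the marked target image in the top-left corner
    of the live frame. Nearest-neighbour resampling keeps the sketch dependency-free."""
    h, w = live_frame.shape[:2]
    th, tw = int(h * scale), int(w * scale)
    rows = np.arange(th) * target_image.shape[0] // th
    cols = np.arange(tw) * target_image.shape[1] // tw
    thumbnail = target_image[rows][:, cols]
    out = live_frame.copy()
    out[:th, :tw] = thumbnail
    return out
```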
In the embodiments of the present disclosure, while the user views the marks of the candidate objects in the target image, images are still acquired in real time by the camera, which ensures the user's safety on the road.
Based on the same technical concept, the embodiments of the present disclosure further provide a control device for smart glasses, including:
an image acquisition module 701, configured to acquire images in real time through a camera of the smart glasses and store the images into an image sequence;
a state processing module 702, configured to acquire the movement state of the smart glasses in response to collected voice information;
a first determining module 703, configured to select a target image of interest to the user from the image sequence based on the voice information and the movement state;
an acquisition module 704, configured to acquire the marks of the candidate objects in the target image;
a display module 705, configured to display the mark of each candidate object in the target image;
a second determining module 706, configured to perform voice interaction with the user based on the marks of the candidate objects, so as to obtain the target object of interest to the user among the candidate objects.
In some embodiments, the first determining module is configured to:
determine the amount of change of the image content in the image sequence when the movement state indicates that the smart glasses remained stationary while the voice information was collected; and
acquire the last frame in the image sequence as the target image when the amount of change of the image content is smaller than a preset threshold.
In some embodiments, the first determining module is further configured to:
select the image corresponding to the voice information from the image sequence based on the voice information when the amount of change of the image content is greater than or equal to the preset threshold, yielding the target image.
In some embodiments, the first determining module is further configured to:
select the image corresponding to the voice information from the image sequence based on the voice information when the movement state indicates that the position or pose of the smart glasses has changed, yielding the target image.
In some embodiments, the first determining module is further configured to:
determine the start time point of the voice instruction when the voice information includes one voice instruction; and
select the image whose acquisition time point is closest to the start time point of that voice instruction from the image sequence, yielding the target image.
In some embodiments, the first determining module is further configured to:
determine the start time point of the last voice instruction when the voice information includes a plurality of voice instructions; and
select the image whose acquisition time point is closest to the start time point of the last voice instruction from the image sequence, yielding the target image.
In some embodiments, for each candidate object, the mark of the candidate object includes at least one of:
the category of the candidate object, the position information of the candidate object in the target image, text information included in the candidate object, and the extension information of the candidate object.
In some embodiments, the second determining module is configured to:
acquire the target keyword in the voice instruction to be matched with the target object;
match the target keyword with the keywords in the marks of the candidate objects; and
obtain the candidate object whose mark matches the target keyword, yielding the target object.
In some embodiments, the second determining module is further configured to:
issue a prompt through the voice interaction function of the smart glasses when the target object includes a plurality of candidate objects;
collect the user's voice responding to the prompt; and
select the final target object from the plurality of candidate objects based on the user's voice.
In some embodiments, the display module is configured to:
display the target image with the marks of the candidate objects in a locally designated area of the display area of the smart glasses; and
display the images acquired in real time by the camera of the smart glasses in the area outside the locally designated area of the display area.
In some embodiments, the apparatus further comprises a processing module, configured to:
perform, in response to an image processing instruction acquired through the voice interaction function, the operation corresponding to the image processing instruction on the target object.
In some embodiments, the image quality of the target image meets the image quality requirement.
In some embodiments, the apparatus further comprises:
a prompt module, configured to output prompt information when the image quality of the target image does not meet the image quality requirement.
In some embodiments, the state processing module is further configured to, before acquiring the movement state of the smart glasses in response to the collected voice information: automatically turn on the voice interaction function when the movement speed of the smart glasses is below the speed threshold.
An embodiment of the disclosure provides a wearable device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the control method of the smart glasses described above when executing the computer program.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical solution of the present disclosure, the collection, storage, and application of the user personal information involved all comply with the provisions of the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 performs the methods and processes described above, for example, the control method of the smart glasses. For example, in some embodiments, the control method of the smart glasses may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the control method of the smart glasses described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the control method of the smart glasses by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (29)

1. A control method of smart glasses, comprising:
collecting images in real time through a camera of the smart glasses, and storing the images into an image sequence;
acquiring a movement state of the smart glasses in response to collecting voice information;
selecting a target image of interest to a user from the image sequence based on the voice information and the movement state;
obtaining marks of candidate objects in the target image;
displaying the mark of each candidate object in the target image;
performing voice interaction with the user based on the marks of the candidate objects to obtain a target object of interest to the user among the candidate objects;
wherein the selecting a target image of interest to a user from the image sequence based on the voice information and the movement state comprises:
determining an amount of change in image content in the image sequence in the case that the movement state indicates that the smart glasses remained stationary during the period of collecting the voice information;
and acquiring the last frame image in the image sequence to obtain the target image in the case that the amount of change in image content is smaller than a preset threshold.
2. The method of claim 1, further comprising:
selecting, based on the voice information, an image corresponding to the voice information from the image sequence to obtain the target image, in the case that the amount of change in image content is greater than or equal to the preset threshold.
3. The method of claim 1, further comprising:
selecting, based on the voice information, an image corresponding to the voice information from the image sequence to obtain the target image, in the case that the movement state indicates that the position or posture of the smart glasses has changed.
4. The method of claim 2 or 3, wherein the selecting an image corresponding to the voice information from the image sequence based on the voice information to obtain the target image comprises:
determining a start time point of one voice instruction in the case that the voice information includes the one voice instruction;
and selecting, from the image sequence, the image whose acquisition time point is closest to the start time point of the one voice instruction, to obtain the target image.
5. The method of claim 2 or 3, wherein the selecting an image corresponding to the voice information from the image sequence based on the voice information to obtain the target image comprises:
determining a start time point of a last voice instruction in the case that the voice information includes a plurality of voice instructions;
and selecting, from the image sequence, the image whose acquisition time point is closest to the start time point of the last voice instruction, to obtain the target image.
6. The method of any of claims 1-3, wherein, for each candidate object, the mark of the candidate object comprises at least one of:
the category of the candidate object, the position information of the candidate object in the target image, the text information included in the candidate object, and the extension information of the candidate object.
7. The method of any of claims 1-3, wherein the performing voice interaction with the user based on the marks of the candidate objects to obtain a target object of interest to the user among the candidate objects comprises:
acquiring a target keyword from the voice instruction that needs to be matched against the target object;
matching the target keyword with the keywords in the marks of the candidate objects;
and taking the candidate object corresponding to the mark matched with the target keyword as the target object.
8. The method of claim 7, further comprising:
issuing a prompt based on the voice interaction function of the smart glasses in the case that the target object includes a plurality of candidate objects;
collecting the user's voice in response to the prompt;
and selecting a final target object from the plurality of candidate objects based on the user's voice.
9. The method of any of claims 1-3, 8, wherein the displaying the mark of each candidate object in the target image comprises:
displaying the target image with the mark of each candidate object in a designated sub-area of the display area of the smart glasses; and
displaying, in the part of the display area outside the designated sub-area, the images captured in real time by the camera of the smart glasses.
10. The method of any of claims 1-3, 8, further comprising:
performing, on the target object, an operation corresponding to an image processing instruction in response to the image processing instruction being acquired based on the voice interaction function.
11. The method of any of claims 1-3, 8, wherein the image quality of the target image meets an image quality requirement.
12. The method of any of claims 1-3, 8, wherein prompt information is output in the case that the image quality of the target image does not meet an image quality requirement.
13. The method of any of claims 1-3, 8, further comprising, before the acquiring a movement state of the smart glasses in response to collecting voice information: automatically enabling a voice interaction function in the case that the moving speed of the smart glasses is below a speed threshold.
14. A control device for smart glasses, comprising:
an image acquisition module configured to collect images in real time through a camera of the smart glasses and store the images into an image sequence;
a state processing module configured to acquire a movement state of the smart glasses in response to collecting voice information;
a first determining module configured to select a target image of interest to a user from the image sequence based on the voice information and the movement state;
an acquisition module configured to obtain marks of candidate objects in the target image;
a display module configured to display the mark of each candidate object in the target image;
and a second determining module configured to perform voice interaction with the user based on the marks of the candidate objects to obtain a target object of interest to the user among the candidate objects;
wherein the first determining module is specifically configured to:
determine an amount of change in image content in the image sequence in the case that the movement state indicates that the smart glasses remained stationary during the period of collecting the voice information;
and acquire the last frame image in the image sequence to obtain the target image in the case that the amount of change in image content is smaller than a preset threshold.
15. The apparatus of claim 14, wherein the first determining module is further configured to:
select, based on the voice information, an image corresponding to the voice information from the image sequence to obtain the target image, in the case that the amount of change in image content is greater than or equal to the preset threshold.
16. The apparatus of claim 14, wherein the first determining module is further configured to:
select, based on the voice information, an image corresponding to the voice information from the image sequence to obtain the target image, in the case that the movement state indicates that the position or posture of the smart glasses has changed.
17. The apparatus of claim 15 or 16, wherein the first determining module is further configured to:
determine a start time point of one voice instruction in the case that the voice information includes the one voice instruction;
and select, from the image sequence, the image whose acquisition time point is closest to the start time point of the one voice instruction, to obtain the target image.
18. The apparatus of claim 15 or 16, wherein the first determining module is further configured to:
determine a start time point of a last voice instruction in the case that the voice information includes a plurality of voice instructions;
and select, from the image sequence, the image whose acquisition time point is closest to the start time point of the last voice instruction, to obtain the target image.
19. The apparatus of any of claims 14-16, wherein, for each candidate object, the mark of the candidate object comprises at least one of:
the category of the candidate object, the position information of the candidate object in the target image, the text information included in the candidate object, and the extension information of the candidate object.
20. The apparatus of any of claims 14-16, wherein the second determining module is configured to:
acquire a target keyword from the voice instruction that needs to be matched against the target object;
match the target keyword with the keywords in the marks of the candidate objects;
and take the candidate object corresponding to the mark matched with the target keyword as the target object.
21. The apparatus of claim 20, wherein the second determining module is further configured to:
issue a prompt based on the voice interaction function of the smart glasses in the case that the target object includes a plurality of candidate objects;
collect the user's voice in response to the prompt;
and select a final target object from the plurality of candidate objects based on the user's voice.
22. The apparatus of any of claims 14-16, 21, wherein the display module is configured to:
display the target image with the mark of each candidate object in a designated sub-area of the display area of the smart glasses; and
display, in the part of the display area outside the designated sub-area, the images captured in real time by the camera of the smart glasses.
23. The apparatus of any of claims 14-16, 21, further comprising a processing module configured to:
perform, on the target object, an operation corresponding to an image processing instruction in response to the image processing instruction being acquired based on the voice interaction function.
24. The apparatus of any of claims 14-16, 21, wherein the image quality of the target image meets an image quality requirement.
25. The apparatus of any of claims 14-16, 21, further comprising:
a prompt module configured to output prompt information in the case that the image quality of the target image does not meet an image quality requirement.
26. The apparatus of any of claims 14-16, 21, wherein, before the acquiring a movement state of the smart glasses in response to collecting voice information, the state processing module is further configured to: automatically enable a voice interaction function in the case that the moving speed of the smart glasses is below a speed threshold.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
28. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.
29. A wearable device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 13 when executing the computer program.
CN202310248297.2A 2023-03-09 2023-03-09 Control method, device and equipment of intelligent glasses and storage medium Active CN116300092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310248297.2A CN116300092B (en) 2023-03-09 2023-03-09 Control method, device and equipment of intelligent glasses and storage medium

Publications (2)

Publication Number Publication Date
CN116300092A CN116300092A (en) 2023-06-23
CN116300092B true CN116300092B (en) 2024-05-14

Family

ID=86793745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310248297.2A Active CN116300092B (en) 2023-03-09 2023-03-09 Control method, device and equipment of intelligent glasses and storage medium

Country Status (1)

Country Link
CN (1) CN116300092B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680432A (en) * 2017-06-20 2018-02-09 国网浙江平湖市供电公司 A kind of smart machine and operating method for employee's electric equipment training on operation
CN109189885A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of real-time control method and smart machine based on smart machine camera
CN110785688A (en) * 2017-04-19 2020-02-11 奇跃公司 Multi-modal task execution and text editing for wearable systems
CN112034977A (en) * 2019-06-04 2020-12-04 陈涛 Method for MR intelligent glasses content interaction, information input and recommendation technology application
CN112507799A (en) * 2020-11-13 2021-03-16 幻蝎科技(武汉)有限公司 Image identification method based on eye movement fixation point guidance, MR glasses and medium
CN113228620A (en) * 2021-03-30 2021-08-06 华为技术有限公司 Image acquisition method and related equipment
CN115510336A (en) * 2021-06-23 2022-12-23 上海博泰悦臻网络技术服务有限公司 Information processing method, information processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116300092A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
US11367434B2 (en) Electronic device, method for determining utterance intention of user thereof, and non-transitory computer-readable recording medium
US10963045B2 (en) Smart contact lens system with cognitive analysis and aid
US20230005284A1 (en) Method for training image-text matching model, computing device, and storage medium
CN111694984B (en) Video searching method, device, electronic equipment and readable storage medium
EP3665676B1 (en) Speaking classification using audio-visual data
JP2021163456A (en) Method, device, electronic apparatus, and computer storage medium for cross-modality processing
CN110991427B (en) Emotion recognition method and device for video and computer equipment
CN110446063B (en) Video cover generation method and device and electronic equipment
US10657959B2 (en) Information processing device, information processing method, and program
KR102304701B1 (en) Method and apparatus for providng response to user's voice input
CN113656582B (en) Training method of neural network model, image retrieval method, device and medium
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
EP3872652A2 (en) Method and apparatus for processing video, electronic device, medium and product
CN110992937B (en) Language off-line identification method, terminal and readable storage medium
US20230214423A1 (en) Video generation
US11455501B2 (en) Response based on hierarchical models
Someshwar et al. Implementation of virtual assistant with sign language using deep learning and TensorFlow
CN112528004A (en) Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN108197105B (en) Natural language processing method, device, storage medium and electronic equipment
KR20190118108A (en) Electronic apparatus and controlling method thereof
CN113822187A (en) Sign language translation, customer service, communication method, device and readable medium
CN113301382A (en) Video processing method, device, medium, and program product
CN116300092B (en) Control method, device and equipment of intelligent glasses and storage medium
US20240045904A1 (en) System and method of providing search and replace functionality for videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant