CN117437903A - Voice generation method, device, electronic equipment and storage medium

Voice generation method, device, electronic equipment and storage medium

Info

Publication number
CN117437903A
CN117437903A
Authority
CN
China
Prior art keywords
target image
processing algorithm
image processing
target
information
Prior art date
Legal status
Pending
Application number
CN202311437371.1A
Other languages
Chinese (zh)
Inventor
巩家兴
王明远
张健龙
谢延哲
申洋
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202311437371.1A
Publication of CN117437903A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/87 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using selection of the recognition techniques, e.g. of a classifier in a multiple classifier system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/96 - Management of image or video recognition tasks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 - Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present disclosure provides a voice generation method, a voice generation device, an electronic device, and a storage medium. Movement information of a terminal device is acquired, the movement information characterizing the current movement state of the terminal device; a target image processing algorithm for a target image is obtained according to the movement information, the target image being an image captured by the terminal device in real time; and the target image is processed based on the target image processing algorithm to generate a first prompt voice that conveys travel prompt information to a user of the terminal device. Because the movement state of the terminal device is acquired and matched to a corresponding target image processing algorithm before the first prompt voice is generated, the method avoids the poor image processing results that the movement state would otherwise cause, and thereby improves the accuracy and response speed of the prompt voice.

Description

Voice generation method, device, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the technical field of image recognition, and in particular to a voice generation method, a voice generation device, an electronic device, and a storage medium.
Background
With the rapid improvement in terminal device performance, a terminal device can now detect its surrounding environment in real time by collecting environment images and can generate prompt voices accordingly, providing an effective travel assistance function for users with travel needs, particularly visually impaired users.
However, prior-art image processing algorithms based on image recognition produce prompt voices that are inaccurate and untimely, which degrades this assistance function.
Disclosure of Invention
Embodiments of the present disclosure provide a voice generation method, a voice generation device, an electronic device, and a storage medium to address the problems of inaccurate and untimely prompt voices.
In a first aspect, an embodiment of the present disclosure provides a voice generation method, including:
acquiring movement information of a terminal device, the movement information characterizing the current movement state of the terminal device; obtaining a target image processing algorithm for a target image according to the movement information, the target image being an image captured by the terminal device in real time; and processing the target image based on the target image processing algorithm to generate a first prompt voice, the first prompt voice conveying travel prompt information to a user of the terminal device.
In a second aspect, an embodiment of the present disclosure provides a voice generation device, including:
an interaction module, configured to acquire movement information of a terminal device, the movement information characterizing the current movement state of the terminal device;
a processing module, configured to obtain a target image processing algorithm for a target image according to the movement information, the target image being an image captured by the terminal device in real time;
and an execution module, configured to process the target image based on the target image processing algorithm and generate a first prompt voice, the first prompt voice conveying travel prompt information to a user of the terminal device.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor and a memory;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory, causing the at least one processor to perform the speech generating method as described above in the first aspect and the various possible designs of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the voice generation method described in the first aspect and its various possible designs.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the voice generation method described in the first aspect and its various possible designs.
According to the voice generation method, voice generation device, electronic device, and storage medium provided by these embodiments, movement information of the terminal device is acquired, the movement information characterizing the current movement state of the terminal device; a target image processing algorithm for a target image is obtained according to the movement information, the target image being an image captured by the terminal device in real time; and the target image is processed based on the target image processing algorithm to generate a first prompt voice conveying travel prompt information to a user of the terminal device. By matching the movement state of the terminal device to a corresponding target image processing algorithm before generating the first prompt voice, the method avoids the poor image processing results that the movement state would otherwise cause, and thereby improves the accuracy and response speed of the prompt voice.
Drawings
To illustrate the embodiments of the present disclosure or prior-art solutions more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely some embodiments of the present disclosure; other drawings may be derived from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is an application scenario diagram of a speech generation method provided in an embodiment of the present disclosure;
fig. 2 is a flowchart of a speech generating method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a target image processing algorithm according to an embodiment of the disclosure;
FIG. 4 is a flowchart of a specific implementation of step S102 in the embodiment shown in FIG. 2;
FIG. 5 is a flow chart of one possible implementation of step S1022 in the embodiment shown in FIG. 4;
FIG. 6 is a schematic diagram of another determination target image processing algorithm provided by an embodiment of the present disclosure;
FIG. 7 is a flow chart of another possible implementation of step S1022 in the embodiment shown in FIG. 4;
FIG. 8 is a flowchart of a specific implementation of step S103 in the embodiment shown in FIG. 2;
Fig. 9 is a second flowchart of a speech generating method according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of an interaction process for playing a first prompt voice according to an embodiment of the disclosure;
FIG. 11 is a block diagram of a speech generating device according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure;
fig. 13 is a schematic hardware structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are plainly only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without inventive effort fall within the scope of this disclosure.
It should be noted that the user information (including, but not limited to, user equipment information and user personal information) and data (including, but not limited to, data used for analysis, stored data, and presented data) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries must be provided for the user to grant or refuse authorization.
The application scenario of the embodiments of the present disclosure is explained below:
Fig. 1 is an application scenario diagram of a voice generation method provided by an embodiment of the present disclosure. The voice generation method may be applied in applications that provide travel assistance or act as an intelligent travel assistant, and more specifically in scenarios such as walking navigation. The executing entity of this embodiment may be a terminal device running such an application with an image processing algorithm, a server hosting the application's backend, or another electronic device with a similar role. Referring to Fig. 1 and taking a terminal device as an example, the application running on the terminal device may be a program designed specifically for visually impaired users, or an application intended for any user with travel needs. The terminal device may be a handheld device such as a smartphone, or a wearable device such as the smart glasses 1 shown in the figure. The smart glasses 1 are provided with a camera unit 11, through which they collect an environment image of the current surroundings; the environment image is processed by the method of this embodiment to generate and play a prompt voice whose content is, for example, "watch for oncoming traffic at the intersection ahead" as shown in the figure. The user wearing the smart glasses 1 is thus prompted, realizing a travel assistance function of voice guidance and voice reminders during travel.
In the prior art, an image processing algorithm based on image recognition performs image recognition on the environment image collected by the terminal device and then carries out subsequent processing based on the recognition result, finally generating a prompt voice matching the content of the environment image. However, the inventors found in practice that the movement state of the terminal device affects the execution of the image processing algorithm, that is, the accuracy and timeliness of the prompt voice. When the terminal device moves quickly or jolts, the quality of the environment image it collects suffers; because the prior art still applies the same processing algorithm to such degraded images, the image recognition results become inaccurate and untimely, which in turn harms the accuracy and timeliness of the prompt voice.
Embodiments of the present disclosure provide a speech generation method to solve the above-described problems.
Referring to Fig. 2, Fig. 2 is a first flowchart of a voice generation method provided by an embodiment of the present disclosure. The method of this embodiment can be applied to a terminal device, and includes the following steps:
step S101: and acquiring the mobile information of the terminal equipment, wherein the mobile information is used for representing the current mobile state of the terminal equipment.
Step S102: and obtaining a target image processing algorithm aiming at a target image according to the movement information, wherein the target image is an image shot by the terminal equipment in real time.
For example, referring to the application scenario shown in Fig. 1, the terminal device is, for example, smart glasses worn by a user, and the movement state characterizes whether the spatial position of the terminal device is changing. The movement state includes at least two states, a stationary state and a non-stationary state: in the stationary state the spatial position of the terminal device does not change, whereas in the non-stationary state it does. The movement state may further include intermediate states between stationary and non-stationary. In one possible implementation, the movement information is represented by a normalized floating-point value: movement information Q = 0 indicates that the terminal device is stationary, Q = 1 indicates the most intense movement state, and values between 0 and 1 represent the corresponding degree of movement intensity.
In one possible implementation, a sensor for detecting displacement of the terminal device, such as a gyroscope or an acceleration sensor, is provided inside the terminal device, and real-time movement information is obtained through this built-in sensor. In other possible implementations, the terminal device may obtain the movement information by communicating with other devices, for example a smart watch or smartphone carried by the user. Alternatively, the terminal device may be a detachable unit of, for example, smart glasses; when detached from the wearable device to perform an independent shooting function, it may obtain the movement information through a sensor provided on the wearable device. The way the movement information is obtained may be chosen as needed and is not further exemplified here.
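For illustration only, the following minimal Python sketch shows how a normalized movement value Q might be fused from built-in sensor readings; the sensor interface, the full-scale constants, and the max-based fusion are assumptions of this sketch, not details given in the disclosure:
    import math

    def movement_info(linear_accel_xyz, gyro_xyz, max_accel=20.0, max_gyro=10.0):
        """Fuse accelerometer and gyroscope readings into one normalized
        movement value Q in [0, 1]: 0 for stationary, 1 for the most
        intense movement. linear_accel_xyz is assumed gravity-free."""
        accel = math.sqrt(sum(a * a for a in linear_accel_xyz))  # m/s^2
        gyro = math.sqrt(sum(g * g for g in gyro_xyz))           # rad/s
        # Normalize each channel against an assumed full-scale value
        # and let the stronger channel dominate.
        q = max(accel / max_accel, gyro / max_gyro)
        return min(q, 1.0)

    # Example: a gentle walk yields a small Q
    q = movement_info((0.4, 0.1, 0.2), (0.3, 0.1, 0.0))  # about 0.03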
After the terminal device obtains the movement information, it obtains a target image processing algorithm matched with that information to process the target image currently being shot in real time. Since the image characteristics of the target image are influenced by the current movement state of the terminal device, there is a mapping between image characteristics and movement state; the target image processing algorithm obtained from the movement information is therefore an image processing algorithm suited to processing that target image. This step effectively uses the movement state of the terminal device as a proxy for the image characteristics of the target image and selects the optimal processing algorithm (the target image processing algorithm) for it, thereby improving image processing quality and efficiency.
In one possible implementation, there is a fixed mapping between the movement information and the target image processing algorithm, for example a preset mapping that maps different values of the movement information to corresponding target image processing algorithms. Fig. 3 is a schematic diagram of determining a target image processing algorithm provided by an embodiment of the present disclosure. Referring to Fig. 3, when movement information Q = 0, the corresponding target image processing algorithm is algorithm A1; when 0 < Q <= M, it is algorithm A2; when M < Q <= N, it is algorithm A3; and when Q > N, it is algorithm A4.
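Expressed as code, the fixed mapping of Fig. 3 might look like the sketch below; the threshold values M and N and the algorithm identifiers are placeholders, not values the disclosure specifies:
    def select_algorithm(q, m=0.3, n=0.7):
        """Map normalized movement information Q to a target image
        processing algorithm, mirroring the bands of Fig. 3."""
        if q == 0:
            return "A1"  # stationary state
        if q <= m:
            return "A2"  # slight movement
        if q <= n:
            return "A3"  # moderate movement
        return "A4"      # intense movement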
In another possible implementation manner, as shown in fig. 4, the specific implementation manner of step S102 includes:
step S1021: and acquiring scene information corresponding to the target image, wherein the scene information is used for representing a travel scene corresponding to the image content of the target image.
Step S1022: and obtaining a target image processing algorithm according to the scene information and the movement information.
For example, referring to the application scenario diagram of Fig. 1, the terminal device shoots the current environment in real time (records video) through the camera unit, thereby obtaining a target image. The target image may be a video frame sampled from the captured video, or every frame of the video. The terminal device then recognizes the target image to obtain scene information characterizing the travel scene to which the image content corresponds. For example, the travel scene represented by scene information A is an in-bus scene, meaning the terminal device (and user) is currently inside a bus; the travel scene represented by scene information B is a neighborhood scene, meaning the terminal device (and user) is currently within a residential neighborhood. The type of the current environment can be determined from this travel scene, and a more accurate image processing algorithm can then be determined from the environment type together with the movement information. Obtaining the scene information can be realized through a pre-trained image classification model, whose specific implementation is not repeated here.
Characterizing image quality by the movement information of the terminal device alone is inaccurate in some special situations: for example, when the user rides in a vehicle, the terminal device moves quickly, yet the quality of images shot inside the vehicle is not significantly reduced. This step corrects such cases by acquiring and incorporating the scene information of the image, avoiding poor algorithm selection in these special environment categories and improving the soundness of the resulting target image processing algorithm.
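One plausible realization of this correction is a lookup keyed on both the scene and the movement state, sketched below; the scene labels, the classifier interface, and the table contents are assumptions for illustration:
    ALGORITHM_TABLE = {
        # (scene label, movement bucket) -> algorithm identifier
        ("in_vehicle", "fast"): "A1",  # fast, but in-cabin images stay usable
        ("street",     "fast"): "A4",  # fast outdoor movement degrades images
        ("street",     "slow"): "A1",
    }

    def select_with_scene(target_image, q, classify_scene):
        """classify_scene stands in for the pre-trained image
        classification model mentioned above."""
        scene = classify_scene(target_image)
        bucket = "fast" if q > 0.5 else "slow"  # illustrative threshold
        return ALGORITHM_TABLE.get((scene, bucket), "A1")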
Further, as shown in fig. 5, in one possible implementation manner, the specific implementation manner of step S1022 includes:
step S1022A: obtaining a corresponding objective function according to the scene information;
step S1022B: and configuring an objective function based on the movement information to obtain an objective image processing algorithm.
In one possible implementation, the objective function implementing the desired target function is first obtained from the scene information; more specifically, for example, a function func_1() for "recognizing road signs in an image", or a function func_2() for "predicting the travel path of a vehicle in an image". The corresponding parameters are then obtained from the movement information and used to configure the objective function: for example, when the movement information = 0 the corresponding image pooling parameter is para_1, and when the movement information = 0.5 it is para_2. The pooling parameter controls the pooling processing applied to the image; writing it into the objective function fixes the function's concrete implementation, yielding the target image processing algorithm.
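The configuration of steps S1022A and S1022B might be sketched as follows; func_1 is a stub for the objective function named above, and the pooling values standing in for para_1 and para_2 are invented for illustration:
    from functools import partial

    def func_1(image, pool_size):
        """Stub for the 'recognize road signs in an image' objective
        function; pool_size controls the pooling step before recognition."""
        return image  # real recognition logic omitted

    def pooling_param(q):
        # Movement information -> pooling parameter (para_1 vs. para_2).
        return 2 if q < 0.5 else 4  # coarser pooling under stronger movement

    def build_algorithm(q):
        # Writing the parameter into the objective function fixes its
        # concrete implementation; the configured callable is the
        # target image processing algorithm.
        return partial(func_1, pool_size=pooling_param(q))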
Fig. 6 is a schematic diagram of another way of determining a target image processing algorithm provided by an embodiment of the present disclosure. As shown in Fig. 6, scene information info_1 is first obtained by processing the target image, and an objective function func_1 suited to the travel scene corresponding to info_1 is then obtained; concretely, the travel scene corresponding to info_1 is, for example, a street-block environment, and the objective function func_1 it uses implements the function of predicting the travel path of vehicles in the image. Meanwhile, based on the movement information info_2, the execution parameter para_1 of the objective function is configured in combination with func_1; the parameter para_1 is used, for example, to set the image sharpening coefficient in func_1. The target image processing algorithm is then generated from the configured objective function func_1(para_1, ...), and the subsequent steps are executed.
In this step, the movement information determines the target parameter matched to the objective function obtained from the scene information, so that the influence of the movement state is taken into account when the objective function executes. This improves the function's performance and lets the final target image processing algorithm achieve a better processing effect.
As shown in fig. 7, in another possible implementation manner, the target image processing algorithm includes at least two sub-processing algorithms, and the specific implementation manner of step S1022 includes:
step S1022C: determining a target function according to the scene information;
step S1022D: determining at least two sub-processing algorithms for realizing the target function according to the movement information;
step S1022E: and generating a target image processing algorithm according to the at least two sub-processing algorithms.
Illustratively, in this implementation the target image processing algorithm is composed of at least two sub-processing algorithms arranged in order, each corresponding to one processing flow node of the algorithm. When the movement information characterizing the current movement state differs, the efficiency and effect of the target image processing algorithm can be improved by changing the number and content of its flow nodes. For example, when the current movement speed of the terminal device indicated by the movement information is below a first speed threshold, a single image pooling step (one sub-processing algorithm) is used; when the movement speed is at or above the first speed threshold, an image pooling step plus an image denoising step (two sub-processing algorithms) are used. The mapping between the movement information and the sub-processing algorithms realizing the target function may be preset and is not limited here. The sub-processing algorithms thus obtained are then combined in order to generate the target image processing algorithm.
In this step, the sub-processing algorithms realizing the target function are determined from the movement information, so the implementation of the target image processing algorithm is adjusted dynamically: its processing steps can be reduced (for example, when image quality is good) or increased (for example, when image quality is poor), improving the algorithm's performance and efficiency.
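A minimal sketch of this dynamic assembly, with stub sub-processing algorithms and an invented threshold:
    FIRST_SPEED_THRESHOLD = 0.5  # illustrative value

    def pool(image):
        return image  # stub for the image pooling sub-algorithm

    def denoise(image):
        return image  # stub for the image denoising sub-algorithm

    def build_pipeline(q):
        """Compose the ordered sub-processing algorithms (flow nodes)
        that make up the target image processing algorithm."""
        steps = [pool]
        if q >= FIRST_SPEED_THRESHOLD:
            steps.append(denoise)  # extra node when movement degrades quality

        def run(image):
            for step in steps:
                image = step(image)
            return image

        return run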
Step S103: and processing the target image based on a target image processing algorithm, and generating first prompt voice for representing travel prompt information of a user using the terminal equipment.
Having obtained the target image processing algorithm in the preceding steps, the terminal device feeds the target image into it as an input parameter, processing the image until, based on the processing result, a first prompt voice conveying travel prompt information to the user of the terminal device is generated. The target image processing algorithm may output the first prompt voice directly after processing the target image, or it may first generate intermediate information and then perform voice conversion on that information to produce the first prompt voice.
Referring to fig. 8, in one possible implementation manner, the specific implementation manner of step S103 includes:
step S1031: and acquiring image features of the target image based on a target image processing algorithm, wherein the image features correspond to the target image processing algorithm.
Step S1032: based on the image features, image semantic text is obtained that characterizes the image content in the target image.
Step S1033: and generating a first prompt voice according to the image semantic text.
For example, the target image is first processed by the target image processing algorithm to obtain its image features. The image features may take various forms: a pixel map produced after identifying target objects in the image, indicating those objects (such as vehicles, pedestrians, and roads); a feature matrix; or data in some other array form. The image features correspond to the target image processing algorithm in that they are its processing result; for example, the target objects indicated in the features are the objects (vehicles) that must be recognized for the target function (such as predicting the travel path of vehicles in the image) corresponding to the algorithm.
Semantic conversion is then performed on the image features to obtain an image semantic text, which may consist of natural-language text describing the image content. Specifically, semantically converting the target objects indicated by the image features yields text describing those objects, i.e., the image semantic text; this can be realized by the terminal device accessing a language model deployed on a server, and the specific process is not repeated here. The first prompt voice is then obtained by converting the image semantic text into speech, for example with a speech generation engine. Finally, the terminal device may play the first prompt voice directly, play it after receiving a play command from the user, or send it to another electronic device in communication with the terminal for playback; this may be configured as needed and is not further exemplified here.
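Taken together, steps S1031 to S1033 form the pipeline sketched below; the describe and synthesize interfaces are invented stand-ins for the server-side language model and the speech generation engine, neither of which the disclosure names:
    def generate_first_prompt_voice(target_image, algorithm,
                                    language_model, tts_engine):
        features = algorithm(target_image)                 # S1031: image features
        semantic_text = language_model.describe(features)  # S1032: semantic text
        return tts_engine.synthesize(semantic_text)        # S1033: prompt voice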
In this embodiment of the disclosure, movement information of the terminal device is acquired, the movement information characterizing the current movement state of the terminal device; a target image processing algorithm for a target image is obtained according to the movement information, the target image being an image captured by the terminal device in real time; and the target image is processed based on the target image processing algorithm to generate a first prompt voice conveying travel prompt information to a user of the terminal device. By matching the movement state of the terminal device to a corresponding target image processing algorithm before generating the first prompt voice, the method avoids the poor image processing results that the movement state would otherwise cause, and thereby improves the accuracy and response speed of the prompt voice.
Referring to Fig. 9, Fig. 9 is a second flowchart of a voice generation method provided by an embodiment of the present disclosure. On the basis of the embodiment shown in Fig. 2, this embodiment further refines step S102 and adds human-machine interaction steps. The voice generation method includes:
step S201: and acquiring the mobile information of the terminal equipment, wherein the mobile information is used for representing the current mobile state of the terminal equipment.
Step S202: and playing a second prompt voice according to the movement information, wherein the second prompt voice is used for representing at least one parameter to be selected aiming at the target image processing algorithm.
Step S203: and responding to a first instruction input by a user, and obtaining a target parameter from at least one parameter to be selected.
After obtaining the movement information, the terminal device generates and plays a second prompt voice to the user according to the current movement state, providing at least one candidate parameter for the target image processing algorithm for the user to select or confirm. For example, when the movement speed indicated by the movement state exceeds v1, candidate parameters para_1 and para_2 are generated, along with a second prompt voice announcing them; more specifically, the content of the second prompt voice is, for example: "Please select an image recognition mode: 1. high-precision mode; 2. high real-time mode." Here "high-precision mode" corresponds to candidate parameter para_1 and "high real-time mode" to para_2. Alternatively, only candidate parameter para_3 is generated, and the second prompt voice says: "Turn on in-vehicle mode?", where "in-vehicle mode" corresponds to para_3.
The user then responds to the content of the second prompt voice by inputting a first instruction to the terminal device, for example by voice or key input, and the terminal device takes the candidate parameter indicated by the first instruction as the target parameter. In other possible implementations, if no first instruction is received from the user, the target parameter is determined from the candidate parameters randomly or by default.
The target parameter in this embodiment is used to configure the objective function and thereby obtain the target image processing algorithm realizing the target function. In one possible implementation, the target function is preset; that is, this embodiment is a processing method for a specified function running on the terminal device, more specifically, for example, "detecting surrounding vehicles". Realizing this function requires at least a target parameter such as "high real-time mode" or "high-precision mode". Because the target function is preset, once the movement information is obtained, the corresponding candidate parameters for realizing the function can be derived from it, the target parameter obtained, and the subsequent algorithm configuration step executed.
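The candidate-parameter interaction of steps S202 and S203 might be sketched as follows; the speed bucketing, the menu keys, and the default choice are assumptions for illustration:
    def candidate_params(q, v1=0.5):
        """Offer candidate parameters for the preset target function
        according to the movement state."""
        if q > v1:
            return {"1": "para_1",   # high-precision mode
                    "2": "para_2"}   # high real-time mode
        return {"1": "para_3"}       # in-vehicle mode

    def resolve_target_param(candidates, first_instruction):
        # Fall back to the first candidate when no instruction arrives.
        default = next(iter(candidates.values()))
        return candidates.get(first_instruction, default)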
Step S204: and obtaining a corresponding objective function according to the movement information.
Step S205: and configuring an objective function based on the objective parameter to obtain an objective image processing algorithm.
Meanwhile, the corresponding objective function is obtained from the movement information. As described above, the target function of this embodiment is preset, for example "detecting surrounding vehicles", but it can be realized in multiple ways: changing the objective function and its parameters yields different image processing algorithms and hence different realizations of the target function. Specifically, a processing function matched with the movement information, i.e., the target processing function, is determined from the movement information. More specifically, when the movement speed indicated by the movement information exceeds a speed threshold, function func_1 is used as the objective function for processing the target image; when it is below the threshold, function func_2 is used.
The objective function is then configured with the target parameter obtained in the previous step, i.e., the target parameter becomes an input parameter of the objective function, yielding the target image processing algorithm. This process was described in the previous embodiments and is not repeated here.
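Combined, steps S204 and S205 might read as follows; func_1 and func_2 are stubs for the two realizations of "detecting surrounding vehicles", and the threshold is invented:
    from functools import partial

    SPEED_THRESHOLD = 0.5  # illustrative value

    def func_1(image, mode):
        return image  # stub: objective function used under fast movement

    def func_2(image, mode):
        return image  # stub: objective function used under slow movement

    def build_target_algorithm(q, target_param):
        func = func_1 if q > SPEED_THRESHOLD else func_2  # S204
        return partial(func, mode=target_param)           # S205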
Step S206: and playing a third prompt voice, wherein the third prompt voice is used for indicating at least one target operation aiming at the terminal equipment, and the target operation is used for setting at least one equipment parameter of the terminal equipment related to a target image processing algorithm to be in a target state matched with the mobile information.
Step S207: in response to a target operation applied by a user, a target device parameter of the terminal device is set to a target state.
Further, since the target image processing algorithm is executed by the terminal device, which generates and plays the first prompt voice, the device's own settings can affect the algorithm's execution. For example, depending on the specific implementation of the target image processing algorithm, whether the resources allocated by the terminal device match the algorithm can affect how promptly the first prompt voice is generated. In this embodiment, after the target image processing algorithm is obtained, the terminal device plays a third prompt voice indicating at least one target operation on the terminal device, prompting and guiding the user to set device parameters so that they match the algorithm about to be executed. Specifically, the content of the third prompt voice is, for example: "The upcoming computation requires more resources; turn off power-saving mode?" or "Please turn on the satellite communication function." The "power-saving mode" and "satellite communication function" in the third prompt voice are the device parameters related to the target image processing algorithm, i.e., the target device parameters. Then, in response to the target operation applied by the user, the target device parameter is set to the target state, for example the terminal device's power-saving mode is turned off.
In this step, after the target image processing algorithm is determined, the third prompt voice guides the user to set the device parameters of the terminal device so that they match the requirements of the algorithm about to run, improving the algorithm's execution.
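The third-prompt interaction of steps S206 and S207 might be sketched as below; the parameter names and the play/listen callbacks are hypothetical:
    device_params = {"power_saving": True, "satellite_comm": False}

    def run_third_prompt(play_voice, await_user):
        """Play the third prompt voice, then set the target device
        parameter to the target state if the user confirms."""
        play_voice("The upcoming computation requires more resources; "
                   "turn off power-saving mode?")
        if await_user() == "yes":
            device_params["power_saving"] = False  # target state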
Fig. 10 is a schematic diagram of an interaction process for playing the first prompt voice provided by an embodiment of the present disclosure; the process is further described below with reference to it. As shown in Fig. 10, after the terminal device starts the target function "detect surrounding vehicles", it first obtains the current real-time movement information and, based on it, plays the second prompt voice "Please select an image recognition mode: 1. high-precision mode; 2. high real-time mode." The user then inputs the instruction "high-precision mode" by voice. The terminal device determines from this instruction the target parameter, namely the function parameter para_1 corresponding to "high-precision mode", which is applied to the objective function func_1() realizing the target function. The target image processing algorithm algorithm_1 is then generated from para_1 and func_1(); it contains at least the configured objective function func_1(para_1, ...). Next, the terminal device plays a third prompt voice generated from algorithm_1 and indicating the target operation, with content such as "Turn on high-performance mode?". After receiving the user command "on", the corresponding device parameter of the terminal is set to the on state. The target image processing algorithm algorithm_1 is then executed to process the target image, finally generating and playing the first prompt voice, whose content is, for example, "vehicle 1.2 meters away at your front right" as shown in the figure. The above steps may then be repeated cyclically, for example to continuously detect vehicles around the user as a travel assistance function.
Step S208: and processing the target image based on a target image processing algorithm to generate a first prompt voice.
Optionally, before playing the first prompt voice, the method further includes:
step S209: and obtaining the target playing speed according to the moving information.
Step S210: and playing the first prompt voice based on the target playing rate.
Illustratively, in a user's travel scenario, the effectiveness of the information prompts the terminal device delivers by playing the first prompt voice is affected by the movement speed of the user (and terminal device). For example, when the travel prompt information carried by the first prompt voice warns a visually impaired user of an obstacle (a vehicle, traffic, etc.) and the user is moving quickly, playing the first prompt voice at a fixed rate makes the prompt insufficiently timely to serve as a warning. This embodiment therefore sets the target playback rate of the first prompt voice dynamically according to, for example, the user's walking speed (the movement information): when the walking speed exceeds a preset speed threshold, the first prompt voice is played at a higher target rate; when it is below the threshold, at a lower target rate. The playback rate of the first prompt voice is thus matched to the movement state of the terminal device, improving the timeliness of playback while also helping device power consumption. The target playback rate may be the speech rate of the first prompt voice, or the interval between successive first prompt voices, i.e., their playback frequency, and may be set as needed.
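Steps S209 and S210 might be sketched as follows; the threshold, the rate values, and the player interface are assumptions of this sketch:
    def target_play_rate(q, speed_threshold=0.5, slow_rate=1.0, fast_rate=1.5):
        """Derive the playback rate for the first prompt voice from the
        movement information."""
        return fast_rate if q > speed_threshold else slow_rate

    def play_first_prompt(voice, q, player):
        player.play(voice, rate=target_play_rate(q))  # hypothetical player API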
In this embodiment, steps S201 and S208 are implemented in the same way as steps S101 and S103 of the embodiment shown in Fig. 2 of the present disclosure, and are not described in detail here.
Corresponding to the voice generation method of the above embodiments, Fig. 11 is a structural block diagram of the voice generation device provided by an embodiment of the present disclosure. For ease of illustration, only the portions relevant to the embodiments of the present disclosure are shown. Referring to Fig. 11, the voice generation device 3 includes:
the interaction module 31 is configured to obtain movement information of the terminal device, where the movement information is used to characterize a current movement state of the terminal device;
the processing module 32 is configured to obtain a target image processing algorithm for a target image according to the movement information, where the target image is an image captured by the terminal device in real time;
the execution module 33 is configured to process the target image based on a target image processing algorithm, and generate a first prompt voice, where the first prompt voice is used to characterize travel prompt information for a user using the terminal device.
In one embodiment of the present disclosure, the interaction module 31 is further configured to: after the movement information of the terminal device is acquired, play a second prompt voice according to the movement information, the second prompt voice presenting at least one candidate parameter for the target image processing algorithm; and, in response to a first instruction input by the user, obtain a target parameter from the at least one candidate parameter. When obtaining the target image processing algorithm for the target image according to the movement information, the processing module 32 is specifically configured to: obtain the target image processing algorithm based on the movement information and the target parameter.
In one embodiment of the present disclosure, when obtaining the target image processing algorithm based on the movement information and the target parameter, the processing module 32 is specifically configured to: obtain a corresponding objective function according to the movement information; and configure the objective function based on the target parameter to obtain the target image processing algorithm.
In one embodiment of the present disclosure, the interaction module 31 is further configured to, before the target image is processed based on the target image processing algorithm to generate the first prompt voice: play a third prompt voice, the third prompt voice indicating at least one target operation on the terminal device, the target operation setting at least one device parameter of the terminal device related to the target image processing algorithm to a target state matched with the movement information.
In one embodiment of the present disclosure, the interaction module 31 is further configured to: obtain a target playback rate according to the movement information; and play the first prompt voice at the target playback rate.
In one embodiment of the present disclosure, the processing module 32 is specifically configured to: acquire scene information corresponding to the target image, the scene information characterizing the travel scene to which the image content of the target image corresponds; and obtain the target image processing algorithm according to the scene information and the movement information.
In one embodiment of the present disclosure, when obtaining the target image processing algorithm according to the scene information and the movement information, the processing module 32 is specifically configured to: obtain a corresponding objective function according to the scene information; and configure the objective function based on the movement information to obtain the target image processing algorithm.
In one embodiment of the present disclosure, the target image processing algorithm includes at least two sub-processing algorithms, and, when obtaining the target image processing algorithm according to the scene information and the movement information, the processing module 32 is specifically configured to: determine a target function according to the scene information; determine, according to the movement information, at least two sub-processing algorithms for realizing the target function; and generate the target image processing algorithm from the at least two sub-processing algorithms.
In one embodiment of the present disclosure, the execution module 33 is specifically configured to: acquire image features of the target image based on the target image processing algorithm, the image features corresponding to that algorithm; obtain, based on the image features, an image semantic text characterizing the image content of the target image; and generate the first prompt voice from the image semantic text.
The interaction module 31, the processing module 32, and the execution module 33 are connected in sequence. The voice generation device 3 provided in this embodiment can execute the technical solution of the foregoing method embodiments; its implementation principle and technical effects are similar and are not described here again.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, as shown in fig. 12, the electronic device 4 includes:
a processor 41 and a memory 42 communicatively connected to the processor 41;
memory 42 stores computer-executable instructions;
the processor 41 executes the computer-executable instructions stored in the memory 42 to implement the voice generation method of the embodiments shown in Figs. 2 to 10.
Wherein optionally the processor 41 and the memory 42 are connected by a bus 43.
The relevant descriptions and effects corresponding to the steps in the embodiments of Figs. 2 to 10 can be understood accordingly and are not described in detail here.
Embodiments of the present disclosure provide a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the instructions implement the voice generation method provided by any of the embodiments corresponding to Figs. 2 to 10 of the present disclosure.
To realize the above embodiments, an embodiment of the present disclosure further provides an electronic device.
Referring to fig. 13, there is shown a schematic structural diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure, where the electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (Personal Digital Assistant, PDA for short), a tablet (Portable Android Device, PAD for short), a portable multimedia player (Portable Media Player, PMP for short), an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 13 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 13, the electronic apparatus 900 may include a processing device (e.g., a central processor, a graphics processor, or the like) 901, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage device 908 into a random access Memory (Random Access Memory, RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
In general, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 907 including, for example, a liquid crystal display (Liquid Crystal Display, LCD for short), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. While fig. 13 shows an electronic device 900 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When executed by the processing device 901, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above-described embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network, LAN for short) or a wide area network (Wide Area Network, WAN for short), or it may be connected to an external computer (e.g., via the internet using an internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of a unit does not in any way constitute a limitation of the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a speech generation method, including:
acquiring movement information of a terminal device, wherein the movement information is used to represent the current movement state of the terminal device; obtaining a target image processing algorithm for a target image according to the movement information, wherein the target image is an image captured by the terminal device in real time; and processing the target image based on the target image processing algorithm to generate a first prompt voice, wherein the first prompt voice is used to represent travel prompt information for a user of the terminal device.
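To make the three steps of the first aspect concrete, the following minimal Python sketch wires them together. It is an illustration only: every name (MovementInfo, select_algorithm, generate_prompt_voice) and the walking/cycling distinction are assumptions, not part of the disclosure.

from dataclasses import dataclass

@dataclass
class MovementInfo:
    speed_mps: float  # current speed of the terminal device
    mode: str         # hypothetical movement state label, e.g. "walking" or "cycling"

def select_algorithm(movement: MovementInfo):
    # Step 2: choose a target image processing algorithm from the movement information.
    if movement.mode == "walking":
        return lambda image: f"near-field obstacle summary for {image}"
    return lambda image: f"wide-range scene summary for {image}"

def generate_prompt_voice(movement: MovementInfo, target_image: str) -> str:
    algorithm = select_algorithm(movement)
    travel_info = algorithm(target_image)  # step 3: process the real-time image
    return f"TTS({travel_info})"           # stand-in for actual speech synthesis

print(generate_prompt_voice(MovementInfo(speed_mps=1.4, mode="walking"), "frame_001"))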
According to one or more embodiments of the present disclosure, after the acquiring the movement information of the terminal device, the method further includes: playing a second prompt voice according to the movement information, wherein the second prompt voice is used to represent at least one candidate parameter for the target image processing algorithm; and in response to a first instruction input by the user, obtaining a target parameter from the at least one candidate parameter; and the obtaining a target image processing algorithm for a target image according to the movement information includes: obtaining the target image processing algorithm based on the movement information and the target parameter.
According to one or more embodiments of the present disclosure, the obtaining a target image processing algorithm based on the movement information and the target parameter includes: obtaining a corresponding target function according to the movement information; and configuring the target function based on the target parameter to obtain the target image processing algorithm.
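A minimal sketch of this two-stage construction, assuming a hypothetical mapping from movement states to candidate parameters; the candidate lists, names, and the "detection range" parameter are illustrative, not from the disclosure:

CANDIDATES_BY_MODE = {  # assumed mapping from movement state to candidate parameters
    "walking": ["short_range", "long_range"],
    "cycling": ["long_range"],
}

def play_second_prompt_voice(mode: str) -> list[str]:
    candidates = CANDIDATES_BY_MODE.get(mode, ["default"])
    print(f"TTS: available detection ranges: {', '.join(candidates)}")
    return candidates

def build_algorithm(mode: str, target_parameter: str):
    # First obtain a target function corresponding to the movement information ...
    def target_function(image: str, detection_range: str) -> str:
        return f"{mode} analysis of {image} at {detection_range}"
    # ... then configure it with the user-selected target parameter.
    return lambda image: target_function(image, target_parameter)

candidates = play_second_prompt_voice("walking")
algorithm = build_algorithm("walking", candidates[0])  # pretend the user chose the first option
print(algorithm("frame_002"))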
According to one or more embodiments of the present disclosure, before the processing the target image based on the target image processing algorithm to generate a first prompt voice, the method further includes: playing a third prompt voice, wherein the third prompt voice is used to indicate at least one target operation on the terminal device, and the target operation is used to set at least one device parameter of the terminal device related to the target image processing algorithm to a target state matching the movement information.
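For illustration, a third prompt voice might ask the user to raise the camera frame rate when the device is moving fast; the parameter name and the 3 m/s threshold below are assumptions, not from the disclosure:

def play_third_prompt_voice(speed_mps: float, camera_fps: int) -> None:
    # Hypothetical rule: faster movement requires a higher camera frame rate.
    required_fps = 60 if speed_mps > 3.0 else 30
    if camera_fps < required_fps:
        print(f"TTS: please switch the camera to {required_fps} fps before continuing")

play_third_prompt_voice(speed_mps=5.0, camera_fps=30)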
According to one or more embodiments of the present disclosure, the method further includes: obtaining a target playback rate according to the movement information; and playing the first prompt voice at the target playback rate.
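One plausible mapping from movement information to the playback rate of the first prompt voice; the specific curve and bounds are assumptions for illustration only:

def target_playback_rate(speed_mps: float) -> float:
    # Speak faster when the user moves faster, so the prompt finishes in time,
    # but clamp the rate to a range that remains intelligible.
    return min(2.0, max(1.0, 1.0 + speed_mps / 5.0))

for speed in (0.0, 2.5, 10.0):
    print(f"{speed} m/s -> {target_playback_rate(speed):.2f}x")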
According to one or more embodiments of the present disclosure, the obtaining a target image processing algorithm for a target image according to the movement information includes: acquiring scene information corresponding to the target image, wherein the scene information is used to represent a travel scene corresponding to the image content of the target image; and obtaining the target image processing algorithm according to the scene information and the movement information.
According to one or more embodiments of the present disclosure, the obtaining the target image processing algorithm according to the scene information and the movement information includes: obtaining a corresponding target function according to the scene information; and configuring the target function based on the movement information to obtain the target image processing algorithm.
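A sketch of this variant, assuming a stand-in scene classifier and a hypothetical "look-ahead horizon" as the movement-dependent configuration; none of these specifics come from the disclosure:

def classify_scene(target_image: str) -> str:
    return "crosswalk"  # stand-in for a real scene classifier

def build_algorithm_from_scene(target_image: str, speed_mps: float):
    scene = classify_scene(target_image)
    # Target function obtained from the scene information ...
    def target_function(image: str, horizon_m: float) -> str:
        return f"{scene} hazards in {image} within {horizon_m:.1f} m"
    # ... configured with the movement information: a faster user needs a longer look-ahead.
    horizon = 5.0 + 2.0 * speed_mps
    return lambda image: target_function(image, horizon)

algorithm = build_algorithm_from_scene("frame_003", speed_mps=1.4)
print(algorithm("frame_003"))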
According to one or more embodiments of the present disclosure, the target image processing algorithm includes at least two sub-processing algorithms, and the obtaining the target image processing algorithm according to the scene information and the movement information includes: determining a target function according to the scene information; determining, according to the movement information, at least two sub-processing algorithms for realizing the target function; and generating the target image processing algorithm from the at least two sub-processing algorithms.
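The composition of sub-processing algorithms can be pictured as a simple function pipeline; the two stages below (obstacle detection, distance estimation) are invented examples of sub-processing algorithms, not ones named in the disclosure:

def detect_obstacles(image: str) -> str:
    return f"obstacles({image})"  # stand-in sub-processing algorithm 1

def estimate_distance(result: str) -> str:
    return f"distance({result})"  # stand-in sub-processing algorithm 2

def compose(*sub_algorithms):
    def target_algorithm(image: str) -> str:
        result = image
        for sub in sub_algorithms:  # run the sub-processing algorithms in sequence
            result = sub(result)
        return result
    return target_algorithm

algorithm = compose(detect_obstacles, estimate_distance)
print(algorithm("frame_004"))  # -> distance(obstacles(frame_004))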
According to one or more embodiments of the present disclosure, the processing the target image based on the target image processing algorithm to generate a first prompt voice includes: acquiring image features of the target image based on the target image processing algorithm, wherein the image features correspond to the target image processing algorithm; obtaining, based on the image features, an image semantic text representing the image content of the target image; and generating the first prompt voice according to the image semantic text.
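The feature-to-text-to-voice chain might look like the following; each of the three functions is a stand-in for a real model (feature extractor, semantic decoder or captioner, TTS engine) and is not specified by the disclosure:

def extract_features(image: str) -> list[float]:
    return [0.9, 0.1]  # stand-in for features produced by the chosen algorithm

def features_to_text(features: list[float]) -> str:
    # stand-in for a semantic decoding / image captioning model
    return "a crosswalk ahead with an approaching bicycle"

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for a real text-to-speech engine

first_prompt_voice = synthesize(features_to_text(extract_features("frame_005")))
print(first_prompt_voice)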
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a speech generation apparatus, comprising:
an interaction module, configured to acquire movement information of a terminal device, wherein the movement information is used to represent the current movement state of the terminal device;
a processing module, configured to obtain a target image processing algorithm for a target image according to the movement information, wherein the target image is an image captured by the terminal device in real time; and
an execution module, configured to process the target image based on the target image processing algorithm and generate a first prompt voice, wherein the first prompt voice is used to represent travel prompt information for a user of the terminal device.
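One way the three modules could be wired together, shown as a hypothetical class whose names and interfaces are illustrative rather than taken from the disclosure:

class SpeechGenerationApparatus:
    def __init__(self, interaction, processing, execution):
        self.interaction = interaction  # acquires the movement information
        self.processing = processing    # derives the target image processing algorithm
        self.execution = execution      # runs it and generates the first prompt voice

    def run(self, target_image: str) -> str:
        movement = self.interaction()
        algorithm = self.processing(movement)
        return self.execution(algorithm, target_image)

apparatus = SpeechGenerationApparatus(
    interaction=lambda: {"speed_mps": 1.4},
    processing=lambda movement: (lambda image: f"summary of {image}"),
    execution=lambda algorithm, image: f"TTS({algorithm(image)})",
)
print(apparatus.run("frame_006"))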
According to one or more embodiments of the present disclosure, the interaction module is further configured to: after the movement information of the terminal device is acquired, play a second prompt voice according to the movement information, wherein the second prompt voice is used to represent at least one candidate parameter for the target image processing algorithm; and in response to a first instruction input by the user, obtain a target parameter from the at least one candidate parameter; and the processing module, when obtaining a target image processing algorithm for a target image according to the movement information, is specifically configured to: obtain the target image processing algorithm based on the movement information and the target parameter.
According to one or more embodiments of the present disclosure, the processing module, when obtaining a target image processing algorithm based on the movement information and the target parameter, is specifically configured to: obtain a corresponding target function according to the movement information; and configure the target function based on the target parameter to obtain the target image processing algorithm.
According to one or more embodiments of the present disclosure, the interaction module is further configured to, before the target image is processed based on the target image processing algorithm to generate the first prompt voice: play a third prompt voice, wherein the third prompt voice is used to indicate at least one target operation on the terminal device, and the target operation is used to set at least one device parameter of the terminal device related to the target image processing algorithm to a target state matching the movement information.
According to one or more embodiments of the present disclosure, the interaction module is further configured to: obtain a target playback rate according to the movement information; and play the first prompt voice at the target playback rate.
According to one or more embodiments of the present disclosure, the processing module is specifically configured to: acquire scene information corresponding to the target image, wherein the scene information is used to represent a travel scene corresponding to the image content of the target image; and obtain the target image processing algorithm according to the scene information and the movement information.
According to one or more embodiments of the present disclosure, the processing module, when obtaining a target image processing algorithm according to the scene information and the movement information, is specifically configured to: obtain a corresponding target function according to the scene information; and configure the target function based on the movement information to obtain the target image processing algorithm.
According to one or more embodiments of the present disclosure, the target image processing algorithm includes at least two sub-processing algorithms, and the processing module, when obtaining the target image processing algorithm according to the scene information and the movement information, is specifically configured to: determine a target function according to the scene information; determine, according to the movement information, at least two sub-processing algorithms for realizing the target function; and generate the target image processing algorithm from the at least two sub-processing algorithms.
According to one or more embodiments of the present disclosure, the execution module is specifically configured to: acquire image features of the target image based on the target image processing algorithm, wherein the image features correspond to the target image processing algorithm; obtain, based on the image features, an image semantic text representing the image content of the target image; and generate the first prompt voice according to the image semantic text.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech generation method according to the first aspect and the various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the speech generation method according to the first aspect and the various possible designs of the first aspect.
In a fifth aspect, according to one or more embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the speech generation method according to the first aspect and the various possible designs of the first aspect.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the features described above with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (13)

1. A speech generation method, comprising:
acquiring movement information of a terminal device, wherein the movement information is used to represent the current movement state of the terminal device;
obtaining a target image processing algorithm for a target image according to the movement information, wherein the target image is an image captured by the terminal device in real time; and
processing the target image based on the target image processing algorithm to generate a first prompt voice, wherein the first prompt voice is used to represent travel prompt information for a user of the terminal device.
2. The method according to claim 1, further comprising, after the acquiring the movement information of the terminal device:
playing a second prompt voice according to the movement information, wherein the second prompt voice is used to represent at least one candidate parameter for the target image processing algorithm; and
in response to a first instruction input by a user, obtaining a target parameter from the at least one candidate parameter;
wherein the obtaining a target image processing algorithm for a target image according to the movement information comprises:
obtaining the target image processing algorithm based on the movement information and the target parameter.
3. The method according to claim 2, wherein the obtaining the target image processing algorithm based on the movement information and the target parameter comprises:
obtaining a corresponding target function according to the movement information; and
configuring the target function based on the target parameter to obtain the target image processing algorithm.
4. The method according to claim 1, further comprising, before the processing the target image based on the target image processing algorithm to generate a first prompt voice:
playing a third prompt voice, wherein the third prompt voice is used to indicate at least one target operation on the terminal device, and the target operation is used to set at least one device parameter of the terminal device related to the target image processing algorithm to a target state matching the movement information.
5. The method according to claim 1, further comprising:
obtaining a target playback rate according to the movement information; and
playing the first prompt voice at the target playback rate.
6. The method according to claim 1, wherein the obtaining a target image processing algorithm for a target image according to the movement information comprises:
acquiring scene information corresponding to the target image, wherein the scene information is used to represent a travel scene corresponding to the image content of the target image; and
obtaining the target image processing algorithm according to the scene information and the movement information.
7. The method according to claim 6, wherein the obtaining the target image processing algorithm according to the scene information and the movement information comprises:
obtaining a corresponding target function according to the scene information; and
configuring the target function based on the movement information to obtain the target image processing algorithm.
8. The method according to claim 6, wherein the target image processing algorithm comprises at least two sub-processing algorithms, and the obtaining the target image processing algorithm according to the scene information and the movement information comprises:
determining a target function according to the scene information;
determining, according to the movement information, at least two sub-processing algorithms for realizing the target function; and
generating the target image processing algorithm from the at least two sub-processing algorithms.
9. The method according to claim 1, wherein the processing the target image based on the target image processing algorithm to generate a first prompt voice comprises:
acquiring image features of the target image based on the target image processing algorithm, wherein the image features correspond to the target image processing algorithm;
obtaining, based on the image features, an image semantic text representing the image content of the target image; and
generating the first prompt voice according to the image semantic text.
10. A speech generation apparatus, comprising:
an interaction module, configured to acquire movement information of a terminal device, wherein the movement information is used to represent the current movement state of the terminal device;
a processing module, configured to obtain a target image processing algorithm for a target image according to the movement information, wherein the target image is an image captured by the terminal device in real time; and
an execution module, configured to process the target image based on the target image processing algorithm and generate a first prompt voice, wherein the first prompt voice is used to represent travel prompt information for a user of the terminal device.
11. An electronic device, comprising: a processor and a memory;
wherein the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the speech generation method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the speech generation method of any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the speech generation method of any one of claims 1 to 9.
CN202311437371.1A 2023-10-31 2023-10-31 Voice generation method, device, electronic equipment and storage medium Pending CN117437903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311437371.1A CN117437903A (en) 2023-10-31 2023-10-31 Voice generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311437371.1A CN117437903A (en) 2023-10-31 2023-10-31 Voice generation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117437903A true CN117437903A (en) 2024-01-23

Family

ID=89553041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311437371.1A Pending CN117437903A (en) 2023-10-31 2023-10-31 Voice generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117437903A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination