
Explanation:
Box 1: Speech-to-text
Use speech-to-text recognition when you need to identify the language spoken in an audio source and then transcribe that speech to text.
Box 2: Text-to-speech
The required output is voice. Text-to-speech, also known as speech synthesis, enables your applications, tools, or devices to convert text into humanlike synthesized speech. You can use the humanlike prebuilt neural voices out of the box, or create a custom neural voice that is unique to your product or brand.
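As a rough illustration of the text-to-speech side, the sketch below builds an SSML (Speech Synthesis Markup Language) document, the XML format a synthesizer can take as input to select a voice. It uses only the Python standard library. The helper name build_ssml is hypothetical, and the default voice en-US-JennyNeural is an assumption chosen as an example of a prebuilt neural voice; neither comes from the question itself.

```python
import xml.etree.ElementTree as ET

# W3C namespace for SSML documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Return an SSML string asking `voice` to speak `text`.

    build_ssml is a hypothetical helper; the default voice name is an
    assumed example of a prebuilt neural voice.
    """
    # Root <speak> element carrying the SSML version, namespace, and language.
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    # <voice> selects which synthesized voice reads the enclosed text.
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    # ElementTree escapes characters such as & and < when serializing.
    voice_el.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your order has shipped."))
```

A string built this way could then be handed to a speech synthesizer, for example through the Azure Speech SDK's SSML synthesis call, assuming the SDK is installed and a Speech resource key and region are configured; that step is not shown here.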
Reference:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-to-text
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/text-to-speech