Monday, August 4, 2025
Implementing Streaming Responses with Express and Azure OpenAI

Context
In the previous article, I showed how to establish HTTP communication between a client of our choice and a model provider.
It's clear that while establishing the communication can be trivial, it comes with latency limitations: operating with a Request <> Response model leaves us with a User Experience debt we can't ignore.
What we can do instead is request that tokens be sent to us as soon as they’re available, rather than waiting for the full response to be completed. In other words, we request a stream of data.
This doesn’t necessarily reduce the total time needed to generate the final response, but it significantly reduces the wait time for the first generated token.
TL;DR
You can find all the code in the following GitHub repository.
Scope
We are at the point where, if we want to build modern applications, we need to move beyond the simple request-response pattern.
The goal of this article is to enable data streaming with minimal and reliable configuration.
Core concepts
Let's talk about SSE, an elegant way to stream updates from the API we just built, still using plain HTTP.
We consider SSE (Server-Sent Events) when we need to send real-time updates from the server to the client using a single, long-lived HTTP connection. This eliminates the need for the client to constantly request updates.
The following image shows, in general terms, how it would work in our implementation.

To simplify, it's similar to long-polling, but more efficient for one-way communication from the server to the client.
This makes it simpler to implement and integrate into our existing HTTP infrastructure, without the need for special protocol handling.
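To make the mechanism concrete, here is a minimal sketch of what writing SSE events from an Express handler looks like on the wire. This is not the controller we build in this article; the file name and route are made up for illustration.

// sse-sketch.ts: hypothetical standalone example of the SSE wire format
import express from "express";

const app = express();

app.get("/events", (_req, res) => {
  // These headers tell the client to keep the connection open and parse the body as an event stream
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  // Each event is one or more "field: value" lines terminated by a blank line
  res.write(`data: ${JSON.stringify({ message: "hello" })}\n\n`);

  // A named event; an EventSource client would subscribe with addEventListener("done", ...)
  res.write("event: done\ndata: {}\n\n");
  res.end();
});

app.listen(4000);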
Why not WebSockets?
The simple and direct answer is that we'd be adding more complexity to something SSE already covers.
WebSockets make sense when we actually need bidirectional, real-time communication.
For that, we must be aware that WebSockets require maintaining a stateful connection between client and server.
In other words, operational overhead.
Requesting the response as a stream
Starting from the previous development, I need to upgrade the request call to the Azure AI endpoint. Specifically, I must indicate that we want to receive the response as a stream.
To do this, we add "stream": true to the payload.
// chat.controller.ts
import { Request, Response } from "express";
// `env` holds AZURE_ENDPOINT and AZURE_API_KEY, loaded elsewhere in the project

export const chat = async (req: Request, res: Response) => {
  // SSE headers: keep the connection open and tell the client to expect an event stream
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const payload = {
    input: req.body.input,
    model: req.body.model,
    stream: true
  };
  // the request to Azure and the stream handling follow below
It is also good practice to set the appropriate response headers for the client requesting data from us; without them, streaming behavior is not guaranteed and may not be fully compatible with how clients like EventSource expect to receive data.
Processing the response
Next, the workflow has to check whether Azure's response was successful and has a body. If not, it can simply handle the error and end the connection right there.
Then it continues by reading the response from Azure and converting it into valid SSE events.
After that, we filter and format each chunk so the client receives updates as soon as the events arrive. Once all events have been received, the event type in the response will be response.completed. That will be our flag.
  try {
    const azureRes = await fetch(env.AZURE_ENDPOINT, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${env.AZURE_API_KEY}`
      },
      body: JSON.stringify(payload)
    });

    if (!azureRes.ok || !azureRes.body) {
      const err = await azureRes.text();
      console.error(err);
      res.write(`event: error\ndata: ${JSON.stringify({ error: err })}\n\n`);
      return res.end();
    }

    // Read Azure's streamed body and forward the SSE events to our client
    const reader = azureRes.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const chunk = decoder.decode(value);
      // Note: a chunk may split a line; production code should buffer partial lines
      for (const line of chunk.split("\n")) {
        if (!line.startsWith("data: ")) continue;

        const json = JSON.parse(line.slice("data: ".length));
        const output = JSON.stringify(json);
        res.write(`data: ${output}\n\n`);

        // "response.completed" is our flag that the stream is finished
        if (json.type === "response.completed") {
          res.write("event: done\n\n");
        }
      }
    }
    res.end();
  } catch (err) {
    console.error(err);
    res.write(`event: error\ndata: ${JSON.stringify({ error: "Stream failed" })}\n\n`);
    res.end();
  }
};
Testing
That was it. To test whether it works, we can use our terminal again and run the following curl:
curl -N -H "Content-Type: application/json" \
-X POST http://localhost:4000/v1/chat \
-d '{
"input": "what are things I should do in Coyhaique, Chile. explain briefly? short and simple"
}'

You might get a result like this:
data: {"type":"response.created","sequence_number":0,"response":{"id":"resp_...","object":"response","created_at":1754262348,"status":"in_progress","background":false,"content_filters":null,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"o4-mini","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"reasoning":{"effort"::null:null::true:10::::::10::null:null:
: ::109::1:0:
: ::110::1:0:
: ::111::1:0::::
: ::112:1:::::::::
event: done
:
I'd call that a success.
Conclusions
Having improved how we receive information, it’s important to realize that we’ve satisfied one requirement but introduced another. If we’re receiving data continuously until the server ends the stream, we need to account for this flow when defining our architecture.
If we place any proxy in front of our solution, we must be ready to handle increased complexity during debugging. We also need to keep in mind any cache invalidation policies and/or TTL definitions.
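For example, if nginx is the proxy in front of the Express app, its response buffering can hold back SSE chunks until a buffer fills. A common mitigation, assuming an nginx setup that is not part of this article, is to disable buffering for this route from the handler itself:

// Optional tweak inside chat.controller.ts when nginx proxies this route:
// nginx honors this header and turns off proxy buffering for the response,
// so each SSE chunk is flushed to the client as soon as we write it.
res.setHeader("X-Accel-Buffering", "no");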
Equally important is the client-side implementation now that we’re streaming data. If we’re building a modern web application, it must support continuous data flow.
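As a reference, here is a minimal browser-side sketch; the endpoint URL and request body match the ones used in this article, everything else is illustrative. Note that the built-in EventSource API only issues GET requests, so for our POST endpoint we read the stream with fetch instead:

// client.ts: hypothetical sketch of consuming the stream from a web client
async function streamChat(input: string): Promise<void> {
  const res = await fetch("http://localhost:4000/v1/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input })
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Naive line split: production code should buffer partial lines across chunks
    for (const line of decoder.decode(value).split("\n")) {
      if (line.startsWith("data: ")) {
        console.log(JSON.parse(line.slice("data: ".length)));
      }
    }
  }
}

streamChat("what are things I should do in Coyhaique, Chile?");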
Again, we moved away from Request <> Response.
The right balance between the current implementation and the previous one lies in our ability to accept the tradeoffs each option presents.
In the long run, it’s important to evaluate the cost-benefit trade-offs of implementing observability, monitoring, security, and scalability measures.
Next steps
For the next entry, it will be interesting to implement the function calling feature, which is available in some OpenAI models.