Increasing consistency in ChatGPT API responses

Chloe McAree

Over the past few months, I have been exploring the various capabilities of OpenAI’s ChatGPT API across a range of different projects and have been super impressed by its versatility. However, when building applications with it, I have encountered a particular challenge, especially when integrating it with other software tools: getting the model to consistently return responses in a reliable format.

When building applications using third-party APIs, you typically have a clear understanding of the expected response schema. This makes it easy to parse the response and trigger downstream logic. When building applications using AI models, the responses can be unreliable.

This unreliability has prompted me to explore different techniques that could help standardise the responses, allowing me to parse and act upon the model’s output efficiently.

 

The Problem

Throughout this blog post, I will use real-world examples involving the parsing of customer service support logs to illustrate the different techniques. I also used AI (with a human review) to redact sensitive information.

In most companies, customer feedback is received through a variety of structured and unstructured channels. This feedback may come from review websites, email chains, WhatsApp groups, or Slack. While the diversity of sources makes integration more complex, it is the unstructured channels that make it difficult to parse the feedback and pinpoint any issues users are encountering.

Identifying these issues and flagging them for the development team becomes a complex task, as unstructured customer support conversations often contain additional data that is irrelevant to the use case and can easily produce false positives, among other things.

To approach this problem, I took a sample of our customer support data, redacted it for sensitive information, and divided it into several distinct chunks of data as preparation for my ChatGPT processing experiments. This division allowed me to test using different sets of customer data.

Here is my initial prompt. It informs the model that it will be provided with customer messages and needs to evaluate these messages to identify any support issues that have not yet been resolved. 
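As a rough sketch, the request looked something like the following, using the pre-1.0 openai Python SDK. The prompt wording and the model name here are illustrative rather than a verbatim copy of what I used.

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

customer_messages = "..."  # one chunk of the redacted customer support data

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "You will be provided with customer messages. "
                "Evaluate them and identify any support issues or "
                "requests that have not already been resolved."
            ),
        },
        {"role": "user", "content": customer_messages},
    ],
)

print(response["choices"][0]["message"]["content"])
```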

I fed the first chunk of customer data into the user’s content input (as shown in the prompt above), and here is the response:

The response appears promising; it categorises my input as “Support issues or requests that have not already been resolved.” This categorisation is something I could potentially parse for. While the structure isn’t perfect, the overall result is on target.

Let’s replicate the same prompt, this time using the next chunk of customer messages, to see whether we get a similar response. As shown below, the result is consistent: it is again labelled as “Support issues or requests that have not already been resolved.”

 

 

I kept sending the same prompt with different sets of customer data, and in one of the tests I received this as my response:

 

As you can see, this response is different from the others. If I had built my application around the first two responses, it could have run into trouble with this one: it isn’t labelled the same way, and the structure is different.

Several strategies could address this. The first step is to be specific in our prompt about the desired format of the response!

 

Prompt engineering for better responses

Currently, our prompt is quite generic and lacks specific instructions for the model. So, let’s attempt to convey the precise format we want for this data.

As a software engineer developing a platform around this capability, I might want to take this response data and then use something like the Trello API to create support tickets. To achieve this, I would ideally need the response in a JSON format.

I would want the issues to be organised as an array of objects, where each object has essential attributes related to the issue, such as a title, time, and description. So, let’s look at how I can update the prompt to include all of this information.

After numerous trials and errors with different prompts, I found that the following one produced the best results:
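As a sketch, it was along the following lines. The two quoted instructions are discussed in the sections below; the format block itself (including the top-level "issues" key) is illustrative rather than an exact copy of my prompt.

```python
system_prompt = """
You are a JSON API for extracting, evaluating, and comprehending text from customer messages.
Identify any support issues or requests that have not already been resolved.
Only provide an RFC8259 compliant JSON response following this format without deviation:
{
  "issues": [
    {
      "title": "Short summary of the issue",
      "time": "When the issue was reported",
      "description": "Details of the issue"
    }
  ]
}
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": customer_messages},
    ],
)
```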

 

Now, let’s break down this prompt to understand why it performs better.

 

Explicit Roles

Firstly, it is imperative to clearly communicate the persona the model should adopt when processing your message. Notice how I use the phrase, “You are a JSON API for extracting, evaluating, and comprehending text from customer messages.” This gives the model a specific, unambiguous role: treat the messages it receives as a JSON API would.

 

Explicit Standards

It is also crucial to provide specific details about the expected format, potentially specifying a version or type, while making it explicit that deviations are unacceptable. As seen in the prompt, I state, “Only provide an RFC8259 compliant JSON response following this format without deviation.” This level of clarity ensures that the model comprehends the precise JSON format required.

 

Explicit References

Finally, supplying an example of the anticipated response is beneficial. In the prompt, I present the exact format I’m aiming for, along with the anticipated labels for the data. This enables me to structure my application code around this template and accurately extract these desired values.

The responses I received were as follows:

Example 1:

 

Example 2:

 

This does seem to significantly improve the reliability of the model’s responses, but I have still seen cases where it added properties to the array objects that I didn’t ask for, and cases where the JSON returned was invalid and broke when my application tried to parse it.
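A small amount of defensive parsing helps here. The sketch below assumes the response format from the prompt above (a top-level "issues" array with title, time and description fields); it rejects invalid JSON and drops any extra properties.

```python
import json

EXPECTED_KEYS = {"title", "time", "description"}

def parse_issues(raw_response: str):
    """Parse the model's reply, rejecting invalid JSON and unexpected properties."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # invalid JSON: leave it to the caller to retry or handle manually

    cleaned = []
    for issue in data.get("issues", []):
        if not isinstance(issue, dict):
            return None
        # keep only the attributes we asked for, dropping anything extra
        cleaned.append({key: issue.get(key) for key in EXPECTED_KEYS})
    return cleaned
```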

 

ChatGPT Function Calling

Another feature I wanted to explore, to see how it affects the reliability of the model’s response format, was Function Calling.

Last month, OpenAI introduced a new feature called Function Calling in the gpt-4-0613 and gpt-3.5-turbo-0613 models. It gives you the ability to describe existing functions from your own codebase to the model.

I will show this in an example below, but when using the API, you can now send an array named “functions”, where you specify the name of the function, a description of what it does and a list of the parameters it should take (described as a JSON schema).

The model then intelligently decides, based on the input, if a specific function should be called or not.

If the model determines that a function should be called, it will return the name of that function in its response. This really is an intriguing capability and has the potential to greatly enhance the extensibility of our code.

An important thing to note is that the Chat Completions API does NOT call the function; instead, the model generates JSON that you can use to call the function in your code.

 

Function Calling Examples

We will utilise the same customer support data and use case to test out the function calling feature. When thinking about how I would want this application to work, I would have a function set up in my code that is responsible for raising support tickets in Trello; it would take in an array of issues and then use the Trello API to create new tickets on our support team’s backlog.

In my initial attempt at function calling, I omitted the strict JSON criteria from the prompt and introduced a functions array containing my Trello ticket function.
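The request looked roughly like the sketch below. The function name, descriptions and schema are illustrative stand-ins for my real Trello helper, but the shape of the functions array (name, description, and parameters described as a JSON schema) is what the API expects.

```python
functions = [
    {
        # hypothetical name for the Trello helper in my codebase
        "name": "raise_trello_support_tickets",
        "description": "Creates tickets on the support team's Trello backlog, "
                       "one per unresolved customer issue.",
        "parameters": {
            "type": "object",
            "properties": {
                "issues": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "time": {"type": "string"},
                            "description": {"type": "string"},
                        },
                        "required": ["title", "description"],
                    },
                }
            },
            "required": ["issues"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",  # one of the models that supports function calling
    messages=[
        {
            "role": "system",
            "content": "Evaluate the customer messages and identify any support "
                       "issues or requests that have not already been resolved.",
        },
        {"role": "user", "content": customer_messages},
    ],
    functions=functions,
    function_call="auto",  # let the model decide whether to suggest a call
)
```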

 

However, in this instance, the model chose to provide a conventional response rather than suggesting a function call:

 

Given the uncertainty over whether the model will opt to invoke a function or provide a standard response, the reliability issue becomes more challenging. Despite this, I iterated through different prompts in an effort to achieve successful function calls. Eventually, the following prompt produced one. In it, I directly instruct the model to identify the appropriate function to call if it identifies a customer support issue or request.
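The wording below is a sketch of that instruction rather than my exact prompt text:

```python
system_prompt = (
    "You evaluate customer messages. If you identify any customer support "
    "issues or requests that have not already been resolved, identify the "
    "appropriate function to call and provide its arguments."
)
```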

 

This is the response I received:

You can see that when a function call is returned, the “finish_reason” is set to “function_call”. This simplifies the process of parsing responses: by examining the finish reason, we can determine whether to parse the response conventionally or extract a function name and parameters from it.
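A minimal sketch of that branching logic, assuming the pre-1.0 openai SDK response shape; raise_trello_support_tickets and handle_text_response stand in for my own application code.

```python
import json

choice = response["choices"][0]

if choice["finish_reason"] == "function_call":
    call = choice["message"]["function_call"]
    name = call["name"]
    arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
    if name == "raise_trello_support_tickets":
        # our own code makes the actual Trello API calls
        raise_trello_support_tickets(**arguments)
else:
    # conventional text response: parse the message content instead
    handle_text_response(choice["message"]["content"])
```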

While experimenting with the prompt and the rest of my customer support samples, a few issues came to light. In some instances, the model did not suggest invoking the function and instead provided an unstructured response containing the issues. On occasion, the model provided parameter values that I was not expecting and had not explicitly specified, and in one scenario it returned the following as a function call, resembling an attempt to generate function code:

 

 

I do like the idea of function-calling responses and believe that eventually they will make it easier to build functionality on top of the information provided by ChatGPT and enhance the overall versatility of your program. However, at present, the responses are quite unpredictable, so a level of post-processing will be necessary!

It is also important to note that functions and their descriptions are injected into the system message, so they count towards your context limit and are billed as input tokens; it is therefore recommended that you limit the number of functions you provide.

 

Conclusion

Overall, the ChatGPT API is extremely powerful, and I am super impressed with it! When working with any kind of AI model, some degree of uncertainty is expected, so building safeguards into our applications becomes essential.

While working with these models, it’s crucial not only to invest in creating a robust ETL pipeline for pre-processing and cleaning data to achieve optimal results but also to dedicate time to prompt engineering and fine-tuning. This will help ensure your model has the right requirements, clear instructions and a goal.

Ultimately, whether you opt for prompt engineering or function calling, the model can still produce unexpected or hallucinated responses at scale. I recommend dedicating time to developing response post-processing strategies. This might involve checks to validate JSON integrity or, when a function is suggested, to confirm that the function name is valid and the parameters are correct. If these checks fail, you could implement functionality to search the response for the required data, or even establish a retry mechanism that interacts with the model until the desired response format is achieved.
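As a rough illustration of the retry approach, reusing the parse_issues helper and system_prompt sketched earlier (the attempt limit is arbitrary):

```python
def get_issues_with_retry(customer_messages: str, max_attempts: int = 3):
    """Re-prompt the model until the response validates, or give up after a few attempts."""
    for _ in range(max_attempts):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": customer_messages},
            ],
        )
        issues = parse_issues(response["choices"][0]["message"]["content"])
        if issues is not None:
            return issues
    return None  # surface the failure to the caller rather than guessing
```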
