#ATAGTR2023 Speaker

Welcome to the 8th Edition of the Global Testing Retreat 2023!

About Speaker

Sayan heads the Data Science Chapter in India at Telstra. With 8 US patents filed across machine learning and conversational intelligence, Sayan brings 14 years of analytics and data science experience spanning the Energy, Pharma, Telecommunications, Retail, CPG and Education sectors. A Silver Medallist from IIT Kharagpur (B.Tech.), Sayan completed his PGDM (MBA) from IIM Bangalore in 2014. Recently, Sayan was recognized among the 40 Under 40 professionals in Technology in Supply Chain. Apart from heading the Data Science team, Sayan has put together a community of technology enthusiasts in the form of the "Deep Tech Club" at Telstra, driving product impact at scale through applied research in AI (reinforcement learning, LLMs) and other deep tech disciplines such as Blockchain, Quantum Computing and Cybersecurity. Sayan is passionate about mentoring startups, hackathons, cricket and trekking.

Sayan Deb Kundu

Data Science Chapter Lead at Telstra

Interactive Talk - Testing Gen AI Applications

Testing Gen AI Applications


Generative AI applications have seen a huge boom in the last few months. However, most of these interesting applications have been limited to the PoC phase, and moving them to production poses several challenges. One of the key challenges is the reliability of output, which arises from the non-deterministic nature of LLMs. Hence, testing Gen AI applications becomes very tricky.


Running the same model twice with the same prompt can return slightly different responses, and those responses come back as plain text, so developers have no way to predict precisely how a model will respond. Moreover, updating prompts in GenAI applications can introduce unintended consequences and may lead to regressions in the application's behaviour. With LLMs acting as black boxes, prompt testing empowers developers to evaluate the quality of LLM responses, switch models, and ensure consistent behaviour across model updates.
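As a minimal sketch of what such a prompt test could look like, the snippet below runs the same prompt several times and checks that every response still satisfies a basic expectation. The `generate` helper is a hypothetical wrapper around whichever LLM API is in use, not part of any specific library.

```python
# Minimal prompt-test sketch: run the same prompt several times and
# verify each response still meets a basic expectation.

def generate(prompt: str) -> str:
    """Hypothetical wrapper: call the underlying LLM, return plain text."""
    raise NotImplementedError("wire this to your model provider")

def test_prompt_is_consistent(prompt: str, must_mention: str, runs: int = 5) -> bool:
    """Return True only if every run of the prompt mentions the expected term."""
    responses = [generate(prompt) for _ in range(runs)]
    return all(must_mention.lower() in r.lower() for r in responses)

# Example: a code-review prompt should always flag a hard-coded secret.
# passed = test_prompt_is_consistent(review_prompt, must_mention="secret")
```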


Defining Test Cases and Generating Test Prompts


First, identify the aspects that require the most quality assurance. In our case, the primary concern was the consistency of the tool's reviews. To measure this consistency, we focused on a list of common coding flaws that the tool should identify, such as unintentionally exposing secrets, excessive code nesting, overly complex functions, and missing "await" keywords for promises. Each flaw becomes a test case: did the tool consistently identify it and comment on it?
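One way such test cases could be structured is sketched below: each common flaw pairs a code sample containing the flaw with the finding a correct review must mention. The dataclass and the sample snippets are illustrative assumptions, not the actual suite described in the talk.

```python
# Illustrative structure for the test cases: one entry per coding flaw.
from dataclasses import dataclass

@dataclass
class ReviewTestCase:
    name: str               # the flaw under test
    code_sample: str        # input handed to the AI review tool
    expected_finding: str   # what a correct review must mention

TEST_CASES = [
    ReviewTestCase(
        name="exposed secret",
        code_sample='API_KEY = "sk-live-123456"',
        expected_finding="hard-coded secret",
    ),
    ReviewTestCase(
        name="missing await",
        code_sample="async function f() { fetchData(); }",
        expected_finding="missing await",
    ),
    # further cases: excessive nesting, overly complex functions, etc.
]
```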


Evaluating the Quality of AI Responses


Exact-match snapshot testing is often not an effective option here. Asking an LLM whether a given review matches the test case and whether the test should pass would probably work, but doing so means using a prompt to test prompts, which leads to spiralling uncertainties.


We opted for a different approach, using snapshots (examples of results we consider satisfactory and that we expect) as reference points for evaluation. The snapshots we used are plain text files containing an ideal response. Using snapshots reduced the complexity of evaluating the quality of AI responses down to comparing two plain text documents, and AI happens to be very good at comparing text documents: we can use an AI embeddings model to generate vector embeddings of the two documents.
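The comparison step could look like the sketch below, which scores a model response against a stored snapshot via cosine similarity of their embeddings. The `embed` helper is a hypothetical stand-in for any embeddings model endpoint; the similarity computation itself needs no extra dependencies.

```python
# Sketch: compare a model response to an ideal snapshot via embeddings.
import math

def embed(text: str) -> list[float]:
    """Hypothetical helper: return the embedding vector for `text`."""
    raise NotImplementedError("wire this to your embeddings provider")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 for semantically identical text, near 0 for unrelated text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def snapshot_similarity(response: str, snapshot: str) -> float:
    """Score how close the model's response is to the stored ideal snapshot."""
    return cosine_similarity(embed(response), embed(snapshot))
```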


An embedding of a document is a translation of the text into a vector of numbers that encapsulates the semantic meaning of the underlying text. Two similar documents will have embeddings that are closely spaced, whereas very different documents will have embeddings spaced far apart. We refer to this approach as "Embedded Snapshots". Grading each comparison as PASS, WARN, or FAIL, with WARN triggering human intervention, can be an effective option in this regard.
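A grading step on top of the similarity score might look like the sketch below. The 0.9 and 0.75 thresholds are illustrative assumptions; in practice they would be calibrated against known-good and known-bad responses.

```python
# Sketch: map a similarity score to PASS, WARN (human review), or FAIL.
def grade(similarity: float, pass_at: float = 0.9, warn_at: float = 0.75) -> str:
    if similarity >= pass_at:
        return "PASS"
    if similarity >= warn_at:
        return "WARN"  # close to the snapshot, but a human should look
    return "FAIL"
```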


As a future step, we can also add regression testing for changes in the underlying LLM (fine-tuning, swapping the underlying model, etc.) and thereby maintain the stability of GenAI applications.
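Such a regression check could be sketched as below: run every test case against both the current and the candidate model and flag any case whose grade got worse. The `run_case` helper is a hypothetical composition of the earlier steps (generate, compare to snapshot, grade).

```python
# Sketch: flag test cases that regress when the underlying model changes.
GRADE_ORDER = {"PASS": 2, "WARN": 1, "FAIL": 0}

def run_case(case_name: str, model: str) -> str:
    """Hypothetical helper: run one embedded-snapshot test against `model`."""
    raise NotImplementedError("compose generate + snapshot_similarity + grade")

def find_regressions(case_names: list[str], current: str, candidate: str) -> list[str]:
    """Return the test cases where the candidate model scores worse."""
    return [
        name for name in case_names
        if GRADE_ORDER[run_case(name, candidate)] < GRADE_ORDER[run_case(name, current)]
    ]
```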

Hear what Sayan has to say about the interactive session