QK Helps Leading Indian Insurer Evaluate its Gen AI-powered Chatbot

Overview

QK, a global leader in AI-powered reliability and quality engineering, collaborated with a leading Indian insurer to optimize the performance of its customer-facing AI-powered chatbot. The chatbot, powered by generative AI complemented with Retrieval-Augmented Generation (RAG), aimed to enhance customer experience by providing instant, accurate, and personalized responses to inquiries about our client's insurance products and services.

Client Overview

Our client is a leading life insurance company committed to making insurance accessible to all. As part of its commitment, it offers a range of insurance products to cater to individuals’ and corporations’ diverse financial security needs. Dedicated to customer-centricity, the insurance giant is actively transforming into a digital organization, focusing on customer experience and leveraging technologies like AI, ML, and data analytics for improved services and efficiency.

Business Objectives & Challenges

To improve customer experience and personalization, our client designed an AI-powered chatbot harnessing generative AI and RAG to instantly and accurately answer customer queries about its range of products. The client wanted to failproof the chatbot by validating its effectiveness and accuracy and identifying areas of improvement, ensuring it delivers value to end users and advances their digital-first business vision.


Goals


  • Accelerate time to market
  • Improve cost efficiencies
  • Improve ROI for investments made in innovation
  • Access scalability
The QK Strategy: AI-Augmented QE Framework for Chatbot Testing

After thoroughly assessing our client's requirements and building on our in-house AI quality engineering framework, we developed a comprehensive plan: assess the chatbot across six key performance dimensions, then deploy a two-phase testing approach to ensure peak performance, accuracy, and continuous improvement.

Chatbot Testing Performance Dimensions

QK's chatbot testing framework defines six key areas for assessing the performance of an AI-powered chatbot.

Two-phase Testing Approach

QK tested the functional performance of the generative AI-powered chatbot using a two-phase approach: test data generation followed by evaluation.

Phase 1: Test Data Generation

The test data generation phase was divided into three stages:

Data Asset Creation

This stage focused on creating a comprehensive dataset to test the generative AI-powered chatbot. The test dataset comprised documents with annotated data points covering questions, their corresponding expected answers, and validation data created by an AI data specialist. To ensure comprehensive test coverage, we used LLMs to generate multiple variations of the initial questions.

The human-preference validation dataset, designed to provide guardrails for the chatbot to refine its responses, was structured into three distinct classifications:

  • Positive examples showcasing questions with relevant answers in the documents
  • Negative examples showcasing questions with no corresponding answers in the documents
  • Edge cases testing the chatbot's ability to handle ambiguous, open-ended, or reasoning-based queries
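
To make this stage concrete, here is a minimal Python sketch of data asset creation. It is illustrative rather than QK's internal tooling: call_llm is a stand-in for whatever model API generates the question variations, and the field names and sample values are assumptions.

    import json
    from dataclasses import asdict, dataclass, field

    @dataclass
    class TestCase:
        question: str
        expected_answer: str  # annotated by the AI data specialist
        label: str            # "positive", "negative", or "edge"
        variations: list = field(default_factory=list)

    def call_llm(prompt: str) -> str:
        """Stand-in for the team's LLM API; returns the model's text output."""
        raise NotImplementedError

    def generate_variations(question: str, n: int = 5) -> list:
        """Ask an LLM for n paraphrases of a seed question to widen coverage."""
        prompt = (f"Rewrite the following insurance question in {n} different ways, "
                  f"one per line, preserving its meaning:\n{question}")
        return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

    # Seed case annotated against the source policy documents (sample values)
    seed = TestCase(
        question="Can I return my policy after purchase?",
        expected_answer="Yes, within the free-look period stated in the policy document.",
        label="positive",
    )
    seed.variations = generate_variations(seed.question)
    with open("validation_set.jsonl", "a") as f:
        f.write(json.dumps(asdict(seed)) + "\n")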

Context Retrieval & Response Generation

An AI engineer used the generated test data (user queries) to interact with the chatbot. For each user query, the AI engineer performed the following activities:

  • Triggered the chatbot to retrieve relevant document chunks (context) from the provided documents
  • Captured the retrieved context for later analysis
  • Instructed the chatbot to generate a response to the query
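
As a sketch, the per-query interaction can be captured as below. The chatbot client and its retrieve and generate methods are hypothetical stand-ins for the chatbot's actual endpoints; latency is recorded because response time is benchmarked later.

    import time

    def run_query(chatbot, query: str) -> dict:
        """Trigger retrieval, capture the returned context, then capture the response."""
        start = time.perf_counter()
        contexts = chatbot.retrieve(query)            # relevant document chunks (RAG context)
        response = chatbot.generate(query, contexts)  # response grounded in that context
        return {
            "query": query,
            "retrieved_contexts": contexts,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 2),
        }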

Test Data Preparation

Having exercised the chatbot for its retrieval and response efficiencies, the AI engineer compiled the test data, which included:

  • Original query
  • Retrieved contexts
  • Generated responses
  • Expected answers
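
Compiled, each record pairs a captured run with its annotated expectation. A minimal sketch serializing to JSONL follows; field names and sample values are illustrative.

    import json

    def compile_record(run: dict, expected_answer: str) -> dict:
        """Join a captured chatbot run with its annotated expected answer."""
        record = dict(run)  # query, retrieved_contexts, response, latency_s
        record["expected_answer"] = expected_answer
        return record

    # One captured run (illustrative values)
    run = {"query": "Can I return my policy after purchase?",
           "retrieved_contexts": ["...the policy may be returned during the free-look period..."],
           "response": "Yes, you can return the policy during the free-look period.",
           "latency_s": 1.4}
    with open("test_data.jsonl", "a") as f:
        f.write(json.dumps(compile_record(run, "Yes, within the free-look period.")) + "\n")
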
Phase 2: Evaluation

The evaluation phase of the AI chatbot testing involved assessing the generated responses using intelligent automation in conjunction with manual verification to ensure scalable, accelerated, and accurate quality engineering.

AI-driven Evaluation

The first step in this phase used an AI model evaluation platform to calculate the following metrics for each validation set:

  • Context Relevance: Assessed whether the retrieved document chunks (contexts) were relevant to the query and contained the information needed to answer it. Using the platform, each output was labeled as relevant or irrelevant.
  • Groundedness/Hallucination: Determined whether the generated response was grounded in factual information from the retrieved contexts or fabricated through AI hallucination. The platform marked each output as factual or hallucinated.
  • Answer Correctness: Evaluated whether the generated response matched the expected answer. Each output was labeled as correct or incorrect.
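
A rough sketch of producing these three labels per test case is shown below. The judge function is a stand-in for the AI model evaluation platform (the platform itself is not named here), with each metric reduced to a binary yes/no judgment.

    def judge(prompt: str) -> str:
        """Stand-in for the evaluation platform or judge model; returns 'yes' or 'no'."""
        raise NotImplementedError

    def evaluate(record: dict) -> dict:
        """Label one compiled test record on the three evaluation metrics."""
        ctx = "\n".join(record["retrieved_contexts"])
        relevant = judge(f"Question: {record['query']}\nContext: {ctx}\n"
                         "Does the context contain the information needed to answer? yes/no")
        grounded = judge(f"Context: {ctx}\nAnswer: {record['response']}\n"
                         "Is every claim in the answer supported by the context? yes/no")
        correct = judge(f"Expected: {record['expected_answer']}\nActual: {record['response']}\n"
                        "Does the actual answer match the expected answer? yes/no")
        return {
            "context_relevance": "relevant" if relevant == "yes" else "irrelevant",
            "groundedness": "factual" if grounded == "yes" else "hallucinated",
            "answer_correctness": "correct" if correct == "yes" else "incorrect",
        }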

Manual Validation

The AI platform evaluation was followed by manual validation by our data specialist, who reviewed the results for each test case. This process involved verifying the accuracy of the AI-generated evaluation labels, ensuring the reliability of the evaluation process.

Results

At this step of the evaluation, our data specialist and AI engineer came together to analyze the compiled evaluation results to identify patterns and trends in the chatbot’s performance across the various test cases.

The key areas of the analysis focused on benchmarking the chatbot's performance on:
  • The effectiveness of context retrieval in finding relevant information.
  • The prevalence of factual vs. hallucinatory responses.
  • The accuracy of the generated responses compared to expected answers.
  • The time taken to generate the response.
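
Rolling the per-case labels and latencies up into benchmark figures is a simple aggregation. A sketch, assuming each labeled record also carries its measured latency:

    from statistics import mean

    def summarize(labeled: list) -> dict:
        """Aggregate per-case labels into the four benchmark figures."""
        n = len(labeled)
        return {
            "context_relevance_rate": sum(r["context_relevance"] == "relevant" for r in labeled) / n,
            "hallucination_rate": sum(r["groundedness"] == "hallucinated" for r in labeled) / n,
            "answer_correctness_rate": sum(r["answer_correctness"] == "correct" for r in labeled) / n,
            "mean_latency_s": mean(r["latency_s"] for r in labeled),
        }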

Our Impact

We conducted 424 tests blending positive, negative, and edge-case user queries to comprehensively evaluate the chatbot’s performance, effectiveness, and accuracy.

Combining the analysis results, we created a comprehensive test report summarizing the project findings. It highlighted the strengths and weaknesses of the RAG system, along with actionable recommendations for improving the chatbot's performance.

Our report helped the insurer identify key details about the AI chatbot's performance, summarized in the recommendations below.

Recommendations

Based on the evaluation, we made the following recommendations to our client:

  1. Improve Answer Correctness: The chatbot generated 5.9% incorrect and 5% partially correct answers, missing the industry benchmark of greater than 90% correctness. We recommended our client improve answer correctness to meet the benchmark.
  2. Eliminate Hallucinations: We identified several instances of AI hallucination during our evaluation, where the chatbot provided fabricated information outside the scope of the provided documents. This creates a serious risk of customer misinformation, impacting satisfaction and potentially resulting in financial losses due to misrepresentation. We recommended our client prioritize the elimination of these hallucinations to mitigate these risks.
  3. Accelerate Query Responses: The chatbot API took 10 seconds to respond to user queries, missing the industry benchmark of 5 seconds. We recommended our client reduce the query response time to 5 seconds or lower to avoid user frustration and drop-offs.
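
Recommendations like these translate naturally into release gates that can run before each deployment. A minimal sketch using the benchmarks cited above; the summary dict is assumed to come from the aggregation step.

    def release_gate(summary: dict) -> list:
        """Return the list of benchmarks the chatbot currently misses."""
        failures = []
        if summary["answer_correctness_rate"] <= 0.90:  # benchmark: > 90% correct
            failures.append("answer correctness at or below the 90% benchmark")
        if summary["hallucination_rate"] > 0.0:         # goal: zero hallucinated responses
            failures.append("hallucinated responses detected")
        if summary["mean_latency_s"] > 5.0:             # benchmark: 5 seconds or lower
            failures.append("mean response time above the 5-second benchmark")
        return failures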

This case study showcases how QK's comprehensive and intelligent quality engineering framework empowered a leading insurance company to failproof its AI chatbot's performance. By providing detailed recommendations and insights on key chatbot performance parameters, we helped the client deliver enhanced customer experiences and optimize its AI deployment to meet its business goals. The project's success underscores the need for rigorous evaluation of RAG systems before deployment.

Download a copy of this case study for your files


With digital penetration skyrocketing in the Middle East, the BFSI industry continues to evolve to meet the changing demands of the region's digital-first customers. The trend has driven exponential growth in digital banking services, with a recent report estimating that the sector grew 52% between 2021 and 2023.

Our client, one of the top 10 largest banks in the UAE offering a full range of innovative retail and commercial banking services, wanted to capitalize on the exponentially growing sector in the region and proactively stay ahead of the fast-changing banking landscape. To accomplish its goal, the UAE banking giant was undertaking an IT modernization journey to futureproof its digital ecosystem for high-velocity innovation, enhanced reliability, and user-centric experiences.

Combining the trifecta of proprietary processes, expertise, and technology, QualityKiosk analyzed the bank’s requirements and established a Testing Center of Excellence (TCoE) to enable accelerated quality engineering at scale.

Leveraging an AI-first approach, the TCoE helped the banking giant:

  • Accelerate completion of 35+ digital modernization projects
  • Develop an AI-ready enterprise-wide testing framework  
  • Reduce testing regression times by 70%  
  • Enhance automation penetration by 70%
  • Reduce quality engineering costs by 20%

Download the complete case study today and access the roadmap to enable AI-powered enterprise-wide testing. 


Download Case Study

By submitting this form, you acknowledge and agree to our privacy policy, ensuring the confidential handling of your provided information
