Journey Of Training AI For Software Testing

Series 1 – Finding an LLM


In an era of unprecedented software complexity, the quest for efficient and precise software testing methods has become more critical than ever. Large Language Models (LLMs) like GPT-4, Gemini, and LLaMA2 have made significant contributions across various sectors due to their extensive capabilities. However, the specialized demands of software testing necessitate a more tailored approach—a tool finely aligned with the intricate requirements of an organization’s testing processes.

The Need for Customization

Today’s enterprises seek more than just tools; they desire extensions of their workflows—systems that comprehend the subtleties of their projects and adapt to their changing needs dynamically. This necessitates the development of specialized LLMs for software testing, designed not to replace but to enhance and build upon the foundations laid by existing LLMs, providing unparalleled fluency in the language of software testing.

Building a Customized LLM

Selection and Curation

The first step towards creating a tailored LLM is selecting a robust yet flexible base model. Several kinds of LLM are available, and the right one depends on the requirements:

General-purpose LLMs – Aimed at a wide range of tasks.

Domain-specific LLMs – Trained on data from one domain, so they perform better on tasks in that domain.

Multilingual LLMs – Trained on multiple languages for cross-lingual tasks.

Few-shot LLMs – Perform well even when fine-tuned with small amounts of data.

Task-specific LLMs – Tailored for specific tasks such as summarization or translation.

For most software testing platforms, a domain-specific LLM will be ideal, but the exact model choice depends on the integration plans. Once the base model is chosen, a comprehensive dataset that mirrors the variety and complexity of software testing scenarios is required, laying the groundwork for the model’s training.
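As a rough illustration of what one entry in such a dataset might look like, the sketch below stores requirement and test-case pairs as JSON Lines. The field names (`prompt`, `completion`, `category`) and the sample requirements are assumptions made for this example, not a format required by any particular training framework.

```python
import json

# Hypothetical record layout: the field names are illustrative only.
def make_record(requirement: str, test_case: str, category: str) -> dict:
    """Pair a requirement with a reference test case for supervised training."""
    return {
        "prompt": f"Write a test case for the following requirement:\n{requirement}",
        "completion": test_case,
        "category": category,
    }

def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Invented sample data covering two different kinds of testing scenario.
records = [
    make_record(
        "Login must reject passwords shorter than 8 characters.",
        "Given a 7-character password, when the user submits the login form, "
        "then an error is shown and no session is created.",
        "functional",
    ),
    make_record(
        "The search endpoint must respond within 500 ms under normal load.",
        "Given 50 concurrent users, when each sends a search request, "
        "then the 95th percentile response time is below 500 ms.",
        "performance",
    ),
]

print(to_jsonl(records))
```

Tagging each record with a category makes it easier to check that the dataset covers the range of scenarios the platform has to handle.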

Training Methodologies

The essence of this endeavor is the fine-tuning process, where the model is rigorously trained on the chosen dataset to gain a deep understanding of software testing terminologies and methodologies. This phase is characterized by prompt engineering, few-shot learning, and iterative training sessions, all aimed at integrating the model seamlessly into the software testing framework, thereby enhancing its ability to generate test cases, interpret results, and predict potential issues with remarkable accuracy.
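To make the few-shot idea concrete, here is a minimal sketch of how a few-shot prompt for test case generation could be assembled. The `Example` class, the instruction wording, and the sample requirements are all hypothetical; the resulting string would be passed to whichever model is in use.

```python
from dataclasses import dataclass

@dataclass
class Example:
    """One worked example shown to the model (hypothetical structure)."""
    requirement: str
    test_case: str

def build_few_shot_prompt(examples: list, new_requirement: str) -> str:
    """Build a prompt with worked examples followed by the new requirement."""
    parts = ["You are a software test engineer. Write one test case per requirement."]
    for ex in examples:
        parts.append(f"Requirement: {ex.requirement}\nTest case: {ex.test_case}")
    # Leave the final test case blank so the model completes it.
    parts.append(f"Requirement: {new_requirement}\nTest case:")
    return "\n\n".join(parts)

examples = [
    Example(
        "The cart total must update when an item is removed.",
        "Add two items, remove one, and verify the total equals the remaining item's price.",
    ),
    Example(
        "Usernames must be unique.",
        "Register 'alice', then register 'alice' again and verify the second attempt is rejected.",
    ),
]

prompt = build_few_shot_prompt(examples, "Password reset links must expire after 24 hours.")
print(prompt)
```

The same pairs can later be reused as fine-tuning data, so effort spent on good examples is not lost when moving from prompting to training.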

Understanding LLM Sizes and Parameters

Understanding the scale of Large Language Models (LLMs) is essential, as their size significantly influences their ability to discern and generate complex patterns. Parameter counts vary widely and are not always published. LLaMA 2 was released in 7, 13, and 70 billion parameter versions. OpenAI has not disclosed GPT-4's parameter count, and Google has not published one for its largest Gemini models; GPT-4 is widely reported, though not officially confirmed, to have around 1.7 trillion parameters.

These models, often referred to as “heavyweight” due to their extensive parameter count, offer superior performance in various tasks, including software testing, by capturing more nuanced data patterns. On the other end of the spectrum, “lightweight” LLMs, which can range from 50 million to several billion parameters, can provide a more resource-efficient alternative while still delivering significant capabilities, albeit with some trade-offs in depth and nuance.

The physical storage size of these models, which can range from a few gigabytes to hundreds of gigabytes, is another critical aspect, directly impacting the computational resources required for their operation. This includes not just the storage space but also the processing power and memory needed to train and run these models effectively.
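The storage size of a model's weights can be estimated as the parameter count multiplied by the bytes used per parameter. The sketch below applies that arithmetic; it covers the weights only, and training needs considerably more memory for gradients, optimizer state, and activations. The function name and the precision table are chosen for this example.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for label, params in [("7B", 7e9), ("70B", 70e9)]:
    for precision in ("fp16", "int8"):
        size = weights_size_gb(params, precision)
        print(f"{label} model at {precision}: about {size:.0f} GB")
```

A 7 billion parameter model stored in 16-bit precision therefore needs about 14 GB for its weights, while a 70 billion parameter model needs about 140 GB, which already exceeds the memory of a single workstation GPU.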

Hardware Requirements 

Training and deploying a Large Language Model (LLM) demands significant computational resources: high-performance workstation GPUs, substantial RAM, and strong processing capabilities. The size of the model largely determines the infrastructure required. For instance, running a large model might call for four advanced workstation GPUs, which can push the investment above $100,000, depending on the specific GPU models selected.

As an alternative to buying physical hardware, cloud-based services offer a flexible and scalable solution, providing access to current hardware on a rental basis. The cost depends on the desired performance level and the amount of hardware used, with rental rates for cloud GPUs ranging from approximately $0.80 per hour for entry-level options to $5 per hour for the most advanced and current units.

In practice, the operating expenses for a modestly sized LLM with 7 to 15 billion parameters may approach $3,000 per month when run continuously on a cluster of GPUs. A larger LLM with 100 to 200 billion parameters could cost around $20,000 per month under the same conditions. These costs play a crucial role in planning an LLM deployment and underline the importance of balancing computational needs against budgetary constraints.
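Monthly figures like these follow from simple arithmetic: number of GPUs times hourly rate times hours in a month. The sketch below shows the calculation. The GPU counts and hourly rates in the examples are assumptions picked to land near the estimates above, not quotes from any provider.

```python
HOURS_PER_MONTH = 730  # average month: 365 days * 24 hours / 12 months

def monthly_gpu_cost(num_gpus: int, hourly_rate_usd: float, utilization: float = 1.0) -> float:
    """Rental cost in USD for a GPU cluster over one average month.

    utilization is the fraction of the month the cluster is running
    (1.0 means it runs continuously).
    """
    return num_gpus * hourly_rate_usd * HOURS_PER_MONTH * utilization

# Assumed configurations, for illustration only.
small = monthly_gpu_cost(num_gpus=4, hourly_rate_usd=1.00)
large = monthly_gpu_cost(num_gpus=8, hourly_rate_usd=3.50)
part_time = monthly_gpu_cost(num_gpus=4, hourly_rate_usd=1.00, utilization=0.25)

print(f"Small cluster, continuous:  ${small:,.0f} per month")
print(f"Large cluster, continuous:  ${large:,.0f} per month")
print(f"Small cluster, 25% of time: ${part_time:,.0f} per month")
```

The utilization parameter is worth attention: a cluster that only runs during test cycles costs a fraction of one that runs around the clock.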

Privacy and Data Security

Incorporating LLMs into software testing necessitates stringent privacy measures, especially when handling sensitive data. Techniques like data anonymization and secure data sharing protocols are indispensable for maintaining data integrity and adhering to privacy regulations.
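As a small example of data anonymization, the sketch below masks email addresses and IPv4 addresses in a log line before it is used for training or sent to a model. The patterns and placeholder tokens are assumptions for this illustration; a production pipeline would need to cover many more kinds of sensitive data.

```python
import re

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace email and IPv4 addresses with fixed placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP>", text)
    return text

log_line = "Login failed for jane.doe@example.com from 192.168.10.24"
print(anonymize(log_line))
# Login failed for <EMAIL> from <IP>
```

Using fixed placeholders keeps the structure of the log intact, so the model can still learn from the pattern of the failure without seeing who triggered it.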

Using APIs From Existing LLMs

For organizations where developing a proprietary Large Language Model (LLM) for software testing is impractical, using OpenAI’s APIs offers a practical alternative. OpenAI provides commercial access to its suite of LLMs, including the advanced GPT-4, which can be customized for specific domains at a cost ranging from $0.01 to $0.12 per 1,000 tokens prompted or sampled. This option grants access to a state-of-the-art LLM at comparatively affordable operating costs. It also allows LLM functionality to be built into testing processes without complex model training or infrastructure management. Test data still leaves the organization when it is sent to the API, so the anonymization and secure data sharing measures described above remain essential for sensitive testing operations.
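Per-token pricing can be turned into a budget estimate with a short calculation. In the sketch below, the request volume, token counts, and the two rates are assumptions chosen from within the price range quoted above; actual rates depend on the model and change over time.

```python
def api_cost_usd(prompt_tokens: int, completion_tokens: int,
                 prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Cost of one request when prompt and completion tokens are billed separately."""
    return (prompt_tokens / 1000 * prompt_rate_per_1k
            + completion_tokens / 1000 * completion_rate_per_1k)

# Assumed workload: 1,000 test-generation requests per day, each with
# 1,500 prompt tokens and 500 completion tokens.
per_request = api_cost_usd(1500, 500, prompt_rate_per_1k=0.01, completion_rate_per_1k=0.03)
per_month = per_request * 1000 * 30

print(f"Cost per request: ${per_request:.2f}")
print(f"Cost per 30-day month: ${per_month:,.2f}")
```

Under these assumptions a request costs about $0.03 and a month of use about $900, which can be compared directly with the cost of renting GPUs to host a model.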


The development of specialized LLMs for software testing represents a leap towards a future where software quality assurance is not just efficient but also aligned with the unique requirements of each project. In this rapidly evolving landscape, implementing customized LLMs in software testing is not merely promising but achievable with the methodologies available today, and this series follows the journey of building your own testing LLM.


