About this Digital Document
Develop an LLM evaluation guide, framework, and workshop to promote AI and prompt engineering literacy at Lehigh University.
The rapidly evolving AI landscape presents continuous advancements in performance, features, and functionality. Vendors frequently release new updates, often outpacing each other within a span of months. For instance, even within a single vendor like OpenAI, the release cadence is challenging to follow.
Although vendors provide performance charts for their models, these metrics may not accurately reflect real-world usage. This creates a challenge for faculty, researchers, and staff, such as Library & Technology Services, who must keep abreast of these changes to identify the best model for specific projects or to re-evaluate existing projects.
To address this challenge, it is crucial to establish a systematic approach for tracking and evaluating AI model updates and their practical applications. This will enable informed decision-making and ensure the optimal selection of AI models for various academic, research, and staff needs.
The proposed solution is to develop an LLM evaluation framework that can be applied to various models to document new features and to track performance and response quality. The accompanying tool should capture test prompts, run those prompts, collect analysis data, define quality metrics, and store the results for historical reference. It should support both human and LLM graders.
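The workflow described above (capture prompts, run them, collect data, score against metrics, store history) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual tool: every name here (`PromptCase`, `run_evaluation`, `contains_expected_grader`, the `results` table) is hypothetical, the model is represented as a plain callable rather than any vendor's API, and SQLite is only one possible choice of store.

```python
import json
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type aliases: a model maps a prompt to a response, and a
# grader maps (prompt, expected, response) to a score between 0 and 1.
ModelFn = Callable[[str], str]
Grader = Callable[[str, str, str], float]


@dataclass
class PromptCase:
    """One captured test prompt with a reference answer for graders."""
    prompt_id: str
    text: str
    expected: str


def contains_expected_grader(prompt: str, expected: str, response: str) -> float:
    """Toy automatic metric: 1.0 if the reference answer appears in the response.

    A human grader or an LLM grader could be plugged in as another callable
    with the same signature (assumption: scores are normalized to [0, 1]).
    """
    return 1.0 if expected.lower() in response.lower() else 0.0


def init_store(conn: sqlite3.Connection) -> None:
    """Create the table that keeps historical results."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               run_at REAL, model TEXT, prompt_id TEXT,
               response TEXT, latency_s REAL, scores TEXT)"""
    )


def run_evaluation(
    conn: sqlite3.Connection,
    model_name: str,
    model_fn: ModelFn,
    cases: List[PromptCase],
    graders: Dict[str, Grader],
) -> List[dict]:
    """Run each prompt, time it, score it with every grader, and store the row."""
    rows = []
    for case in cases:
        start = time.perf_counter()
        response = model_fn(case.text)
        latency = time.perf_counter() - start
        scores = {name: g(case.text, case.expected, response)
                  for name, g in graders.items()}
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), model_name, case.prompt_id,
             response, latency, json.dumps(scores)),
        )
        rows.append({"prompt_id": case.prompt_id, "response": response,
                     "latency_s": latency, "scores": scores})
    conn.commit()
    return rows


if __name__ == "__main__":
    # Stand-in model for demonstration only; a real run would wrap a vendor client.
    def fake_model(text: str) -> str:
        return "The answer is 4."

    connection = sqlite3.connect(":memory:")
    init_store(connection)
    demo_cases = [PromptCase("q1", "What is 2 + 2?", "4")]
    for row in run_evaluation(connection, "fake-model", fake_model, demo_cases,
                              {"contains_expected": contains_expected_grader}):
        print(row["prompt_id"], row["scores"])
```

Treating both models and graders as plain callables keeps the sketch vendor-neutral: re-running the same `PromptCase` list against a newly released model only requires passing a different `model_fn`, and the stored rows allow comparison with earlier runs.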
The deliverables include the following:
1. A process framework that provides guidelines for evaluating and selecting LLMs for academic, research, or general staff use.
2. A tool for developing test prompts and evaluating performance and response quality, including hallucinations.
3. Documentation on the framework and tool, along with prompt engineering and evaluation workshops offered as needed.
The goal is to assist the Lehigh community in evaluating LLMs in this rapidly changing environment so that its members can make informed, optimal selections for their projects or usage. In addition, the project aims to promote a deeper understanding of prompt engineering, a crucial skill for effectively interacting with AI models.