The Kirkpatrick Model of Evaluation, first developed by Donald Kirkpatrick in 1959, is the most popular model for evaluating the effectiveness of a training program. The model includes four levels of evaluation, and as such is sometimes referred to as "Kirkpatrick's levels" or the "four levels."
This article explores each level of Kirkpatrick's model and includes real-world examples so that you can see how the model is applied.
If at any point you have questions or would like to discuss the model with practitioners, then feel free to join my eLearning + instructional design Slack channel and ask away.
The Kirkpatrick Model of Evaluation is a popular approach to evaluating training programs. Although the model focuses on training programs specifically, it is broad enough to encompass any type of program evaluation.
For all practical purposes, though, training practitioners use the model to evaluate training programs and instructional design initiatives. It covers four distinct levels of evaluation: reaction, learning, behavior, and results.
As you move from levels 1 through 4, the evaluation techniques become increasingly complex and the data generated becomes increasingly valuable.
Because of this increasing complexity at levels 3 and 4 of the Kirkpatrick model, many training professionals and departments confine their evaluation efforts to levels 1 and 2. This leaves the most valuable data off the table, which can derail many well-intentioned evaluation efforts.
Finally, if you are a training professional, you may want to memorize each level of the model and what it entails; many practitioners will refer to evaluation activities by their level in the Kirkpatrick model.
If you're in the position where you need to evaluate a training program, you should also familiarize yourself with the techniques that we'll discuss throughout the article.
Now it's time to dive into the specifics of each level in the Kirkpatrick Model.
We move from level 1 to level 4 in this section, but it's important to note that these levels should be considered in reverse as you're developing your evaluation strategy. We address this further in the 'How to Use the Kirkpatrick Model' section.
Reaction data captures the participants' reaction to the training experience. Specifically, it refers to how satisfying, engaging, and relevant they find the experience.
This is the most common type of evaluation that departments carry out today. Training practitioners often hand out 'smile sheets' (or 'happy sheets') to participants at the end of a workshop or eLearning experience. Participants rate, on a scale of 1-5, how satisfying, relevant, and engaging they found the experience.
Level 1 data tells you how the participants feel about the experience, but this data is the least useful for maximizing the impact of the training program.
The purpose of corporate training is to improve employee performance, so while an indication that employees are enjoying the training experience may be nice, it does not tell us whether or not we are achieving our performance goal or helping the business.
With that being said, efforts to create a satisfying, enjoyable, and relevant training experience are worthwhile, but this level of evaluation requires the least time and budget. The bulk of the effort should be devoted to levels 2, 3, and 4.
As discussed above, the most common way to conduct level 1 evaluation is to administer a short survey at the conclusion of a training experience. If it's an in-person experience, then this may be conducted via a paper handout, a short interview with the facilitator, or an online survey via an email follow-up.
If the training experience is online, then you can deliver the survey via email, build it directly into the eLearning experience, or create the survey in the Learning Management System (LMS) itself.
Common survey tools for training evaluation are Questionmark and SurveyMonkey.
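If you collect these ratings digitally, summarizing them takes only a few lines of code. Below is a minimal sketch in Python that averages 1-to-5 ratings per question; the question names and sample responses are hypothetical, not data from any particular survey tool.

```python
# Minimal sketch: summarizing Level 1 ("reaction") survey data.
# The question keys and ratings below are hypothetical examples.
from statistics import mean

responses = [
    {"relevance": 4, "engagement": 5, "satisfaction": 4},
    {"relevance": 3, "engagement": 4, "satisfaction": 4},
    {"relevance": 5, "engagement": 5, "satisfaction": 5},
]

for question in ("relevance", "engagement", "satisfaction"):
    average = mean(r[question] for r in responses)
    print(f"{question}: {average:.1f} / 5")
```

Even a simple summary like this makes it easier to spot which aspect of the experience (relevance, engagement, or satisfaction) needs attention.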
Let's consider two real-life scenarios where evaluation would be necessary: a call center rolling out new screen sharing software to its agents, and a coffee company training roastery staff to clean its industrial roasting machines.
In the call center example, imagine a facilitator hosting a one-hour webinar that teaches the agents when to use screen sharing, how to initiate a screen sharing session, and how to explain the legal disclaimers. They split the group into breakout sessions at the end to practice.
At the conclusion of the experience, participants are given an online survey and asked to rate, on a scale of 1 to 5, how relevant they found the training to their jobs, how engaging they found the training, and how satisfied they are with what they learned. There's also a question or two about whether they would recommend the training to a colleague and whether they're confident that they can use screen sharing on calls with live customers.
In the coffee roasting example, imagine a facilitator delivering a live workshop on-site at a regional coffee roastery. He teaches the staff how to clean the machine, showing each step of the cleaning process and providing hands-on practice opportunities.
Once the workshop is complete and the facilitator leaves, the manager at the roastery asks his employees how satisfied they were with the training, whether they were engaged, and whether they're confident that they can apply what they learned to their jobs. He records some of the responses and follows up with the facilitator to provide feedback.
In both of these examples, efforts are made to collect data about how the participants initially react to the training event; this data can be used to make decisions about how to best deliver the training, but it is the least valuable data when it comes to making important decisions about how to revise the training.
For example, if you find that the call center agents do not find the screen sharing training relevant to their jobs, you would want to ask additional questions to determine why this is the case. Addressing concerns such as this in the training experience itself may provide a much better experience to the participants.
Learning data tells us whether or not the people who take the training have learned anything. Specifically, it helps you answer the question: "Did the training program help participants learn the desired knowledge, skills, or attitudes?"
Level 2 evaluation is an integral part of most training experiences. Assessment is a cornerstone of training design: think multiple choice quizzes and final exams.
This data is often used to decide whether the participant should receive credit for the course; for example, many eLearning assessments require learners to score 80% or above to receive credit, and many licensing programs have a final test that you are required to pass.
While not always practical or cost-efficient, pre-tests are the best way to establish a baseline for your training participants. When you assess people's knowledge and skills both before and after a training experience, you can see much more clearly which improvements were due to the training experience.
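As a simple illustration of both points, the sketch below compares hypothetical pre-test and post-test scores and applies the 80% passing threshold mentioned above; the participants, scores, and data structure are illustrative only, not a prescribed implementation.

```python
# Minimal sketch: comparing pre-test and post-test scores (Level 2)
# and applying an 80% pass threshold. All data is hypothetical.
PASS_THRESHOLD = 80  # percent

scores = {
    "participant_01": {"pre": 55, "post": 90},
    "participant_02": {"pre": 70, "post": 75},
    "participant_03": {"pre": 60, "post": 85},
}

for participant, s in scores.items():
    gain = s["post"] - s["pre"]
    passed = s["post"] >= PASS_THRESHOLD
    print(f"{participant}: gain {gain:+d} points, "
          f"{'passed' if passed else 'did not pass'}")
```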
While written or computer-based assessments are the most common approach to collecting learning data, you can also measure learning by conducting interviews or observation.
For example, if you are teaching new drivers how to change a tire, you can measure learning by asking them to change a tire in front of you; if they are able to do so successfully, then that speaks to the success of the program; if they are not able to change the tire, then you may ask follow-up questions to uncover roadblocks and improve your training program as needed.
However, if you are measuring knowledge or a cognitive skill, then a multiple choice quiz or written assessment may be sufficient. This is only effective when the questions align closely with the learning objectives and the content itself. If the questions are faulty, then the data they generate may cause you to make unnecessary or counterproductive changes to the program.
Carrying the examples from the previous section forward, let's consider what level 2 evaluation would look like for each of them.
For the screen sharing example, imagine a role play practice activity. Groups are in their breakout rooms and a facilitator is observing to conduct level 2 evaluation. He wants to determine if groups are following the screen-sharing process correctly.
A more formal level 2 evaluation may consist of each participant following up with their supervisor; the supervisor asks them to correctly demonstrate the screen sharing process and then proceeds to role play as a customer. This would measure whether the agents have the necessary skills.
The trainers may also deliver a formal, 10-question multiple choice assessment to measure the knowledge associated with the new screen sharing process. They may even require the agents to score 80% on this quiz to receive their screen sharing certification, and the agents are not allowed to screen share with customers until they pass this assessment.
In the industrial coffee roasting example, a strong level 2 assessment would be to ask each participant to properly clean the machine while being observed by the facilitator or a supervisor. Again, a written assessment can be used to assess the knowledge or cognitive skills, but physical skills are best measured via observation.
As we move into Kirkpatrick's third level of evaluation, we move into the high-value evaluation data that helps us make informed improvements to the training program.
Level 3 evaluation data tells us whether or not people are behaving differently on the job as a consequence of the training program. Since the purpose of corporate training is to improve performance and produce measurable results for a business, this is the first level where we are seeing whether or not our training efforts are successful.
While this data is valuable, it is also more difficult to collect than that in the first two levels of the model. On-the-job measures are necessary for determining whether or not behavior has changed as a result of the training.
Reviewing performance metrics, observing employees directly, and conducting performance reviews are the most common ways to determine whether on-the-job performance has improved.
As far as metrics are concerned, it's best to use a metric that's already being tracked automatically (for example, customer satisfaction rating, sales numbers, etc.). If no relevant metrics are being tracked, then it may be worth the effort to institute software or a system that can track them.
However, if no metrics are being tracked and there is no budget available to do so, supervisor reviews or annual performance reports may be used to measure the on-the-job performance changes that result from a training experience.
Since these reviews are usually general in nature and only conducted a handful of times per year, they are not particularly effective at measuring on-the-job behavior change as a result of a specific training intervention. Therefore, intentional observation tied to the desired results of the training program should be conducted in these cases to adequately measure performance improvement.
When level 3 evaluation is given proper consideration, the approach may include regular on-the-job observation, a review of relevant metrics, and performance review data.
Bringing our previous examples into a level 3 evaluation, let's begin with the call center. With the roll-out of the new system, the software developers integrated the screen sharing software with the performance management software; this tracks whether a screen sharing session was initiated on each call.
Now, after taking the screen sharing training and passing the final test, call center agents begin initiating screen sharing sessions with customers. Every time this is done, a record is available for the supervisor to review.
On-the-job behavior change can now be viewed as a simple metric: the percentage of calls on which an agent initiates a screen sharing session. If this percentage is high for the participants who completed the training, then training designers can judge the success of their initiative accordingly. If the percentage is low, then follow-up conversations can be held to identify difficulties and modify the training program as needed.
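For illustration, here is a minimal sketch of how such a metric might be computed from call records; the record format and field names are assumptions for this example, not the actual system described above.

```python
# Minimal sketch: Level 3 metric, the percentage of calls on which an agent
# initiated a screen sharing session. The call records are hypothetical.
calls = [
    {"agent": "agent_01", "screen_share": True},
    {"agent": "agent_01", "screen_share": False},
    {"agent": "agent_02", "screen_share": True},
    {"agent": "agent_02", "screen_share": True},
]

by_agent = {}
for call in calls:
    totals = by_agent.setdefault(call["agent"], {"calls": 0, "shares": 0})
    totals["calls"] += 1
    totals["shares"] += int(call["screen_share"])

for agent, t in by_agent.items():
    rate = 100 * t["shares"] / t["calls"]
    print(f"{agent}: screen sharing initiated on {rate:.0f}% of calls")
```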
In the coffee roasting example, the training provider is most interested in whether or not their workshop on how to clean the machines is effective. Supervisors at the coffee roasteries check the machines every day to determine how clean they are, and they send weekly reports to the training providers.
When the machines are not clean, the supervisors follow up with the staff members who were supposed to clean them; this identifies potential roadblocks and helps the training providers better address them during the training experience.
Level 4 data is the most valuable data covered by the Kirkpatrick model; it measures how the training program contributes to the success of the organization as a whole. This refers to the organizational results themselves, such as sales, customer satisfaction ratings, and even return on investment (ROI). (In some spinoffs of the Kirkpatrick model, ROI is included as a fifth level, but there is no reason why level 4 cannot include this organizational result as well.)
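For reference, a common way to express ROI is the net monetary benefit of the program divided by its cost. The sketch below uses hypothetical benefit and cost figures purely to show the arithmetic.

```python
# Minimal sketch: a common way to compute training ROI.
# The benefit and cost figures are hypothetical.
program_benefits = 250_000  # estimated monetary benefit attributed to the training
program_costs = 100_000     # total cost of designing and delivering the training

roi_percent = (program_benefits - program_costs) / program_costs * 100
print(f"ROI: {roi_percent:.0f}%")  # prints "ROI: 150%"
```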
Many training practitioners skip level 4 evaluation. Organizations do not devote the time or budget necessary to measure these results, and as a consequence, decisions about training design and delivery are made without all of the information necessary to know whether it's a good investment.
By devoting the necessary time and energy to a level 4 evaluation, you can make informed decisions about whether the training budget is working for or against the organization you support.
Similar to level 3 evaluation, metrics play an important part in level 4, too. At this level, however, you want to look at metrics that are important to the organization as a whole (such as sales numbers, customer satisfaction rating, and turnover rate).
If you find that people who complete a training initiative produce better metrics than their peers who have not completed the training, then you can draw powerful conclusions about the initiative's success.
A great way to generate valuable data at this level is to work with a control group. Take two groups who have as many factors in common as possible, then put one group through the training experience. Watch how the data generated by each group compares; use this to improve the training experience in a way that will be meaningful to the business.
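As a rough sketch of the idea, and not a full statistical analysis, you might compare the average of a business metric such as customer satisfaction between the trained group and the control group; the figures below are hypothetical.

```python
# Minimal sketch: comparing a Level 4 metric (customer satisfaction, 1-10)
# between a trained group and a control group. The data is hypothetical, and
# a real analysis would also test whether the difference is statistically
# significant before attributing it to the training.
from statistics import mean

trained = [8.1, 7.9, 8.4, 8.6, 7.8]
control = [7.2, 7.5, 7.1, 7.6, 7.3]

difference = mean(trained) - mean(control)
print(f"Trained group average: {mean(trained):.2f}")
print(f"Control group average: {mean(control):.2f}")
print(f"Difference potentially attributable to training: {difference:+.2f}")
```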
Again, level 4 evaluation is the most demanding and complex — using control groups is expensive and not always feasible. There are also many ways to measure ROI, and the best models will still require a high degree of effort without a high degree of certainty (depending on the situation).
Despite this complexity, level 4 data is by far the most valuable. This level of data tells you whether your training initiatives are doing anything for the business. If the training initiatives are contributing to measurable results, then the value produced by the efforts will be clear. If they are not, then the business may be better off without the training.
In our call center example, the primary metric the training evaluators look to is customer satisfaction rating. They decided to focus on this screen sharing initiative because they wanted to provide a better customer experience.
If they see that the customer satisfaction rating is higher on calls with agents who have successfully passed the screen sharing training, then they may draw conclusions about how the training program contributes to the organization's success.
For the coffee roastery example, managers at the regional roasteries are keeping a close eye on their yields from the new machines. When the machines are clean, fewer coffee beans are burnt.
As managers see higher yields from the roast masters who have completed the training, they can draw conclusions about the return that the training is producing for their business.
Now that we've explored each level of Kirkpatrick's model and carried through a couple of examples, we can take a big-picture approach to a training evaluation need.
Consider this: a large telecommunications company is rolling out a new product nationwide. They want to ensure that their sales teams can speak to the product's features and match them to customers' needs, the key tasks associated with selling the product effectively.
An average instructional designer may jump directly into designing and developing a training program. However, one who is well-versed in training evaluation and accountable for the initiative's success would take a step back.
From the outset of an initiative like this, it is worthwhile to consider training evaluation. Always start at level 4: what organizational results are we trying to produce with this initiative?
In this example, the organization is likely trying to drive sales. They have a new product and they want to sell it. Let's say that they have a specific sales goal: sell 800,000 units of this product within the first year of its launch.
Now the training team or department knows what to hold itself accountable to.
From there, we consider level 3. What on-the-job behaviors do sales representatives need to demonstrate in order to contribute to the sales goals? Working with a subject matter expert (SME) and key business stakeholders, we identify a list of behaviors that representatives would need to exhibit.
Now we move down to level 2. What knowledge and skills do employees need to learn to ensure that they can perform as desired on the job? We can assess their current knowledge and skill using surveys and pre-tests, and then work with our SMEs to narrow down the learning objectives even further.
Finally, we consider level 1. How should we design and deliver this training to ensure that the participants enjoy it, find it relevant to their jobs, and feel confident once the training is complete?
You can also identify the evaluation techniques that you will use at each level during this planning phase. You can map out exactly how you will evaluate the program's success before doing any design or development, and doing so will help you stay focused on, and accountable to, the highest-level goals.
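One lightweight way to capture such a plan is a simple mapping from each level to the techniques and measures you intend to use. The sketch below is a hypothetical plan for the product-launch example, not a prescribed format; the specific techniques listed are assumptions for illustration.

```python
# Minimal sketch: a hypothetical evaluation plan for the product-launch
# training, mapping each Kirkpatrick level to planned techniques and measures.
evaluation_plan = {
    "level_4_results": ["units sold vs. the 800,000 first-year target"],
    "level_3_behavior": ["manager observation of sales conversations",
                         "review of relevant sales-activity metrics"],
    "level_2_learning": ["pre-test and post-test scores", "role-play assessment"],
    "level_1_reaction": ["post-training survey: relevance, engagement, satisfaction"],
}

for level, techniques in evaluation_plan.items():
    print(level, "->", "; ".join(techniques))
```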
When it comes down to it, Kirkpatrick helps us do two things: understand our people and understand our business. What do our employees want? What are their anxieties? What's holding them back from performing as well as they could?
As far as the business is concerned, Kirkpatrick's model helps us identify how training efforts are contributing to the business's success. This is an imperative and too-often overlooked part of training design. If the training initiatives do not help the business, then there may not be sufficient reason for them to exist in the first place.
If you'd like to discuss evaluation strategy further or dive deeper into Kirkpatrick's model with other practitioners, then feel free to join the ID community.