Student rating of teaching is essential for attaining and maintaining high educational quality. A quality improvement system, SKURT, based on digital online weekly combined quantitative (ten-graded scale) and qualitative (open-ended free text) group feedback from medical students, was developed. Students rated all educational, non-clerkship items throughout the entire medical program, spanning eleven terms. The results were semi-publicly available to students and faculty at a Swedish university. This study describes data from five years' use of the system, focusing on how the use of SKURT influenced the educational items found to be in the most substantial need of improvement.
A statistically, but hardly practically, significant improvement in average feedback grade was found during the observation period (average 7.07 in 2009 to 7.24 in 2013, p<0.001). The medical program had already in 2007 been recognized as a center of excellent quality in higher education. When the 18 lectures with the lowest outcome in the spring of 2009 were compared with the fall of 2013, five had been discontinued. The remaining 13 lectures improved significantly (p<0.001), by 116%, from 2.94 (SD 0.92) to 6.34 (SD 2.58).
A weekly group feedback system employing the principles used in SKURT is useful for improving the quality of medical education, particularly by improving the items with the lowest ratings.
Keywords: Medical Education; Online Evaluation; Problem-based Learning; Quality Improvement; Rating of Teachers; Student Evaluation
A quality improvement system, SKURT, has been in use since 2008 at a Swedish university. SKURT is based on digital online weekly combined quantitative (ten-graded scale) and qualitative (open-ended free text) group feedback from medical students. Students rated all educational, non-clerkship items throughout the entire medical program, spanning eleven terms. The results were semi-publicly available to students and faculty. The system was created to guide formative ratings, quality enhancement, and educational decisions.
In this paper, we describe the data from, and consequences of, the use of the system during the five-year period 2009-2013. In a first paper, published simultaneously, we describe the philosophy, technical solutions and practical application of SKURT.
2.1. General Considerations
SKURT was used to gather student ratings of non-clerkship educational items, including but not limited to lectures, seminars, group activities and information sessions. Clerkship sessions are practical training in wards, primary care, etc.
Students progress through terms, which means that new student cohorts rate recurring items each term. The 5.5-year medical program implies that a large proportion of the students were the same during the analyzed period, but rated educational items in different terms. Term 6 was an elective period focused on a research project, without educational items to be rated in SKURT.
Teachers could access their individual ratings either through an individual report page or through administrators e-mailing them rating data for lectures via a built-in mass e-mailing or individual item-bound e-mailing function. The individual report page was launched in November 2010, and information about accessing it and its functionality was e-mailed to teachers shortly after launch. Only a single reminder of the functionality was e-mailed to the registered teachers, in November 2012.
In the clinical terms, two lecture weeks are separated by four weeks of clinical rotations. The last tutorial group session could therefore precede some educational items of the last lecture week. Students would then have needed to complete the ratings at the next tutorial group session, after the clinical rotation. This led to a longer time period between item date and rating date for some items.
2.2. Ethical Considerations
Dealing with feedback is fraught with ethical dilemmas, especially when a component of grading is included. The SKURT feedback was intended to focus on the form and content of the educational activity, and not on aspects of the teachers' personality. The students were informed about this and received feedback on the issue when needed. All feedback was screened before publication.
Before analyzing the data for the present study, all data were anonymized.
The initially chosen database structure of SKURT, and its practical use, limited to some extent the current general analysis of the data. The scheduling functionality, added post-launch, diluted the items meant for rating with items serving mainly scheduling purposes. It was also an issue that the terms did not implement the scheduling function at the same time. A lack of clear guidelines, of adherence to guidelines, and of standardization resulted in disparate item entry, rating and grouping at both the term administrator and the student level. Some tutorial groups rated, without uniformity, items that were not meant for rating. Group activity classes were not sufficiently grouped when scheduled, resulting in inhomogeneous rating practice. These shortcomings represent confounding factors which risk underestimating, for example, response rates when calculated at an aggregated level. Lectures were therefore analyzed separately where relevant, as lecture rating and improvement was the main purpose of SKURT at launch and lectures were the item type with the most homogeneous rating practice.
All items, ratings and logs between 2009-01-01 and 2013-12-31 were exported to Microsoft Excel 2000 format from the MySQL database using phpMyAdmin. Items were screened for obvious erroneous entries. A total of 410 items were removed: items marked as copied and hidden with a date before the start of the semester (indicating items copied but not placed in the new semester), items hidden in the schedule without ratings, and items lacking either title or teacher and without ratings.

2.5. Excel Functions
All statistics and comparisons were calculated using the standard library of functions in Microsoft Excel for Mac 2011.
2.6. Quality Improvement on Item Level
All lectures with an average grade below 4.0 points in the spring of 2009 were selected and manually tracked over the five-year period.
Mean and standard deviation or median and quartiles were used as measures of central tendency and variation, respectively, as appropriate. Two-sided unpaired Student’s t-test was used for significance testing in general. Two-sided paired Student’s t-test was used when comparing quality improvement on item level.
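As an illustration, the two significance tests named above can be run in a few lines of Python with SciPy; the study itself used Excel's built-in functions, and all sample values below are invented for demonstration only.

```python
from scipy import stats

# Two-sided unpaired t-test: e.g. comparing rating grades between two
# independent sets of ratings (values are illustrative, not study data).
grades_2009 = [7.0, 6.5, 7.4, 7.1, 6.9, 7.3]
grades_2013 = [7.2, 7.5, 7.0, 7.6, 7.3, 7.4]
t_unpaired, p_unpaired = stats.ttest_ind(grades_2009, grades_2013)

# Two-sided paired t-test: e.g. the same lectures' average grades in the
# spring of 2009 versus the fall of 2013, with pairs matched by lecture.
spring_2009 = [2.9, 3.1, 2.5, 3.4, 2.8]
fall_2013 = [6.3, 6.9, 5.8, 7.1, 6.0]
t_paired, p_paired = stats.ttest_rel(spring_2009, fall_2013)
```

The paired test is the appropriate choice for tracked lectures, since each spring value is matched to the same lecture's fall value.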
3.1. General Results
Items spanned from 2009-01-19 to 2013-12-20; 13 684 educational items and 71 883 ratings were registered during the period. 37% (5 105) of items were lectures, 35% (4 835) group activities, 12% (1 659) tutorial groups and 15% (2 085) other item types.
68% (49 181) of all ratings had open-ended feedback and 64% (45 936) had a grade. 12% (8 913) of ratings were open-ended feedback without a grade, whilst 8% (5 668) were graded ratings without feedback. 98% (5 013) of the 5 105 lectures had at least one rating, and 85% (31 090) of lecture ratings had open-ended feedback.
Rated lectures had an average of 7.34 ratings, with a positive trend in the number of ratings per lecture from 6.09 to 8.43, correlating with a 27% increase in the number of medical students and tutorial groups (722 to 919 students) during the period.
The number of tutorial groups spanned from 8 to 15 in the fall of 2013, and the response rate for lectures ranged from 63% to 91%, with an average across all terms of 77% (Figure 1).
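For clarity, the response rate for a lecture is here taken as the share of active tutorial groups that submitted a rating; a minimal sketch (the function name and numbers are illustrative, not from the study):

```python
def response_rate(groups_that_rated: int, groups_in_term: int) -> float:
    """Response rate as a percentage of active tutorial groups."""
    if groups_in_term <= 0:
        raise ValueError("term must have at least one tutorial group")
    return 100 * groups_that_rated / groups_in_term

# For example, 10 of 13 tutorial groups rating a lecture gives roughly 77%.
rate = response_rate(10, 13)
```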
The average open-ended feedback was 105 characters (10th percentile 16, 1st quartile 40, median 81, 3rd quartile 143, 90th percentile 220). The longest feedback was 3 326 characters.
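Percentile summaries of this kind can be reproduced for any list of feedback lengths with NumPy; the values below are illustrative only, not the study data.

```python
import numpy as np

# Character lengths of open-ended feedback responses (illustrative only).
lengths = np.array([12, 16, 40, 60, 81, 95, 130, 143, 220, 300])

summary = {
    "mean": lengths.mean(),
    "10th percentile": np.percentile(lengths, 10),
    "1st quartile": np.percentile(lengths, 25),
    "median": np.percentile(lengths, 50),
    "3rd quartile": np.percentile(lengths, 75),
    "90th percentile": np.percentile(lengths, 90),
    "max": lengths.max(),
}
```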
Positive ratings, with a grade over 5 points, constituted the overwhelming majority of ratings (Figure 2). The average grade per rating was 7.14 (SD 1.73). There was a statistically, but perhaps not practically, significant increase from an average of 7.07 in 2009 to 7.24 in 2013 (p<0.001).
512 of the 1 353 educational items from the spring of 2009 were found with an exactly matching title in the fall of 2013. The remaining 972 educational items in the fall of 2013 were new, revised or renamed.
The average number of days from item date to rating date was 5.62 (SD 3.44), with 10th percentile 0, 1st quartile 1, median 3, 3rd quartile 7 and 90th percentile 12 days.
3.2. Screening of Qualitative Feedback
40 (0.06%) feedback responses were edited during the period, and all were marked as screened and published by at least one term administrator. In five of the terms no editing had taken place, and in the remaining five terms 4-15 changes were made.
A manual analysis of all edited feedback responses revealed that 21 were merely spelling corrections, 3 were judged as improper editing, and the remaining 16 (0.02%) changes were considered in line with the main purpose of the functionality.
629 teachers were assigned at least one educational item. 54% (338) had logged in at least once to their individual rating report page. 1 880 logins were registered during the period, yielding an average of 5.6 logins per logged-in teacher over the six semesters after the functionality launch.
10 519 e-mails were sent to 611 unique e-mail addresses belonging to teachers and administrators. 492 of the teachers assigned to an item during the period received at least one e-mail summary of students' ratings.
85% (533) of the teachers with educational items during the period had either received an e-mail summary, logged in to their individual page, or both.
25 lectures had an average grade of less than 4.0 points in the spring of 2009. Three received a low grade because of being cancelled or called off, whilst four were actually not lectures, and none of these recurred in later semesters. The remaining 18 lectures evolved as depicted in Figure 3.
The 18 lectures were distributed amongst all terms except terms 6 and 10. 92% (943) of their ratings included open-ended feedback, with an average length of 149 characters (10th percentile 34, 1st quartile 71, median 125, 3rd quartile 198, 90th percentile 285 and maximum 1 425 characters).
Lecture 1 had the lowest starting grade. Feedback led to an increase in average grade. The feedback included recommendations on how to improve and update the handouts, requests for areas to be clarified, and a request for a summary of the most important take-home messages related to the plans of study, including clinical context.
Lecture 2 had an average grade of 3.35 for the first three semesters. When the teacher was replaced, bringing a new structure and new lecture content, the grade increased to an average of 5.94 over the following seven semesters. The lecture was a combined lecture and introduction to a seminar assignment. Initial criticism pointed out a low proportion of real lecture content in comparison to seminar information; this criticism did not recur in later semesters even though the lecture title still included the introduction.
Lecture 3 initially received low grades, with feedback on shortcomings in tutoring style, lecture content and handouts. An additional teacher backing up the first was added, and the shared teacher responsibility improved the grades slightly. The lecture improved more after recruiting a new pair of teachers. For the last four semesters only one of them was assigned the lecture, receiving positive student feedback and an average grade of 7.76. Feedback included appreciation of a new and revised handout, clinical examples and the overall content.
Feedback on the first version of lecture 4 included a request for a more clinical and practical, rather than solely research, perspective. The opinion was swiftly adopted, and the assigned teacher was changed from a pre-clinical researcher to a clinical physician better suited to aiding the students in attaining the learning goals. The initial grade of 2 increased to 9 the following semester, and the high grades were maintained with an average of 8.55.
The same teacher as in lecture 1 also improved the average grade for lecture 5. Here too, students initially requested an updated handout and a clinical perspective on the teaching. The increase in average grade from 2.5 to 7.5 was accompanied, in later semesters, by compliments on an appreciated handout. The dip for lecture 5 in the fall of 2013 was due to a change of teacher that specific semester.
Lecture 6 improved from an initial grade of 2.5 to an average of 7.08 in the following semesters. Feedback included that the lecture felt overloaded with information in too short a timeframe, and students requested either an extension of time or an abbreviation of the lecture with more focus on term-specific goals.
Lecture 7 was, after six semesters with an average grade of 4.89, discontinued for three semesters and recurred in another term where it was expected to be better suited. The lecture was then, mistakenly, held for a class that had already had the lecture three terms earlier, as stated by a majority of the feedback. The revised lecture nonetheless received a slightly higher average grade of 5.67.
Lecture 8 was not repeated after the spring of 2009. Feedback was not useful, as only a minority of the groups included either feedback or a grade. Two of the feedback responses stated that the students did not recall the lecture, which could indicate that the lecture was not held.
Lecture 9 had four semesters of low grades with an average of 2.8. Students gave feedback on the teaching style and a feeling of too extensive, shallow information in too short a time, and questioned the clinical applicability. Some students suggested a change of teacher, and the following semester a new teacher was assigned the lecture. The grade rose over the coming six semesters to an average of 7.7, whilst keeping the same lecture title and purpose.
Lecture 10 had six consecutive semesters with an average grade of 3.61, with the assigned teacher lecturing both alone and together with colleagues. After a change of teacher, the average grade increased to 7.53 for the last four semesters of the period. Students initially gave feedback on the difficulty of the subject, the lack of an introduction before dealing with difficult concepts, and the teacher's pedagogic skills. The increase in average grade was accompanied by descriptions of the lecture as pedagogical, structured and thorough.
Lecture 11 and lecture 12 were rated as irrelevant, unnecessary, lacking new information and misplaced in the term. Both were discontinued after three semesters.
Lecture 13 had a shared teacher design for the first five semesters. Student feedback included that the time allocated was too long and that coordination between the teachers could have been better, as some identical information and cases were repeated during the same lecture. The lecture was split into two separate lectures for the last five semesters, one keeping the title but with only one of the teachers. Students still found that the, now two, lectures were overlapping, and one feedback response even suggested that the two lectures could perhaps be combined into one. Only a slight improvement in average grade was noted.
Lecture 14 was discontinued after five semesters, as students repeatedly gave feedback that the lecture title was not in line with its content and that the lecture lacked pedagogical properties and clinical context.
Lecture 15 improved over time; initial feedback from students included a lack of time and some factual errors in the handouts. The lecture was split into two parts the last semester, and the average grade increased slightly from 5.04 in the previous semesters to 6.7 (part 1) and 5.0 (part 2), respectively.
Three different teachers were involved in lecture 16 during the first five semesters, attaining an average grade of 4.19. Students gave feedback on the lecture layout and presentation style, and expressed feelings that the lecture and teacher ran the risk of becoming uninspiring. The delicate feedback was, however, combined with several constructive and detailed suggestions for improvement. The average open-ended feedback was 303 characters the first three semesters, and the response rate was 93% (100%, 80% and 100%, respectively). A new teacher was assigned the lecture in the fall of 2011, and the lecture thereafter received a stable average grade of 7.6. Students gave feedback that the lecture was valuable, informative, pedagogical and clarifying of an otherwise difficult subject.
Students gave feedback on a lack of time and an unordered lecture layout and handouts for lecture 17. A lack of time was still experienced by some students in later ratings, but the handouts and structure were no longer stated as areas of improvement in the feedback. The lecture was called off two semesters because of unrelated circumstances. Nonetheless, the average grade increased over time.
Lecture 18 improved drastically to a stable high grade. Students initially gave feedback that the lecture came too early in the term, that they lacked the necessary prerequisites, and that the lecture was therefore too difficult to comprehend. In the following terms, the lecture was moved about a month later in the term, thereafter receiving an average of 8.92.
In summary, all recurring lectures improved, and 13 were still part of the curriculum in the fall of 2013. An increase in average grade was associated with a change of teacher in six (46%) of the lectures.
The average grade for the 18 lectures in the spring of 2009 improved from 3.01 (SD 0.89) to 6.34 (SD 1.56) for the remaining 13 lectures in the fall of 2013 (Figure 3). The 116% improvement of the 13 items, from 2.94 (SD 0.92) in the spring of 2009 to 6.34 (SD 2.58) in the fall of 2013, was significant (p<0.001).
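As a sanity check, the 116% figure follows directly from the two reported means as a relative change:

```python
# Relative improvement between the two mean grades for the 13 lectures.
before, after = 2.94, 6.34
improvement_pct = (after - before) / before * 100  # about 115.6, reported rounded as 116%
```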
Our results indicate that a weekly group feedback system employing the principles of SKURT can be used to improve the quality of medical education, particularly by improving the items with the most substantial need of improvement as perceived by the students.
At an aggregated level, the already high average grade did improve statistically, but hardly practically, over the five-year period. The high starting average grade (the medical program was already in 2007 recognized as a center of excellent quality in higher education) and the fact that a large number of items were new, revised or renamed in the fall of 2013 compared with the spring of 2009 could contribute to the lack of a larger general improvement. A negative correlation between class size and rating results has previously been noted, but was not evident in the current data.
The educational items receiving the lowest grades in the spring of 2009 improved significantly and practically, aided by the use of SKURT. Average feedback grades more than doubled, and students' narrative feedback included improvement-focused responses with direct, tangible recommendations that were adopted. Lectures receiving low feedback grades had high feedback rates and longer feedback lengths, indicating that students were contributing to improving educational items, which is in line with previous research. When criticism was harsher, the responses were also longer and more elaborate. Five of the low-graded items were removed from the curriculum, and new, revised or renamed items were added instead.
A lack of standardization regarding content and nomenclature is a weakness of the current database structure. It diluted the data with items and ratings not in line with the main purpose of the quality improvement system. Group activities were scheduled and rated in disparate ways in different terms and semesters. Ratings were sometimes unrelated to teacher and item, such as criticism aimed at administrators when items were called off.
The minuscule number of edited feedback responses, even though all items were marked as screened, could indicate that students generally provided feedback focused not on the teachers' personality but on substantial issues. The range (0 to 15) in the number of edited ratings per term, however, indicates inhomogeneous use of the function, and probably that some of the administrators did not screen thoroughly enough or did not feel that they had permission to edit responses. To further improve feedback quality, the new version of SKURT includes a clearly visible summary of the feedback model, as summarized by Ovando, shown when students rate. A guideline for administrators' permissions and responsibilities when screening and editing feedback is under development. A report button in association with each feedback response could further help identify feedback not following the guidelines, as unconstructive comments are a risk noted with student ratings.
The possibility of assigning new teachers to educational items suffering from low grades represents a powerful tool for improvement. In other curricula, commonly based on a few teachers assigned to week-long courses, a change of teacher is unlikely and seldom possible. In that case, application of the principles SKURT is built on could instead guide teachers' professional development and improve practical class pedagogical methods, as was seen in several of the low-graded items, with timely feedback given continuously during the course, enabling a recurrent open feedback loop with continuous quality improvement [12-14].
The mean duration of time from the educational activity until feedback is given indicates that SKURT is used as regularly as intended. The short time period ensures an effective feedback loop. The item-specific feedback enables the development of different pedagogical methods and structures, as well as optimized recruitment of teachers to the activities. Changes based on specific, tangible feedback can be channeled into better horizontal and vertical integration throughout the program, and into better constructive alignment. Even moving lectures, seminars or practical learning activities a month or two in time, to fit better with the learning curve, can result in major improvements.
The high response rates for lecture ratings were achieved without systematic individual or group reminders and with voluntary ratings, even though the tutorial groups themselves were mandatory. The format also seems to counteract factors such as forgetting and lack of time noted in previous studies [9,12,13,15,16]. The response rate was 25 percentage points higher than the medical program's average response rate 2007-2012 on the university-wide end-of-term online rating system. It should, however, be noted that the end-of-term online rating system is based on individual students' ratings and not on group ratings. Weekly ratings and the online format did not result in an overuse of students' interest and motivation with accompanying low response rates, as has previously been noted and feared [18-22]. The students have the option of giving feedback, a grade, or both. A majority of ratings received open-ended feedback, and fears from faculty that students would by routine only grade items and not write feedback were unrealized.
The integration of SKURT into the campus culture, the involvement of student organizations, improvements based on rating data, and students' widespread awareness of the importance of their feedback [12,18,21,23-25] could be factors promoting the high response rates. Students soon started using a custom-created verb, "Skurta", meaning "to use SKURT", which illustrates the incorporation into campus culture.
Areas of improvement include clearer guidelines and more rigorous control of nomenclature, data structures and the input of both items and ratings. Building in reminders of uncompleted ratings could increase response rates further. Offering alternatives to giving unjustified low ratings (for example out of frustration over called-off items), such as a weekly "General Feedback" option, would increase the validity of item ratings. A push-notification system for teachers when new ratings are published would increase the usage of the individual ratings page. Manual cleansing and standardization of the database could enable further aggregated analysis of the data.
In summary, the principles applied in SKURT, which can conveniently manage vast amounts of data, have been an important tool in improving low-graded educational items by providing item-specific, timely feedback from students. The response rates were high, and no signs of wear and tear on the students caused by too many feedback activities were observed.
David Sinkvist programmed SKURT and administered all server applications. The three other authors of this manuscript constituted a project group throughout the entire duration of the project, and as heads of the medical program (TL, AT) communicated with teachers (ET, TL, AT), students (DS, ET, TL, AT), institutions (ET) and the Faculty (ET, TL, AT).
6. Declaration of Interest
The intellectual rights and the copyright to SKURT belong to David Sinkvist.
Figure 1: Lecture response rates (left axis) for different terms and number of tutorial groups (right axis) the fall semester 2013.
Figure 2: Distributions of rating grades amongst rated items per semester.
Figure 3: Evolution of lectures with average grade below 4.0 points the spring 2009 including average grade for all 18 lectures.