NOTE: Science Pirates (first released in 2008) is currently being updated for newer operating systems. Watch a video about the game, which was used with middle school students to help them learn about forming and testing science hypotheses.

Songs from Science Pirates are available on YouTube and NMSU's iTunes U.



The purpose of the external evaluation of the Science Pirates educational game was to investigate what impact it had on student learning in the areas of handwashing and science inquiry.

Science Pirates: The Curse of Brownbeard allows game players to explore an island, observe the pirates' poor handwashing behavior, and come to understand why the pirates believe they are cursed with a “sickness of the bowels”. The end-game activity allows players to design and conduct an experiment to test the best handwashing method, interpret the results, and make recommendations to the good captain. While exploring the island in search of three map pieces to Captain Brownbeard’s hideout, the game player engages in hypothesis design in the Super Tiki Monkey Temple; explores and identifies the differences between independent variables, dependent variables, and constants; and reduces experimentation to one variable at a time in the Powder Monkey. In addition, vocabulary and basic scientific content are presented in engaging ways.


The Science Pirates game was field tested in the fall of 2007 with a total of 585 middle school students in the San Francisco Bay Area of California. WestEd recruited six teachers who agreed to assign the game as an in-class or homework activity within their classes. The six teachers had a total of 22 classes between them. Before they began using the Science Pirates game, students completed a paper-and-pencil pretest of their knowledge of the health and science topics covered by the game.

The pre and posttest instruments

The items for the pre and posttests were drawn from released test items from various state and national tests. All items were selected response items so that dichotomous scores (1=correct, 0=incorrect) could be assigned in scoring, since the budget for the evaluation was not sufficient to include constructed response items, which would have had to be scored by a team of content experts. Thirty-one items were selected: 13 that assessed knowledge of handwashing and food safety and 18 that assessed science inquiry skills concerning design of experiments, dependent and independent variables, and controlling variables. The 31 items were assembled into two test forms, each containing 13 unique items and 5 common items that appeared on both forms, making a total of 18 items per form. The common items were used to statistically link the two forms so that student performance could be measured before and after the use of the Science Pirates game. The linking of the forms using the common items is described in more detail in the analysis section. The pre and posttest forms are shown in Appendix A.

Reliability of the pre and posttest instruments

The item characteristics of the test forms as a whole were measured by applying an Item Response Theory (IRT) simple Rasch model to the response data for the pretest and the posttest. The analysis program used, Conquest, also generated classical test theory statistics in the form of p-values and point biserials for each item, as well as item fit statistics in the IRT model.

The posttest item estimates were obtained first. Three items that had poor characteristics were deleted from the posttest data analysis, as they were not measuring the construct effectively. Items were deleted because they had high p-values (the proportion of students answering the item correctly), low discrimination (measured by point biserials), or fit statistics that were two or more measures from zero in the IRT analysis. With these items removed, the reliability of the posttest instrument as measured with the Kuder-Richardson 20 (K-R 20) analysis improved to 0.72, which is a moderate level of reliability. The mean score was 10.93 (SD 2.85) on a 15-point scale and the Standard Error of Measurement was 1.51.
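The two classical statistics reported above can be sketched in a few lines. This is a minimal illustration, not the Conquest output: the function names are hypothetical, K-R 20 is computed with the population variance of total scores, and the Standard Error of Measurement is assumed to follow the usual formula SD × sqrt(1 − reliability).

```python
import math

def kr20(responses):
    """K-R 20 for a list of per-student lists of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])
    # Proportion of students answering each item correctly (classical p-value).
    p = [sum(row[i] for row in responses) / n_students for i in range(n_items)]
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

def standard_error_of_measurement(sd, reliability):
    """Assumed formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Posttest values reported in the text: SD 2.85, reliability 0.72.
posttest_sem = standard_error_of_measurement(2.85, 0.72)
print(round(posttest_sem, 2))
```

Under that assumed formula, the posttest SD of 2.85 and reliability of 0.72 give a value that rounds to the reported 1.51.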

The IRT item parameter estimates obtained for the five items that were common to both forms were exported from the posttest analysis and then imported into the analysis of the pretest data. The item estimates for the pretest items were then “anchored” to the values of these five items. This ensured that the item parameters for the pretest items and the posttest items were anchored to the same five items that appeared on both forms. As a result, student performance on the pretest and the posttest was measured on the same metric even though students took different items each time (except the five common items).
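To make the idea of a shared metric concrete, the sketch below shows the simple Rasch model and a maximum-likelihood ability estimate computed against fixed item difficulties. It is an illustration under stated assumptions, not a description of how Conquest works internally: the function names are hypothetical, the difficulties in the example are made up, and a plain Newton-Raphson update is used. Because the item difficulties are held fixed, abilities estimated from different sets of items calibrated to the same anchors land on the same logit scale.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the simple Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_ability(responses, difficulties, iterations=25):
    """Maximum-likelihood ability (in logits) given fixed item difficulties.

    responses: 0/1 scores, one per item.
    Not defined for all-correct or all-incorrect response patterns.
    """
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        # First derivative of the log-likelihood with respect to theta.
        gradient = sum(x - p for x, p in zip(responses, probs))
        # Test information at the current theta.
        information = sum(p * (1 - p) for p in probs)
        theta += gradient / information
    return theta

# Hypothetical difficulties (logits) for three items on an anchored scale.
example_difficulties = [-1.0, 0.0, 1.0]
example_theta = estimate_ability([1, 1, 0], example_difficulties)
```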

In the analysis of the pretest, the reliability of the test instrument was 0.70, a moderate level of reliability. The mean score was 12.76 (SD 3.15) on an 18-point scale and the Standard Error of Measurement was 1.73.

From the IRT analyses, estimates of the ability level of each student were obtained from their pretest data and from their posttest data, and the gain in scores between the pretest and the posttest was calculated. The gain scores were measured in logits, a log-of-the-odds measure obtained from the IRT analyses. Mean gain scores were then calculated for the whole group of students and then for the students divided into three equal groups representing those who scored in the highest third, those in the middle third and those in the lowest third of performance on the pretest (i.e., their beginning ability).
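The gain-score and grouping steps described above can be sketched as follows. The data are invented for illustration, the function names are hypothetical, and the rule for assigning students to thirds (rank order on the pretest, with ties broken by list position) is an assumption, since the report does not say how ties were handled.

```python
def gain_scores(pre, post):
    """Posttest ability minus pretest ability, in logits."""
    return [b - a for a, b in zip(pre, post)]

def split_into_thirds(pre):
    """Label each student Low, Medium or High by pretest rank."""
    n = len(pre)
    order = sorted(range(n), key=lambda i: pre[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        if rank * 3 < n:
            labels[i] = "Low"
        elif rank * 3 < 2 * n:
            labels[i] = "Medium"
        else:
            labels[i] = "High"
    return labels

def mean_gain_by_group(pre, post):
    """Mean gain score for each beginning ability group."""
    gains = gain_scores(pre, post)
    labels = split_into_thirds(pre)
    result = {}
    for group in ("Low", "Medium", "High"):
        values = [g for g, label in zip(gains, labels) if label == group]
        result[group] = sum(values) / len(values)
    return result

# Invented ability estimates (logits) for six students.
pre = [-1.0, 0.5, 1.2, -0.3, 0.0, 2.0]
post = [-0.2, 0.6, 1.1, 0.3, 0.4, 2.0]
group_means = mean_gain_by_group(pre, post)
```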


An initial analysis of the 585 students for whom there were complete pre and posttest data revealed ten students whose gain scores were outliers. These students were removed from the dataset before the analyses comparing the group means were conducted.

The mean gain score for all 575 students was 0.18 (SD .68) logits, and the difference was statistically significant (p < .001). To judge the magnitude of this gain, an effect size was calculated. The effect size was .20, a small effect of the magnitude often seen in educational interventions.
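As a rough check on the significance claim, the sketch below computes a standard error and a test statistic from the reported summary values. It assumes a one-sample t-test of the mean gain against zero, which the report does not confirm, and the function name is hypothetical. The effect size formula is not stated in the report, so it is not reproduced here.

```python
import math

def one_sample_t(mean, sd, n):
    """Return (t statistic, standard error) for a test of mean against zero."""
    standard_error = sd / math.sqrt(n)
    return mean / standard_error, standard_error

# Reported values: mean gain 0.18 logits, SD .68, N = 575.
t_stat, se = one_sample_t(0.18, 0.68, 575)
print(round(t_stat, 1), round(se, 3))
```

A statistic of this size, with 574 degrees of freedom, lies well beyond the two-sided critical value for p < .001 (about 3.3), which is consistent with the reported result.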

Beyond examining the learning gains for the group of students as a whole, we were interested in whether the Science Pirates game might be beneficial to students who might find the material more challenging in other settings. We therefore divided the students into those who scored High, Medium or Low on the pretest and analyzed the gain scores for the top third, middle third and lower third of the students in the study. This revealed some interesting findings.

First, a one-way Analysis of Variance (ANOVA) was conducted to see whether gain scores differed significantly among the High, Medium and Low beginning ability groups. This revealed a significant between-groups effect (F=12.98, df=2, p < .001), and so we proceeded to investigate the differences between the gain scores for the High, Medium and Low beginning ability groups.
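The F statistic can be approximated from the group sizes, means and standard deviations reported in Table 1. The sketch below is a check under stated assumptions, not the original analysis: the function name is hypothetical, and the inputs are the rounded table values, so the result can differ slightly from the reported 12.98.

```python
def one_way_anova_from_summary(groups):
    """groups: list of (n, mean, sd) tuples, one per group.

    Returns (F statistic, between-groups df, within-groups df).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares, recovered from each group's sample SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# (N, mean gain, SD) for the Low, Medium and High groups in Table 1.
table_1 = [(190, 0.33, 0.69), (193, 0.23, 0.74), (192, -0.01, 0.56)]
f_stat, df_between, df_within = one_way_anova_from_summary(table_1)
```

With the rounded table values, F comes out at roughly 13 with 2 and 572 degrees of freedom, close to the reported F=12.98.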

Table 1 shows the mean gain scores for the students divided by High, Medium and Low beginning ability (as measured by the pretest), and Figure 1 illustrates them graphically with a box plot of the gain scores. It is notable that there was no statistically significant difference between pre and posttest scores for the High beginning ability group. However, for the Medium and Low beginning ability groups there were statistically significant improvements in performance between the pretest and the posttest. In fact, a comparison of the mean gain scores among the High, Medium and Low groups showed that there was a statistically significant difference between the gain scores for Medium and Low compared to the High group.

The gains for the Low and Medium ability groups represent effect sizes of .53 for the Low beginning ability students and .41 for the Medium beginning ability students, both of which are medium effect sizes. This indicates that the Science Pirates game had beneficial effects on knowledge of handwashing and on science inquiry skills for lower performing students but did not make any difference for higher ability students.

Performance on the Food Safety and Handwashing Hygiene Items

The pretest contained five items that addressed food safety and handwashing hygiene, and there were six such items on the posttest. This number of items was insufficient to run a reliable analysis using Item Response Theory, and so raw scores were used. One item on the pretest and one item on the posttest were excluded from the subscale because they were very easy for students and so had low variability, resulting in low item discrimination values. For the pretest food safety/hygiene subscale, which comprised four items, the reliability was low (Cronbach’s alpha = .19), and so was the reliability of the five-item subscale on the posttest (Cronbach’s alpha = .22). These low internal reliability statistics are the result of having so few items on the subscale.

The mean raw score on the food safety/hygiene subscale was 3.01 (75%) for the pretest and 4.23 (85%) on the posttest, with a mean standardized gain score of 10%. It is evident from the fact that students did well even on the pretest food safety/hygiene items that this group of middle school students already had a high degree of knowledge about food safety and handwashing hygiene before they used the Science Pirates game. Given the low reliability of the subscale, it is not meaningful to interpret this gain as being practically significant.

Table 1. Mean Gain Scores Classified by Beginning Ability
Group                      N     Mean   Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
Low Beginning Ability      190   .33    .69              .05          .23                  .43                  -1.34     2.04
Medium Beginning Ability   193   .23    .74              .05          .12                  .33                  -1.94     1.57
High Beginning Ability     192   -.01   .56              .04          -.09                 .07                  -1.40     1.25
Total                      575   .18    .68              .03          .13                  .24                  -1.94     2.04
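The standard error and 95% confidence interval columns of Table 1 follow from each group's N, mean and standard deviation. The sketch below recomputes them under a normal approximation (a multiplier of 1.96), which is an assumption: the original software may have used the t distribution, and the table inputs are rounded, so a recomputed bound can differ from the table by .01. The function name is hypothetical.

```python
import math

def confidence_interval_95(n, mean, sd):
    """Return (standard error, lower bound, upper bound) for a group mean."""
    standard_error = sd / math.sqrt(n)
    margin = 1.96 * standard_error  # normal approximation, an assumption
    return standard_error, mean - margin, mean + margin

# (N, mean gain, SD) from Table 1.
low_se, low_lower, low_upper = confidence_interval_95(190, 0.33, 0.69)
high_se, high_lower, high_upper = confidence_interval_95(192, -0.01, 0.56)
```

The interval for the High group spans zero, which matches the finding that this group showed no statistically significant gain, while the interval for the Low group lies entirely above zero.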

Figure 1. Boxplot of the Gain Scores for Students Classified by Beginning Ability


The use of the Science Pirates game improved student learning of handwashing and science inquiry skills overall. While the effect size was only .20 for the group of students as a whole, students who had a low or medium ability level at the start improved significantly more than the students who had a high initial ability. Students with high beginning ability showed no significant gains, but students with low or medium beginning ability improved significantly, with medium effect sizes of .53 and .41 respectively. In summary, the Science Pirates game seemed particularly effective for students who were in the low and medium ability ranges at the start of the study.