A Methodological Review of the Program Evaluations in K-12 Computer Science Education

Because of the potential for methodological reviews to improve practice, this article presents the results of a methodological review, and meta-analysis, of kindergarten through 12th grade computer science education evaluation reports published before March 2005. A search of major academic databases, the Internet, and a query to computer science education researchers resulted in 29 evaluation reports that met stringent criteria for inclusion. Those reports were coded in terms of their demographic characteristics, program characteristics, evaluation characteristics, and evaluation findings. It was found that most of the programs offered direct computer science instruction to North American high school students. Stakeholder attitudes, program enrollment, academic achievement in core courses, and achievement in computer science courses were the most frequently measured outcomes. Questionnaires, existing sources of data, standardized tests, and teacher- or researcher-made tests were the most frequently used types of measures. Based on eight programs that offered direct computer science instruction, the average increase on tests of computer science achievement over the course of the program was 1.10 standard deviations, or the statistical equivalent of 73 out of 100 program participants having shown improvement. Some of the main challenges for the evaluation of computer science education programs are the absence of standardized, reliable, and valid measures of K-12 computer science education and the difficulty of understanding the causal links between program activities, gender, and program outcomes.


Introduction
There are both economic and social needs for high-quality kindergarten through 12th grade (K-12) computer science education. The U.S. Department of Labor, Bureau of Labor Statistics, projects that the "employment of computer specialists is expected to grow much faster than average for all occupations as organizations continue to adopt and integrate increasingly sophisticated technologies" (2004). K-12 computer science education helps prepare individuals to attain advanced computing degrees, which, in turn, help those individuals meet the rapidly changing technological needs of business and industry. Even for those who do not intend to go into computing as a profession, some degree of computing skill and knowledge will be necessary to meaningfully participate in the technologically oriented societies of the future (Breslin, 1990; The National Research Council Committee on Information Technology Literacy, 1999).
The SIGCSE Working Group on Evaluation (Almstrum et al., 1996) pointed out that there are many groups who stand to gain from the practice of evaluation¹: those "constituencies that stand to benefit from what we [computer science educators] learn [from evaluation] include ourselves, our community of colleagues, and society as a whole. The ultimate beneficiaries of our learning, however, are our students" (p. 202). Some of the reasons that Almstrum et al. give for conducting evaluations of computer science education programs² are presented below:
• to satisfy our curiosity about what works and what doesn't;
• to discover issues of importance to ourselves and our students;
• inform our course development process;
• to compare alternatives;
• to help to identify important factors in a complex phenomenon;
• to gain the ability to make informed decisions;
• confirm or refute conventional wisdom;
• justify actions with cost/benefit analysis;
• validate research proposed to outside sources (p. 202).
Conducting a review of program evaluations is as necessary to evaluation as conducting a high-quality literature review is to research. A review of previous evaluations helps evaluators get acquainted with the contexts and issues in their program's field, it familiarizes them with the research designs and measures being used by their peers, it helps identify key variables, and it can indicate what the expected results of a particular type of program should be. A review can also be an indicator of the state of the research and, thereby, motivate evaluators to keep doing what they do well and rectify what they do not do so well. Moreover, systematic reviews of program evaluations also benefit policy makers directly by synthesizing information that is needed for informed decision making (Carvalho and White, 2004; Cooper and Hedges, 1994; Joint Committee on Standards for Educational Evaluation, 1994; Weiss, 1998).
There have been several methodological reviews of computer science education research (see, e.g., Randolph, 2007a; Randolph et al., 2005a; and Valentine, 2004) and reviews of resources for evaluating programs in computer science education (Randolph and Hartikainen, 2004). However, there have been no previous systematic reviews of the program evaluations in K-12 computer science education.
¹ Throughout this article, because of the unresolved debate regarding what should be considered evaluation and what should be considered research, I consider an investigative activity to be evaluation, rather than research, if the investigators state that they are doing evaluation, rather than research. In general, I define (program) evaluation as an activity whose primary goal is to answer questions that are important to program stakeholders, whereas I define research as an activity whose primary goal is to answer questions that are important to the scientific community. See Randolph (2007b).
² I mean program in the sense of a project, not in the sense of software.
Given the benefits of systematic reviews of program evaluations, and a lack of such reviews in the field of computer science education, I conducted a systematic, methodological review of the evaluations of K-12 computer science education programs. The research questions answered by this review are listed below:
1. What are the methodological characteristics of computer science education program evaluations?
2. What are the demographic characteristics of computer science education evaluation reports?
3. What are the characteristics of computer science education programs that are being evaluated?
4. What is the average effect of a particular type of program on computer science achievement?
The answers to Questions 1, 2, and 3 will help evaluators of computer science education programs acquaint themselves with the methods that have been used in the past, with the trends and contexts of the field, and with the characteristics of the programs that they may be asked to evaluate. The answer to Question 4 will potentially allow evaluators to compare the effects of the programs that they evaluate to the effects of other, similar programs. For example, the answer to Question 4 will allow evaluators to make statements like "the effects of this program are greater than the effects of similar programs", instead of simply stating "this program has an effect greater than zero". Finally, this review, because it draws on evaluations from both computing science and program evaluation traditions, will help bridge the gap between those fields.
In the next section, I discuss the coding procedure, coding variables, literature search, criteria for inclusion, and methods of data analysis that were used. In the results section, I report the methodological, demographic, and program characteristics of all of the evaluations included in the review and report the pooled effect size, in terms of computer science achievement, for eight evaluations in which an experimental or quasi-experimental method was used. I also report the results of a subgroup analysis of types of programs because six of eight effect sizes came from evaluations of the same program. In the discussion section, I report potential biases in the literature, discuss the results for each study question, and point out study limitations. In the final section, I summarize the results, spell out their implications for practitioners and evaluators of computer science education programs, and discuss some of the main challenges for the field.

Methods
In this section, I report on the search strategies used to find relevant evaluation reports, the criteria used for including an evaluation report in this analysis, the variables that were coded, and the procedures for establishing interrater reliability. The variables that were coded can be grouped into four categories: demographic characteristics, program characteristics, evaluation characteristics, and findings.

Search Strategy
Several search strategies were used to find evaluation reports for this review. First, the academic databases (Academic Search Premier, TOC Premier, PRE-CINAHL, Computer Source, ERIC, Library Literature and Information Science, Newspaper Source, Psychology and Behavioral Science Collection, PsycINFO, Social Science Abstracts, Communication and Mass Media Complete, and Vocational and Career Collection) were searched, via EBSCOhost, in July of 2004 using the keywords computer science education and program evaluation. The unit of data collection was the evaluation report. In March of 2005, electronic searches of the ACM Digital Library and of the Internet, via the Google search engine, were conducted using six combinations of the phrases "computer science education", "K-12", "evaluation", and "program evaluation". The abstracts, descriptions, or links of the first 200 entries of the Internet and ACM Digital Library searches were examined for each combination until it could be determined that the entry would not plausibly lead to an evaluation report that would meet the criteria for inclusion.
From the references sections of the articles that were found through the electronic searches, a branching hand search was used to identify other reports that would meet the criteria for inclusion until a point of saturation had been reached. After a preliminary list of evaluation reports was gathered from the electronic and hand searches, an e-mail message was sent to the 2,795 subscribers of the EVALTALK listserv and to the 1,112 members of the ACM SIGCSE-Members listserv on March 15, 2005; the message asked subscribers to send information about evaluation reports that met the criteria for inclusion but were not on the preliminary list.

Criteria for Inclusion
The following criteria were used to determine which evaluation reports (i.e., reports in which the authors specified that they conducted 'evaluation') would be included in the review:
1. The evaluation report concentrated on a particular computer science education program and not on a particular computer science education strategy or application.
2. The report was written in English.
3. The direct beneficiaries of the program were K-12 students.
4. The programs delivered the types of computer science education content mentioned in the ACM Model Curriculum for K-12 Computer Science (Tucker et al., 2003) to K-12 students.
5. Evaluation reports that concentrated only on the evaluation of computer infrastructure for computer science education programs were not included.
6. Evaluations of computer science education teacher training programs were only included if they examined students' resulting computer science achievement.
7. Studies were included in the meta-analyses proper if there was enough information reported to calculate Cohen's d and if they met the previous six criteria.

Variables Coded and Procedures for the Interrater Reliability Check
After all of the categories for each of the variables were determined and the evaluations were coded by the primary rater, a secondary rater coded the key study characteristic variables (type of inquiry, type of experimental design, and study quality) on four randomly selected evaluation reports. Kappa (i.e., Brennan and Prediger's (1981) κ_m) and percent of overall agreement were used as the interrater agreement statistics.
The variables of the coding sheet were grouped into four categories: demographic characteristics, program characteristics, evaluation characteristics, and findings. Categories for each variable were created using an emergent coding procedure, except for curriculum area, type of inquiry, and evaluation approach, where a priori coding categories were used. The categories that resulted via the emergent coding procedure are presented in Tables 2 through 6.
Demographic study characteristics. Demographic variables included evaluation author, country of origin, source of publication, and year of publication.
Program characteristics. Several characteristics of the programs were coded, such as type of program activities, target population, type of school (i.e., public or private), grade level, and type of delivery (i.e., onsite or distance). Additionally, the type of instruction that each program delivered was classified according to the various areas of the Association for Computing Machinery's (ACM) K-12 computer science education curriculum (Tucker et al., 2003).
Evaluation methodology characteristics. Several evaluation methodology characteristics were coded for each evaluation report: the outcomes that the evaluation examined; the type of inquiry used; the type of instrument used; whether the instrument was quantitative, qualitative, or mixed; the moderating variables investigated; and type of evaluation approach.
The categories used for type of inquiry, which are adapted from (Randolph et al., 2005a) [...] 2002) call causal explanation. Causal comparative studies compare two or more groups on an inherent variable. In experimental/quasi-experimental investigations, the evaluator compares a factual to a counterfactual condition to make causal conclusions (Shadish et al., 2002). Correlational investigations examine how levels of one variable covary with levels of another variable. If studies were classified as experimental/quasi-experimental, the experimental design was classified into one of the following categories: pretest-posttest with control, pretest-posttest without control, posttest with control, one-group posttest-only, and longitudinal. See Randolph et al. (2005a) for a more detailed description of these categories.
Stufflebeam's (2001) framework of evaluation approaches was originally used to categorize the sample of evaluations into four categories: questions or methods oriented, decision/improvement oriented, pseudo-evaluation, or social/agenda oriented. However, this variable was abandoned because acceptable levels of interrater reliability could not be established.
Ratings of study quality for experimental/quasi-experimental designs were based on study design and the degree of controls for the threats to internal validity (see Shadish et al., 2002). Study quality was rated as high if the evaluator used a pretest-posttest with control group design or a multiphase, repeated measures design and there was no evidence of threats to internal validity. If there was evidence of threats to internal validity, studies using those designs were rated as medium. A study was rated as high if a pretest-posttest without control group design or a posttest-only with control group design was used and there was no evidence of threats to validity. Otherwise, studies using those designs were rated as medium. Studies that used the one-group posttest-only design were rated as low unless there was very strong evidence that they controlled for threats to internal validity, in which case studies that used that design were rated as medium.
Findings. For experimental/quasi-experimental evaluations that quantitatively examined the effects of a program on computer science ability and reported means and standard deviations, or F or t statistics, Cohen's d was the effect size metric used.
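The conversions mentioned above can be illustrated with a minimal Python sketch. It assumes two independent groups (program and comparison) and uses the textbook pooled-standard-deviation form of Cohen's d; the function names and the example numbers are hypothetical and are not taken from any of the reviewed evaluations or from the software used in this review.

```python
import math


def cohens_d_from_means(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def cohens_d_from_t(t, n_t, n_c):
    """Convert an independent-groups t statistic to d."""
    return t * math.sqrt(1.0 / n_t + 1.0 / n_c)


def cohens_d_from_f(f, n_t, n_c, positive=True):
    """Convert a two-group F (1 numerator df) to d.

    F equals t squared in this case, so the direction of the effect
    is lost and must be supplied by the caller.
    """
    t = math.sqrt(f)
    return cohens_d_from_t(t if positive else -t, n_t, n_c)


# Hypothetical example: 30 students per group, equal standard deviations of 10.
d_means = cohens_d_from_means(75.0, 10.0, 30, 65.0, 10.0, 30)  # 1.0
d_t = cohens_d_from_t(math.sqrt(15.0), 30, 30)                 # also 1.0
print(d_means, d_t)
```

Single-group pretest-posttest designs call for a different standardizer, so this sketch should not be read as the exact computation applied to every evaluation in the sample.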

Data Analysis
To answer study questions about demographic, program, and evaluation characteristics, frequencies were calculated using the evaluation case, which was sometimes a single report and sometimes a series of evaluations of the same program, as the unit of analysis. For the study question about the average effect of computer science programs on student achievement, the unit of analysis was the evaluation report. Cohen's d, with Hedges's (g_U) bias correction (Rosenthal, 1994), was used as the common metric for outcomes of quantitative measures of computer science achievement (i.e., teacher- or researcher-made tests). The bias-corrected effect sizes were calculated using Effect Size Calculator (n.d.) software. A variance and within-study sample size/study quality weighting approach, as described in Shadish and Haddock (1994), was used to weight studies. Lipsey and Wilson's (2001) MetaF SPSS macro was used to calculate statistics for main effects and for interactions between type of program (i.e., either the Nature-Computer Camp program or programs other than Nature-Computer Camp) and outcomes of computer science achievement. A random-effects model was used for these analyses if the fixed-effect hypothesis of homogeneity was rejected, as indicated by a fixed-effect p value of Q total less than 0.05 (Hedges, 1994; Raudenbush, 1994).
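The pooling logic described above can be sketched in Python as follows. This is a simplified illustration, not the procedure actually run for this review: it uses plain inverse-variance weights (without the study-quality weighting) and the DerSimonian-Laird estimator for the between-study variance, which is a commonly used stand-in rather than the estimator implemented in the SPSS macro. The function names and the input data are hypothetical.

```python
import math


def hedges_g(d, n1, n2):
    """Return (g, variance of g) using the correction J = 1 - 3 / (4 * df - 1)."""
    df = n1 + n2 - 2
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    var_d = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return j * d, j * j * var_d


def pool(studies):
    """Pool a list of (d, n1, n2) tuples under fixed- and random-effects models."""
    effects = [hedges_g(d, n1, n2) for d, n1, n2 in studies]
    g = [e[0] for e in effects]
    v = [e[1] for e in effects]

    # Fixed-effect model: inverse-variance weights.
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, g)) / sum_w

    # Homogeneity statistic Q, referred to chi-square with k - 1 df.
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, g))
    df = len(studies) - 1

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects model: add tau2 to every study's variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    sum_w_re = sum(w_re)
    random_mean = sum(wi * gi for wi, gi in zip(w_re, g)) / sum_w_re

    return {
        "fixed": fixed, "se_fixed": math.sqrt(1.0 / sum_w),
        "random": random_mean, "se_random": math.sqrt(1.0 / sum_w_re),
        "q": q, "df": df, "tau2": tau2,
    }


# Hypothetical, deliberately heterogeneous effect sizes.
result = pool([(0.2, 50, 50), (1.5, 50, 50), (0.9, 40, 40)])
low = result["random"] - 1.96 * result["se_random"]
high = result["random"] + 1.96 * result["se_random"]
print(result["q"], result["df"], result["random"], (low, high))
```

The p value for Q is left out to keep the sketch free of dependencies; a statistics library's chi-square survival function, evaluated at Q with k - 1 degrees of freedom, would supply it. When Q exceeds its degrees of freedom, the random-effects standard error is larger than the fixed-effect one, which is why the random-effects model gives the more conservative test.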

Search Results
The EBSCOhost search yielded 85 entries; the electronic searches yielded 1,123 entries. Although those entries led to many evaluation reports of computer science education programs, only 29 evaluation reports met the criteria for inclusion. The evaluation reports that were included in this review are preceded by an asterisk in the references section.

Akenegbu, 1992; DC, 1983, 1985a, 1985b, 1986; Negero, 1994
Multiple evaluations investigating academic achievement, behavior and socialization of 6th Grade participants in Nature-Computer Camp from 1983 to 1994.

Berney and Alvarez, 1990a, 1990b
An evaluation of a program that provided instruction in computer skills to limited-English-proficient Spanish-speaking students in a New York high school.

Atwater, 1991
An evaluation of Computers Unlimited Magnet Elementary schools that examined program implementation, stakeholder attitudes, academic achievement, and participation of minorities.

Kirkpatrick et al., 1991
An evaluation, using an experimental design, of 21 science, math, and computer enrichment programs.

Atwater, 1992
An evaluation, using standardized tests with experimental designs, of Computers Unlimited Magnet High Schools.

Piña, 1992
An informal evaluation of a computer literacy program.

Seever, 1992
An evaluation using a standardized test and experimental designs of Computers Unlimited Magnet Middle Schools.

Fitzgerald and Hines, 1996
An informal evaluation of a computer science fair for 6th-12th grade students.

Walker and Rodgers, 1996
An evaluation of a program to decrease the pipelining of female students of computer science.

Golan and Means, 1998a, 1998b; Penuel et al., 2000, 2001

Demographic Characteristics
Table 2 presents the demographic characteristics for the 19 evaluative cases. It shows that most of the evaluation reports came from North America, that most were found in the ERIC database, and that the number of computer science evaluations reported has increased every decade since the 1970s.

Program Characteristics
Table 3 presents the target participants, their grade levels, the curriculum area that was targeted, and the activities that were conducted in the 19 evaluative cases. In general, the data in Table 3 indicate that general education, high school students were most often the target participants of the programs. The curriculum areas that were targeted correspond with the 6th-8th grade and 9th-10th grade levels of the Tucker et al. (2003) curriculum.
The data also indicate that student instruction was the most frequent type of program activity. Unfortunately, the evaluation reports, in general, did not report in detail what approach to student instruction was taken.
Note to Table 3. More than one curriculum area or program activity was possible per case. Two cases did not provide enough information to determine the curriculum area.

Methodological Characteristics
Table 4 presents the findings that concern the program outcomes and the student-level factors that were examined. Stakeholder attitudes were the most frequently investigated outcome, followed by levels of enrollment, achievement in core subjects, and achievement in computer science. Race/ethnic origin, aptitude, and gender were the student-level factors that were examined in the 19 evaluative cases. Table 5 presents information about the measures used in the 19 evaluative cases in this sample. The most frequent types of measures were questionnaires, existing records (e.g., attendance logs), and standardized tests. Of the 67 measures that were used in these evaluation cases, quantitative measures were used more frequently than qualitative or mixed-methods measures.
A cross-tabulation of the measures and outcomes, which is not presented here because of its large size and sparseness, showed that the measures were appropriately matched with outcomes. For example, questionnaires or focus groups, which are considered to be appropriate means of collecting data about attitudes (Frechtling et al., 2002), were used in 16 out of 17 cases in which stakeholder attitudes were examined. In the nine cases where computer science achievement was measured, the most frequently used measures were teacher- or researcher-made tests (5 out of 9), direct observation (2 out of 9), and standardized tests (1 out of 9), all of which are generally considered by the public to be appropriate measures of learning (Frechtling et al., 2002). Only one evaluation used self-report questionnaires, which are generally considered to be unreliable measures of learning (Almstrum et al., 2002; Silka, 1989), to measure computer science achievement.
Table 6 presents the frequencies of the various types of inquiry the evaluators used, the frequencies of experimental designs that were used, and information about study quality when experimental designs were used. Qualitative or experimental/quasi-experimental inquiries were the most common. Of the experimental designs, the pretest-posttest design with a control group was the most frequently used design, followed closely by the one-group posttest-only design.

Evaluation Findings
Fig. 1 shows the effect sizes, their 95% confidence intervals, and the n-sizes of the eight studies that quantitatively investigated the effects of a program on computer science achievement, used an experimental or quasi-experimental design, and gave enough information to calculate Cohen's d. At the bottom of Fig. 1, the weighted average effect size and its confidence interval are shown.
As indicated in Table 7, the weighted average effect size (using a random-effects model) for the eight evaluations on computer science achievement (i.e., teacher- or researcher-made tests or quizzes) was 1.10, with a 95% confidence interval from 0.72 to 1.47. Since Q total for the pooled estimate, using a fixed-effects model, indicated heterogeneity of effect sizes across evaluations, a random-effects model was used. Homogeneity of effect sizes was found using a random-effects model, as indicated by a Q total with a p value greater than 0.05 (see Table 7). Since six out of eight effect sizes came from evaluations of the same program (i.e., Nature-Computer Camp), I present the results of a subgroup analysis of Nature-Computer Camp program evaluations and evaluations of programs other than the Nature-Computer Camp. The data in Table 7 indicate that there was neither a statistically significant difference between the groups nor a large difference between the effect sizes of the two groups of evaluations.
Note. Effect sizes in the positive direction indicate an increase in computer science achievement. Evaluation reports followed by an asterisk are Nature-Computer Camp evaluations. The number in parentheses is the N-size for each evaluation.

Sensitivity Analysis: Random versus Fixed Models
All of the sources of variance presented in Table 7 were statistically significant using a fixed-effects model; however, none of the sources of variance were statistically significant using a random-effects model. This discrepancy is not uncommon because a random-effects model is generally more conservative than its corresponding fixed-effects model when there is a large amount of variance unaccounted for (Hedges, 1994). The results of the homogeneity tests presented in Table 7 indicate that the random-effects model had a better fit with these data than the fixed-effects model.

Coding Reliability
Kappa was 1.0 and percent of overall agreement was 100 for type of inquiry and study design. For quality of study, kappa was .62 and overall percent of agreement was 75.
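As a rough check on how these agreement statistics relate, the following Python sketch computes percent agreement and Brennan and Prediger's free-marginal kappa, in which chance agreement is fixed at one over the number of categories. The ratings shown are hypothetical; the sketch only assumes, as described in the Methods section, that four reports were double-coded and that study quality had three categories (high, medium, and low).

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave the same code."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def free_marginal_kappa(rater_a, rater_b, n_categories):
    """Brennan-Prediger kappa: chance agreement is 1 / n_categories."""
    p_observed = percent_agreement(rater_a, rater_b)
    p_chance = 1.0 / n_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical quality ratings for four reports; the raters disagree on one.
primary = ["high", "medium", "low", "high"]
secondary = ["high", "medium", "low", "medium"]

print(percent_agreement(primary, secondary))       # 0.75
print(free_marginal_kappa(primary, secondary, 3))  # 0.625
```

Under those assumptions, agreement on three of four reports gives a kappa of 0.625, which is in line with the values of .62 and 75 percent reported for study quality.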

Potential Biases in the Reviewed Literature
Assuming that the universe of computer science education evaluations would be proportionally distributed across the globe and be published in a variety of sources, I am inclined to believe that this sample over-represents North American, general-education-centered evaluations (see Table 2).Although the literature search was fairly comprehensive and used international databases that were grounded both in education and computer science, I hypothesize that there are plenty of computer science education program evaluations being done; however, it is primarily North American evaluators who publish their evaluation reports in sources that are highly indexed by academic databases or Internet search engines.
Another possible bias is that six of the eight evaluation cases included in the meta-analysis evaluated the same program: Nature-Computer Camp. In order to investigate this possible source of bias, I conducted subgroup analyses, the results of which are presented in Table 7. The results showed that there were no practically or statistically significant differences between the outcomes of Nature-Computer Camp evaluations and the outcomes of other computer science education program evaluations.

Program Characteristics
The majority of programs provided various kinds of student instruction targeted at K-12 students.Because of the well-documented pipelining of female students in computer science (Gürer and Camp, 2002), it is surprising that so few programs were geared towards females (see Table 3) and that so few evaluations examined gender interactions (see Table 4).

Methodological Characteristics
Surprisingly, computer science achievement was only the fourth most frequent outcome that was examined (see Table 4). Stakeholder attitudes, enrollment, and achievement in core subjects, which are known correlates of computer science achievement, were outcomes that were all examined more frequently than computer science achievement itself.
The frequencies of the types of measures that were used (see Table 5) align well with the frequencies of the outcomes that were examined (see Table 4). Stakeholder attitudes were measured through questionnaires, enrollment was measured through existing records, academic achievement in core subjects was measured through standardized tests, and computer science achievement was measured by teacher- or researcher-made tests.
The fact that the only measure of computer science achievement that reported validity or reliability estimates (Palormo, n.d.) is no longer available and that all other measures of computer science achievement were localized teacher- or researcher-made tests indicates a lack of validated, reliable, standardized measures in computer science education, or a lack of awareness about them. According to Haas and Hassell (1983), there was a need for reliable and validated measures of the effectiveness of computing education over 20 years ago; from the data in this review it appears that this is still the case today. Computer science evaluators might benefit from the work of Cooper, Cassel, Moskal, and Cunningham (2005), who give guidelines for creating outcomes-based measures for computer science education, or from Fincher and Petre (2004). Although there is a validated and reliable computer science subject test developed for the Graduate Record Examination (Educational Testing Service, 2004), it is neither available for administration by evaluators nor targeted at K-12 students.
The distribution of types of inquiry in this sample of evaluations is similar to the distributions of types of inquiry in educational technology research journals (see Randolph et al., 2005b). There are almost equal frequencies of survey research, qualitative research, and experimental/quasi-experimental research. The most frequently used design (i.e., the pretest-posttest design with controls) in these evaluations is a strong design that controls for many threats to internal validity; the second most frequently used design (i.e., the one-group posttest-only design) is a weak design that is vulnerable to almost all threats to internal validity (Shadish et al., 2002). Overall, based on study design, most of the experimental or quasi-experimental investigations in the evaluations in this sample were deemed as having high or medium quality (see Table 6).

Evaluation Findings: Computer Science Achievement
Fig. 1 shows the effect sizes for each of the evaluations that used an experimental evaluation design and shows that the average standardized mean difference effect size was 1.10 on standardized or teacher- or researcher-made tests of computer science achievement. The confidence interval for this estimate indicates that it is plausible that the effect size parameter might be as low as 0.72 or as high as 1.47. The programs' durations were one academic year or semester, except for the Nature-Computer Camp program, which consisted of five one-week sessions.
To aid in the interpretation of that effect size, in Table 8 I present a binomial effect size display (see Rosenthal et al., 2005), which reframes an effect size of 1.10 as a two-by-two table showing how many students, on average, would be expected to improve and not improve as a result of participating in a computer science education program similar to the ones listed in Fig. 1. So, in this case, an effect size of 1.10 is statistically equivalent to about 73 out of 100 students showing improved achievement on teacher- or researcher-made computer science tests or quizzes after participating in a computer science education program. (If the computer science education programs had had no effect, by chance only 50 out of 100 students would have been expected to have improved scores.) It is no surprise that computer science instruction, in general, led to an increase in students' scores on computer science tests; however, this result (that students' scores increased by 1.10 standard deviations) might be useful to evaluators who want to compare the outcomes of the computer science program they are evaluating to the outcomes of the computer science programs reviewed here. For example, for a computer science education program to be as effective as the ones included here, about 73 out of 100 students would need to have improved scores on teacher- or researcher-made measures of computer science achievement.
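The arithmetic behind a binomial effect size display can be sketched in a few lines of Python. The sketch assumes the common equal-group-size conversion from d to a correlation, r = d / sqrt(d^2 + 4), and then places the "improved" rate at 0.5 + r/2; the function names are hypothetical, and the conversion actually used to build Table 8 may differ.

```python
import math


def d_to_r(d):
    """Convert d to a correlation, assuming equal group sizes."""
    return d / math.sqrt(d * d + 4.0)


def binomial_effect_size_display(d):
    """Return the (improved, not improved) proportions for the program group."""
    r = d_to_r(d)
    return 0.5 + r / 2.0, 0.5 - r / 2.0


improved, not_improved = binomial_effect_size_display(1.10)
print(round(improved, 3), round(not_improved, 3))
```

With d = 1.10, this conversion gives an improvement rate of roughly 0.74, close to the 73 out of 100 reported above; the small gap presumably reflects rounding or a slightly different conversion. With d = 0, the display returns the 50-50 split described in the text.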

Study Limitations
Assuming that these data over-represent North American, general-education-targeted programs in computer science education, the results are best generalized to those types of evaluations. Also, unfortunately, there were too few studies that could be included in the meta-analysis to determine what variations of the computer science education interventions were most effective, under which settings, with which types of participants, and under what research conditions. It was only possible to determine what effect the past computer science programs, in general, had on computer science achievement and whether there was a difference between the Nature-Computer Camp programs and other programs.

Conclusion
In summary, this review reported on an analysis of 29 evaluation reports of computer science education programs. Most of the programs that were evaluated offered direct computer science instruction to general education high school students in North America. Most frequently, evaluators examined stakeholder attitudes, program enrollment, academic achievement in core courses, and achievement in computer science. The most frequently used measures were questionnaires, existing sources of data, standardized tests, and teacher- or researcher-made tests. The pooled effect size for eight programs that administered teacher- or researcher-made tests of computer science achievement was 1.10, which is statistically equivalent to 73 out of 100 students who participated in the program having made an improvement in computer science achievement.
The implication of this research for the practitioners in and designers of K-12 computer science education programs is that programs that concentrate on student instruction are effective in increasing computer science achievement. Such programs range from short, repeated, off-campus programs that combine computer science education with other subjects (such as Nature-Computer Camp) to programs that are a part of the regular school curriculum (such as the programs reported in Durward, 1973, or Still, 1985). It is not known whether other types of programs (e.g., programs that concentrate only on teacher instruction) are effective. It also remains to be seen which aspects of those programs lead to increased achievement and which do not. Unfortunately, the reporting of the activities involved in those programs is insufficient for a practitioner or program planner to replicate those programs in detail, and it is also insufficient for a researcher to investigate which aspects of the program lead to increases in achievement. Another finding of import to computer science education practitioners and program funders is that there have been surprisingly few programs designed to bridge the gender gap in computer science. Only 3 out of the 19 evaluations investigated here were intended to help bridge that gap.
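The step from an effect size of 1.10 to a statement about students "out of 100" can be illustrated with a binomial effect size display (BESD) in the sense of Rosenthal and Rubin. The sketch below is not necessarily the conversion used in this review, which this section does not spell out. It assumes the common equal-group-size formula for turning a standardized mean difference d into a correlation r; the function names are hypothetical.

```python
import math
from typing import Tuple


def d_to_r(d: float) -> float:
    """Convert a standardized mean difference d to a correlation r.

    Assumption: equal group sizes, so r = d / sqrt(d^2 + 4).
    """
    return d / math.sqrt(d * d + 4.0)


def besd(d: float) -> Tuple[float, float]:
    """Return the BESD 'improved' and 'not improved' proportions for d."""
    r = d_to_r(d)
    return 0.5 + r / 2.0, 0.5 - r / 2.0


if __name__ == "__main__":
    improved, not_improved = besd(1.10)
    print(f"r = {d_to_r(1.10):.3f}")
    print(f"improved = {improved:.3f}, not improved = {not_improved:.3f}")
```

Under these assumptions the display comes out at roughly 74 per 100, close to the 73 reported above. The small difference presumably reflects rounding of the pooled effect size or a slightly different conversion; the sketch is meant only to show the kind of arithmetic behind the "out of 100" reading.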
The results presented here can also help identify some of the strengths and weaknesses of the current methods of computer science education evaluation. One strength is that the methods of data collection align well with the outcomes being investigated. For example, tests or direct observations, rather than self-reports of learning, tend to be used to measure computer science achievement. Another strength is that when experimental designs are used, they tend to be of high quality and the experiments are adequately controlled. Also, computer science education evaluators tend to use a wide variety of approaches to investigate their questions, from survey research, to experimental research, to qualitative research.
Concerning the weaknesses, first, although teacher- or researcher-made tests have much ecological validity (i.e., they are not outside of the scope of how students are used to being evaluated), those types of measures usually lack data about their reliability or validity. Of all the evaluations reviewed here, only one used a standardized test that had validity or reliability information; however, that test (Palormo, n.d.) is no longer available. Therefore, there is a dire need for standardized, reliable, and valid measures of K-12 computer science achievement.
Second, because there is such a low degree of enrollment and such a high degree of attrition in postsecondary computer science education, it seems appropriate that student or teacher attitudes about a program were a frequently measured outcome; the rationale is that increased satisfaction with a program will increase enrollment and decrease attrition. However, it is surprising that computer science achievement, which is the obvious goal of most computer science education programs, is only the fourth most frequently measured outcome, behind stakeholder attitudes about the program, program enrollment, and achievement in core courses, in that order. This could also be related to the fact that there are no standardized, reliable, and valid measures of K-12 computer science education achievement.
Third, gender was examined as a mediating or moderating variable in only 3 out of 19 evaluations (see Table 4). This is a surprising finding given the egregious gender gap in the field of computer science (Gürer and Camp, 2002).
Fourth, and finally, the descriptions of program activities in evaluation reports tend to provide too little detail for other practitioners to replicate the program or for researchers to investigate the links between the different kinds of program activities and program outcomes. Understandably, evaluation research is primarily meant to answer questions that are of interest to local stakeholders; however, simply reporting program activities in more detail would make program results more useful to others outside of the program (such as practitioners in and evaluators of other similar programs).
Based on the implications mentioned above, two major challenges for computer science evaluation (and research) become clear. Those challenges are (a) to develop standardized, reliable, and valid measures of K-12 computer science achievement that are aligned with Tucker et al.'s (2003) computing curriculum, (b) to investigate whether program activities have differential outcomes based on gender, and (c) to begin to attempt to causally link program activities with program outcomes. It is clear that computer science education works; what is significantly less clear is what aspects of computer science education work best, for whom, and why.
, were survey research, qualitative research, causal-comparative research, experimental/quasi-experimental research, correlational research, or classification research. The definitions for each category are explained briefly below. Survey research describes the characteristics of a population without comparing groups or making causal conclusions. Qualitative studies explain a phenomenon through what Mohr (1999) calls physical causal reasoning, through what Scriven (1976) calls the modus operandi approach, or through what Shadish, Cook, and Campbell (

Table 1
Description of program evaluations included in this review ; Means et al., 2001 A series of reports that used a variety of designs and measures to evaluate the Silicon Valley Challenge 2000 program from 1998 to 2001.

Table 2
Demographic characteristics: region, source, and decade of publication

Table 3
Program characteristics: grade level, target population, curriculum area, activities

Table 5
Methodological characteristics: measures

Table 6
Methodological characteristics: type of inquiry, experimental design, and study quality

Table 7
Aggregated and disaggregated effect sizes for computer science achievement

Table 8
Binomial effect size display for computer science achievement (d = 1.10)