Assessing Product Creativity in Computing Education: A Systematic Mapping Study

Creativity has emerged as an important 21st-century competency. Although it is traditionally associated with arts and literature, it can also be developed as part of computing education. Therefore, this article presents a systematic mapping of approaches for assessing creativity based on the analysis of computer programs created by students. As a result, only ten approaches, reported in eleven articles, were identified. These reveal the absence of a commonly accepted definition of product creativity customized to computing education, confirming only originality as one of the well-established characteristics. Several approaches seem to lack clearly defined criteria for effective, efficient and useful creativity assessment. Diverse techniques are used, including rubrics, mathematical models and machine learning, supporting both manual and automated approaches. Few approaches have undergone a comprehensive evaluation regarding their reliability and validity. These results can help instructors to choose and adopt assessment approaches and guide researchers by pointing out shortcomings.


Introduction
In today's rapidly changing digital society, creativity is considered one of the main 21st-century competencies essential for professional and personal success (Kaufman and Beghetto, 2009; Voogt and Roblin, 2012). Consequently, developing students' creativity from an early age has become a dominant concern (Voogt and Roblin, 2012; Beghetto, 2010). Diverse curriculum frameworks also explicitly express the need for K-12 schools to foster creativity (P21, 2020; ISTE, 2020; Voogt and Roblin, 2012). Supplying students with opportunities to engage in creative ways can help them to develop the capacities to undertake work that cannot easily be automated and address increasingly complex challenges with out-of-the-box solutions (Sternberg, 2015).
And, although creativity is traditionally associated with arts, music, and literature, it can also be developed as part of other knowledge areas, such as computing, for which design, research, and innovation are required. Teaching the solution of computational problems by creating novel and appropriate/useful computer programs can allow students to express their ideas (Romero et al., 2017). Computing may also nurture competencies, such as imagination, visualization, and abstraction, to solve problems creatively (Clements and Gullo, 1984; Yadav and Cooper, 2017; Grover and Pea, 2013), while, on the other hand, creative skills enhance solving algorithmic problems, creating computational artifacts, and developing new knowledge (Shell et al., 2014). In K-12 education, this is typically introduced by teaching the development of computer programs using visual block-based programming languages, such as Scratch, BYOB/Snap! or App Inventor (Lye and Koh, 2014). In order to stimulate creativity, programming activities are often posed as open-ended, ill-defined problems in a constructivist context, adopting a problem-based and design-based learning strategy as "learning-by-making" (Bjögvinsson et al., 2012).
Yet, a critical dimension of education is assessment, both to measure the degree to which learning objectives have been achieved, expressed as grades, and as a feedback mechanism to guide the learning and teaching process. However, assessing creativity is challenging (Henriksen et al., 2015). Especially in the context of problem-based learning, performance assessments can be conducted concerning the creative product as one crucial strand of creativity (Rhodes, 1961), representing the outcome of the students' learning process (Ritchie, 2001). Such assessments typically measure whether the properties assigned to creative products are present in the outcomes and to what degree.
Although product creativity is commonly defined in terms of novelty and appropriateness (Jackson and Messick, 1964), that is, how much a product differs from the norm and meets the practical needs of the problem situation, there exists no single definition of creativity. Frequently, other terms such as style, including "organic, well-crafted, elegant" (O'Quin, 1987), and transformation, which refers to "some objects combine elements in ways that defy tradition" (Jackson and Messick, 1964), are also used in the definition of product creativity.
Yet, the question remains which approaches exist for assessing creativity based on computer programs as outcomes of computing education. And, although a variety of reviews on creativity assessment already exists, they mostly target general approaches to measuring creativity and do not focus on specific strands (Said-Metwaly et al., 2017). Bolden et al. (2019) and Snyder et al. (2019) provide overviews of the assessment of creativity in any discipline in K-12, not specifically targeting computing education. Very few reviews are related to creativity in the context of computer science, analyzing approaches for leveraging creativity in agile requirements engineering (Aldave et al., 2019) and individual creativity support systems (Wang and Nickerson, 2017). However, these reviews in the computing context typically do not focus on an educational perspective nor specifically on the product.
On the other hand, reviews on assessments based on the analysis of computer programs typically focus only on computational thinking concepts, not covering creativity (Araujo et al., 2016; Alves et al., 2019; Cutumisu et al., 2019). Other studies, such as Clements (1995) and Scherer et al. (2019), focus on the effects of learning computer programming on process creativity, rather than on product creativity measurement. Typically, psychological tests such as the Torrance Tests of Creative Thinking (TTCT) are used in these studies to assess creativity with respect to the creative process and not the product (Scherer et al., 2019).
Thus, despite the recognized importance of assessing creativity with respect to the creative product created as an outcome of computing education, a detailed overview of how to design such assessments in a reliable and valid way, useful to both learners and teachers, does not currently exist. Therefore, we aim to advance the understanding of how assessment can support and promote creativity in the aforementioned classroom context by performing a systematic mapping study. The results of the study can help instructors to systematically choose and adopt effective assessment approaches, as well as guide researchers by pointing out shortcomings in the existing approaches.

Measuring Creativity
While there are many definitions of creativity, there are also disagreements leading to a lack of a standard definition and an inconsistent understanding (Walia, 2019). Yet, as creativity is multidimensional and can be represented from different perspectives, the way it is defined influences how to measure and assess creativity (Walia, 2019).
Creativity is generally defined in terms of the capacity to generate new, original, or surprising ideas and solutions (Walia, 2019). A creative idea or product is considered original if it represents something novel or surprising that did not exist before (Runco and Jaeger, 2012). However, novelty alone does not make something creative. Creative ideas or products also have to include an underlying value and usefulness, providing solutions that are appropriate, functional, correct, and valuable (Jackson and Messick, 1964). Most definitions of creativity thus identify originality and appropriateness as key characteristics of creative outcomes (Besemer and Treffinger, 1981; Runco and Jaeger, 2012). Beyond these common characteristics, several definitions also consider additional elements. Examples include "wholeness", considering aesthetic dimensions situated within the work's specific context, as proposed by Mishra and Henriksen (2013), and detail and elegance (Besemer and Treffinger, 1981), among others.
Aiming at a better understanding, diverse researchers have also proposed ways to structure the definition of creativity. Among those is Rhodes' (1961) widely recognized Four P's framework for creativity (Fig. 1), distinguishing the perspectives of person, process, press, and product. While all four aspects of the framework play a role in understanding creativity, the analysis of the creative product as one strand may allow insights into the concept of creativity as a whole. Thus, although the assessment of creativity cannot be confined to the point of view of the product (Ritchie, 2001), it certainly represents a vital part in providing "a means for establishing referents for the concept 'creativity' through a systematic evaluation of things which people produce" (Skager et al., 1966).

Product Creativity
In this respect, a product-based approach measures whether the properties typically assigned to creative products are present in the outcomes and to what degree. Yet, again, there exists no conclusive set of such criteria, although the criteria for measuring creative products have been widely discussed (Besemer and Treffinger, 1981; Ranjan et al., 2018). Originality, appropriateness, and condensation are considered by most authors to be the most important characteristics of a creative product, often under varying terms (Table 1).
There are other characteristics defined by some authors that represent extreme forms of the characteristics presented in Table 1, such as transformation, which refers to combining "elements in ways that defy tradition and that yield a new perspective" and "forces us to see reality in a new way" (Jackson and Messick, 1964). However, these can be grouped with the core characteristics, for example by understanding transformation as an extreme form of originality, a product so revolutionary "that it forces a shift in the way that reality is perceived" (Besemer and Treffinger, 1981).

Assessment of Product Creativity
In everyday life, assessing creativity happens naturally, but in the classroom, it must move beyond such subjective measurements (Mishra and Henriksen, 2013). And, although some assessment approaches are considered "gold standards", such as the Torrance Tests of Creative Thinking, a psychological measurement of an individual's divergent thinking, they may provide little practical use in the classroom (Kaufman et al., 2016). Especially when considering assessment in active learning environments using problem- and design-based strategies following a constructionist theory, it becomes clear that measuring the outcomes of these practical learning experiences plays a crucial role and has the potential to be highly authentic (Bialik et al., 2016). In this context, a widely used alternative is the assessment of product creativity, seeking to evaluate the outcomes of the creative process created by the students (Long, 2014; Mishra and Henriksen, 2017). These "products" in the context of computing education are typically computer programs, such as games or mobile applications.
This kind of assessment typically focuses on an analysis of the product to measure the creativity of the output by some standard, verifiable (reliable) measure (Table 2). Many approaches use a scale, such as the Student Product Assessment Form (SPAF) (Renzulli and Reis, 1991). The SPAF uses a 5-point ordinal rating scale (from poor to outstanding) to analyze creativity criteria related to quality, care, attention to detail, appropriateness, and originality (Renzulli and Reis, 1991). Another example is the Creative Product Semantic Scale proposed by O'Quin and Besemer (1989).

Table 1
Characteristics and other terms widely used for the definition of product creativity

Originality: refers to a product's "level of surprisingness, and its projected germinal qualities (characteristics related to perceived influence in suggesting spin-offs or other new products)" (O'Quin, 1987), and to the extent of "newness of the product: in terms of the number and extent of new processes, new techniques, new materials, new concepts included" (Besemer and Treffinger, 1981).

Appropriateness: "measures the extent to which a product meets the practical needs of the problem situation" (O'Quin, 1987). "To be appropriate a product must fit its context. It must 'make sense' in light of the demands of the situation and the desires of the producer" (Jackson and Messick, 1964).

Condensation: refers to "the fact that sometimes the initial design is elaborated and made more complex through working out the solution and other times (or simultaneously) the design may be refined and made simpler" (Besemer, 2000). "It considers the aspects of style or production values" (O'Quin, 1987). "In the highest forms of creative condensation the polar concepts of simplicity and complexity are unified" (Jackson and Messick, 1964).

In everyday settings, such judgments of creativity may be made intuitively (Mishra and Henriksen, 2013). Yet, in the context of open-ended assignments in problem- and design-based learning, these judgments become more complex and difficult, leaving educators with subjective assessments (Dousay, 2018). This may lead to inaccurate results, especially for interdisciplinary educators who may lack the competence for an accurate assessment of computer programs. In this context, one of the most popular approaches is the Consensual Assessment Technique (CAT) (Amabile, 1982), which relies on a panel of expert judges (instructors or peers) to assess the creativity of the product, analyzing several different characteristics depending on the panel and aiming at consensus (Table 2).
Comparing existing product assessment approaches, differences concerning the assessment criteria become obvious (Long, 2014) (Table 2). However, the reliable and valid definition of these criteria is essential to measure creative products and identify what makes them different from non-creative products.

Creativity in Computing Education
While creativity is traditionally associated with arts, music, and literature, it can likewise be developed as part of other knowledge areas, such as computing. Computing is recognized as a creative human activity that allows the exploration and creation of knowledge, enables innovation, and allows individuals to deploy technology towards creating novel artifacts (Mishra and Yadav, 2013). Teaching computing in school typically focuses on computational thinking, aiming at expressing solutions as computational steps or algorithms that can be carried out by a computer (CSTA, 2016). It involves solving problems, designing systems, and understanding human behavior by drawing on the concepts fundamental to computer science (Wing, 2006). Computational thinking has also been linked to creativity and innovation, as it provides learners with the opportunity to express their ideas and create innovative programs for new and unexpected problems (DeSchryver and Yadav, 2015). Computational thinking may nurture competencies, such as imagination, visualization, and abstraction, to solve problems creatively by applying computational thinking principles in problem-solving (Yadav and Cooper, 2017), while, on the other hand, creative skills enhance solving algorithmic problems, creating computational artifacts, and developing new knowledge (Shell et al., 2014).
Moreover, the creative use of digital technologies to solve diverse problems engages students in an active design and creation process using computational concepts and methods (Romero et al., 2017). By moving from computational thinking to computational making (Rode et al., 2015), computing education in K-12 allows students to learn to create, test and refine computer programs (Shute et al., 2017; CSTA, 2016; Lye and Koh, 2014), enabling them to creatively express themselves, concretize their ideas, and develop diverse and innovative ways to build and to learn (Clements, 1995; Grover and Pea, 2013). In this respect, creativity is one of the keys to responding to common challenges in the development of computer programs today (Robertson, 2005), as programming is not only about writing computer programs but also about competencies:
• To analyze context and requirements (Robertson, 2005).
• To ideate novel, useful and technically feasible solutions (Romero et al., 2017).
• To design a computer program by modeling data and architecture (Gu and Tong, 2004).
• To design a usable and visually attractive user interface (Rode et al., 2015; Ferreira et al., 2019).
• To implement and test code (Glass, 1995).
Thus, in this context, creativity can be seen as an ability to apply imagination to create a computer program that is judged to be novel and appropriate, useful, and valuable, providing solutions to practical problems (DeSchryver and Yadav, 2015).
Many computing education initiatives in K-12 introduce algorithms & programming concepts by using visual block-based programming languages, such as Scratch, BYOB/Snap! or App Inventor (Lye and Koh, 2014). Programming activities are often posed as open-ended, ill-defined problems in a constructivist context, adopting a problem-based and design-based learning strategy (Bjögvinsson et al., 2012). Thought of as "learning-by-making" driven by the maker movement, this creative computing approach inspires learners to create their own outcomes, engaging them in the construction of digital and tangible artifacts through the use of technologies (Rode et al., 2015; Brennan et al., 2019). And, in order to provide students with the opportunity to do computing in ways that have a direct impact on their lives and their communities, often a perspective of computational action (Tissenbaum et al., 2019) is adopted, focusing on real-world problems, usually in an interdisciplinary way (Dousay, 2018). These activities aim to stimulate the development of higher-order thinking competencies by not prescribing a correct or best solution in advance and by giving students more freedom to choose abstract concepts for creating a solution. As a result, students create their own animations, games, or mobile applications to solve real-world problems, providing opportunities for students "to extend their creative expression to solve problems, create computational artifacts" (Yadav and Cooper, 2017).

Definition and Execution of the Systematic Mapping Study
In order to elicit the state of the art on approaches for assessing product creativity (or some of its characteristics) based on the analysis of computer programs developed by students in an educational context as an outcome of the learning process, we performed a systematic mapping following the procedure defined by Petersen et al. (2015).

Definition of the Review Protocol
Research Question. Which studies exist for the assessment of product creativity, or some of its characteristics, of computer programs in the educational context?
We refined this research question into the following analysis questions:

AQ1. Which studies exist and for what kind of product and educational stage?
AQ2. What is the definition of the product creativity characteristics being assessed?
AQ3. How are these creativity characteristics analyzed?
AQ4. What is the context and sample size of the application of the approach?
AQ5. If, and how, has the approach been evaluated?
Data source. We examined all published English-language articles that are available on Scopus, the largest abstract and citation database of peer-reviewed literature, including publications from ACM, Elsevier, IEEE and Springer with free access through the Capes Portal.
Inclusion/exclusion criteria. We considered only peer-reviewed English-language articles that present a form of assessment of product creativity, or some of its characteristics, based on the analysis of computer programs. Here we understand assessment in a broad sense, including any kind of measurement or analysis of the creative product in an educational context, also considering approaches that do not necessarily include grading and/or a feedback mechanism. We only consider studies within an educational context. Due to the sparseness of research in this area, we include approaches for all educational stages. We consider articles published until December 2019.
On the other hand, we exclude studies that assess creativity only with respect to strands other than the product. Thus, studies focusing exclusively on press, person, and process creativity are excluded. We consider only articles that present substantial information to enable the extraction of relevant information regarding the analysis questions; therefore, abstract-only or one-page articles are excluded.
Definition of the search string. Following our research objective, we defined the search string by identifying core concepts, also considering synonyms, as indicated in Table 3. The term creativity was chosen, as it expresses the main concept to be searched. As originality is one characteristic widely accepted in the field (Runco and Albert, 2010), we chose to use it as a synonym of creativity to broaden the search results. Although originality alone is not sufficient to classify a product as being creative, independently of what other positive qualities it may have, it is generally considered an important characteristic for a creative product to possess (Jackson and Messick, 1964). We did not use other synonyms of originality, such as novelty, infrequency or unusualness, as these terms are used in many contexts with meanings unrelated to creativity. We also focused the search on the keyword assessment, including synonyms that are commonly used in the educational context. Keywords related to the educational context were chosen to limit results to this specific context. Considering our focus on computing education, specifically the assessment of product creativity based on computer programs, we also included terms related to this domain. We use computational thinking as a synonym for programming/coding. And, although computational thinking covers a much wider field than just programming and coding, it is frequently used as a synonym for these terms in the literature (Armoni, 2016). We use wildcard characters to cover as many variations of the terms as possible, such as creativ* representing "creative" and "creativity".
Using these keywords, the search string has been calibrated and adapted in conformance with the specific syntax of the data source:

TITLE-ABS-KEY(( creativ* OR original* ) AND ( assess* OR measur* OR evaluat* OR analy* ) AND ( "K-12" OR school OR education OR learning ) AND ( coding OR programming OR "computational thinking" ))

Table 3
Keywords

Creativity: creativ*, original*
Assessment: assess*, measur*, evaluat*, analy*
Educational context: "K-12", school, education, learning
Computing: coding, programming, "computational thinking"

Execution of the Search
The search was executed in January 2020 by the first author and revised by the co-authors. As Scopus allows filtering results by field, we chose to exclude works in unrelated fields such as Medicine (225), Biochemistry, Genetics, and Molecular Biology (185), Agricultural and Biological Sciences (158), Business, Management and Accounting (156), Health Professions (34), Nursing (26), Economics, Econometrics, and Finance (24), Pharmacology, Toxicology, and Pharmaceutics (13), Immunology and Microbiology (12), Dentistry (2), and Veterinary (1). In the first analysis stage, we quickly reviewed the titles, abstracts, and keywords of all filtered search results (1,837 articles) to identify articles that match the exclusion criteria, resulting in 89 potentially relevant articles. In the second stage, we analyzed the full text of the pre-selected articles. We found 11 articles that analyze product creativity based on computer programs created by the students. All authors participated in the selection process and discussed the selection of papers until a consensus was reached (Table 4).

Table 4
Quantity of articles per selection stage

Initial search results (after field filtering): 1,837
Potentially relevant after title/abstract/keyword analysis: 89
Selected after full-text analysis: 11

Many articles were excluded based on the analysis of their abstracts, as they are related to other fields such as video coding, artificial intelligence, and deep learning, in which 'original' is a term widely used to describe, for example, datasets. In addition, a large number of articles using the term 'originality' with a different meaning, only to indicate the novelty of a study, were excluded. During the full-text analysis of the remaining articles, several were excluded as they present approaches exclusively analyzing creativity concerning other strands that are outside the focus of our research, such as press (Engelman et al., 2017), process (Perez-Poch et al., 2016) or person (Engelman et al., 2017). As a result, we identified a total of 11 articles relevant to our research objective (Table 5). All selected articles were published within the last nine years, as shown in Fig. 2, which also indicates the recent importance of this topic. As Koh et al. (2011) and Bennett et al. (2013) present the same approach, Computational Thinking Patterns (CTP) and creativity analysis through Computational Thinking Pattern Analysis (CTPA), both articles are grouped in the analysis.

Data Analysis
In this section, we present our findings for each of the analysis questions.

Which studies exist and for what kind of product and educational stage?
We found 11 articles describing 10 approaches, as two articles present the same approach from different perspectives (Koh et al., 2011; Bennett et al., 2013). The approaches purposefully analyze a student's program created as an outcome of the learning process to assess creativity, identify characteristics of product creativity, and study its relationship with other concepts, e.g., computational thinking. In the computing education context, the products assessed vary from games, solutions to well-defined programming activities (Manske and Hoppe, 2014; Gal et al., 2017), mobile apps developed in class and published to public galleries (Turbak et al., 2017), and projects resulting from creative programming activities (Romero et al., 2017), to free-choice, open-ended projects (Grover et al., 2018; Basu, 2019). Some products are the result of a well-defined activity (with a solution known in advance) or an ill-defined activity (without, or with more than one, solution known in advance) (Table 6). The analysis of creativity based on the students' computer programs is provided for diverse programming environments/languages, especially block-based visual programming environments such as Scratch and App Inventor that are typically used for computing education in K-12 (Fig. 3). Some studies are also conducted in a more generic manner, covering more than one programming environment/language, e.g., using the rubrics proposed by Basu (2019) or Grover et al. (2018).
The approaches target different educational stages. The majority of studies were designed for some stage in K-12 education (Bennett et al., 2013; Basu, 2019; Grover et al., 2018), while others target higher education (Romero et al., 2017; Mustafaraj et al., 2017; Turbak et al., 2017) (Fig. 4). One approach uses as input data from a website that contains a set of well-defined programming activities (Manske and Hoppe, 2014) without clearly indicating the considered educational stage.

What is the Definition of the Product Creativity Characteristics Being Assessed?
In alignment with the specific focus of this review, all articles focus on product creativity only, except for the work by Hershkovitz et al. (2019), which also analyzed characteristics related to the creative process. One article does not report which characteristics of creativity were analyzed (Romero et al., 2017).
Detailing the specific characteristics of product creativity that are assessed, we observe that the most analyzed characteristic is originality, analyzing the newness of the product. And, although authors use different terms, such as novelty (Basu, 2019; Grover et al., 2018) or divergence (Bennett et al., 2013; Koh et al., 2011), they refer to the same concept of originality. Grouping these terms, originality was analyzed by 8 of the 10 approaches that report the analyzed characteristics (Fig. 5).
Typically, originality is assessed by comparing the student's computer program with a specific set of computer programs. This set can contain all other students' computer programs (Gal et al., 2017), or pre-programmed solutions and patterns for well-defined activities. The indicator of originality, novelty or divergence is then measured by the extent to which the student's computer program differs from this set. Originality is also assessed using more subjective criteria, as proposed by Grover et al. (2018) and Basu (2019), requiring the instructor to rate computer programs on an ordinal scale as "not very novel, some novelty, or very novel" (Basu, 2019).
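To illustrate the frequency-based variant of this idea, the following minimal Python sketch scores a solution's originality as the complement of its relative frequency among all submitted solutions, similar in spirit to the log-based measurement of Gal et al. (2017). The canonicalize step and all names are illustrative assumptions; the reviewed approaches operate on richer program representations.

```python
from collections import Counter

def canonicalize(program: str) -> str:
    """Very crude normalization so trivially different texts compare equal.
    (Illustrative assumption; real approaches compare program structure.)"""
    return " ".join(program.split()).lower()

def originality_scores(programs: list[str]) -> list[float]:
    """Score each program as 1 - relative frequency of its normalized form:
    a solution shared by many students scores near 0, a unique one near 1."""
    forms = [canonicalize(p) for p in programs]
    counts = Counter(forms)
    n = len(programs)
    return [1.0 - counts[f] / n for f in forms]

if __name__ == "__main__":
    submissions = [
        "move(10); turn(90)",
        "move(10); turn(90)",
        "repeat 4: move(10); turn(90)",
    ]
    for prog, score in zip(submissions, originality_scores(submissions)):
        print(f"{score:.2f}  {prog}")
```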
The characteristic condensation seems to be fuzzier to assess than originality, as it depends on a domain-specific interpretation (Manske and Hoppe, 2014). Unlike product originality, on which some agreement can be reached independently of the domain, terms related to condensation, such as sophistication and elegance, are very domain-specific, without a general agreement outside specific domains. In this regard, in the context of software engineering, elegance is measured using software metrics where "experts infer a weighting and interpretation to these metrics" (Manske and Hoppe, 2014). Other terms related to condensation, such as completeness and standardization, are more straightforward to assess. For example, completeness is measured by verifying whether the computer program is complete, and standardization by whether the computer program follows some defined pattern or formatting rule (Zhong et al., 2015).
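As a sketch of the kind of software metrics mentioned above, the snippet below computes two simple features for a Python program: effective lines of code and a rough cyclomatic complexity approximation. The metric choice is an assumption for illustration, since Manske and Hoppe (2014) leave the weighting and interpretation of such metrics to experts.

```python
import ast

def effective_loc(source: str) -> int:
    """Count non-blank, non-comment lines (a simple 'effective LOC' metric)."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

def cyclomatic_complexity(source: str) -> int:
    """Rough approximation: 1 + number of branching constructs in the AST."""
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

if __name__ == "__main__":
    program = """
def classify(n):
    # even/odd with a guard
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"
"""
    print("effective LOC:", effective_loc(program))
    print("cyclomatic complexity:", cyclomatic_complexity(program))
```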
Appropriateness is assessed by only a few approaches. Manske and Hoppe (2014) use the term usefulness and define that it is achieved if the student's computer program is correct for the activity for which it was submitted (Manske and Hoppe, 2014). Basu (2019) uses the term correctness in a similar way and defines more subjective assessment criteria, ranging from "the programs contain several errors" to "program runs correctly without error and the output is appropriate".

How are these Creativity Characteristics Analyzed?
The approaches encountered in this mapping vary largely concerning the type of assessment and the methods and techniques used. With respect to who performs the assessment, we found instructor assessment using rubrics (Grover et al., 2018; Basu, 2019), expert assessment based on personal knowledge as input to automated assessment (Manske and Hoppe, 2014), as well as automated assessment based on techniques from computer science and mathematics (Bennett et al., 2013; Gal et al., 2017; Mustafaraj et al., 2017; Turbak et al., 2017). Studies that focus on the relationship between creativity and computational thinking analyze characteristics of the process rather than the product, using psychological tests, such as the Torrance Tests of Creative Thinking (Hershkovitz et al., 2019). Half of the approaches propose assessment techniques that are performed manually, while the other half are automated. One way of assessing the creative product is the instructor manually assessing the outcome created by the students using rubrics (Fig. 6). A rubric consists of a matrix of criteria and performance levels for each criterion. Such rubrics typically contain one or more criteria related to creativity or its characteristics (Fig. 6), e.g., novelty or originality and condensation or engagement, along with the respective performance levels (Grover et al., 2018; Basu, 2019; Zhong et al., 2015).
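As a minimal sketch of such a rubric, the snippet below encodes hypothetical creativity criteria with ordinal performance levels and sums an instructor's ratings into a product score. The criteria names and level labels are invented for illustration, not taken from any of the reviewed rubrics.

```python
# Hypothetical creativity rubric: each criterion maps ordinal performance
# levels to points (a 4-point scale, loosely inspired by Grover et al., 2018).
RUBRIC = {
    "novelty":      {"not novel": 1, "some novelty": 2, "novel": 3, "very novel": 4},
    "correctness":  {"many errors": 1, "some errors": 2, "minor errors": 3, "runs correctly": 4},
    "condensation": {"verbose": 1, "adequate": 2, "refined": 3, "elegant": 4},
}

def score_product(ratings: dict[str, str]) -> int:
    """Sum the points of the instructor's chosen level for each criterion."""
    return sum(RUBRIC[criterion][level] for criterion, level in ratings.items())

if __name__ == "__main__":
    ratings = {"novelty": "very novel", "correctness": "minor errors",
               "condensation": "adequate"}
    print("creativity score:", score_product(ratings))  # 4 + 3 + 2 = 9
```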
The approaches analyze product creativity using concepts from education, computer science, and mathematics. In addition, the study presented by Hershkovitz et al. (2019) also uses the TTCT test as a psychological instrument (Fig. 7) for measuring creativity concerning the process strand.
Several approaches aiming at the automation of the assessment adopt metrics with machine learning models to assess creativity (Manske and Hoppe, 2014) or to identify original projects (Mustafaraj et al., 2017). Some apply machine learning models, including regression methods, such as linear regression and support vector regression (Manske and Hoppe, 2014), and clustering methods, such as the Markov cluster algorithm and the k-nearest neighbors algorithm. The input for these algorithms included features gathered using statistical concepts, such as term frequency-inverse document frequency (TF-IDF) and the Jaccard index of similarity. Some features were defined using software engineering metrics, such as effective lines of code, visited lines of code, and cyclomatic complexity (Manske and Hoppe, 2014). Abstract language tokens (obtained during the lexical analysis phase) were also used for comparing distances between artifacts using string metrics (Manske and Hoppe, 2014). In general, supervised learning methods were used to train machine learning models to assess creativity (Manske and Hoppe, 2014), while unsupervised learning methods were adopted for the identification of the originality characteristic in projects (Mustafaraj et al., 2017). Studying the relationship between computational creativity and computational thinking, Hershkovitz et al. (2019) use the Torrance Tests of Creative Thinking (TTCT) Figural Test to capture the creative process. The output from the TTCT is then compared with the statistical infrequency of products, or solutions to programming exercises, created by the students. The Computational Thinking Pattern Analysis (CTPA) uses mathematical concepts to analyze creativity divergence in the students' programming solutions compared to previously defined patterns (Bennett et al., 2013; Koh et al., 2011). The authors use the cosine similarity to quantify the difference between the tutorial and the student's computer program.
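The following Python sketch illustrates the cosine-based comparison underlying CTPA's divergence measure: a program is represented as a vector of counts of computational thinking patterns, and the divergence from a tutorial solution grows as the cosine similarity shrinks. The pattern names and the exact divergence reading are assumptions for illustration; the published CTPA formulation differs in detail.

```python
import math

# Hypothetical pattern-count vectors: how often each computational thinking
# pattern (e.g., generation, absorption, collision) appears in a program.
TUTORIAL = {"generation": 2, "absorption": 2, "collision": 1}
STUDENT  = {"generation": 1, "absorption": 0, "collision": 4}

def cosine_similarity(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine of the angle between two sparse pattern-count vectors."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    sim = cosine_similarity(TUTORIAL, STUDENT)
    # One illustrative divergence reading: the further the student's pattern
    # use is from the tutorial's, the higher the divergence score.
    print(f"similarity: {sim:.2f}, divergence: {1 - sim:.2f}")
```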
Regarding scalability, automation, and robustness, some approaches also use automated analysis, such as the CTPA divergence analysis (Bennett et al., 2013; Koh et al., 2011) or the identification of originality in projects (Mustafaraj et al., 2017). The automated analysis allows anyone (including the students themselves) to assess the products quickly and receive instant feedback on their performance (Table 7). Regarding instructional feedback and grading, the approaches typically calculate a score for the student's computer program. Depending on the characteristic being assessed, some approaches use rating scales with performance levels specifying more complex product characteristics as the score increases. Others, such as Koh et al. (2011) and Bennett et al. (2013), calculate the scores mathematically (Table 8). On the one hand, approaches adopting machine learning classification models provide a result that indicates whether the computer program was classified as original or unoriginal, without assigning a score. On the other hand, machine learning regression models provide a score based on the expert rating scale in the datasets (Table 8). None of the approaches presents instructional feedback, such as tips or suggestions, to constructively guide the learning process based on the assessment results.

What is the Context and Sample Size of the Application of the Approach?
The majority of the approaches were applied in face-to-face K-12 school classes (Fig. 8), such as a middle school in a large urban school district in the Western US (Grover et al., 2018), a primary school in Spain (Hershkovitz et al., 2019), and a primary school in Changshu City, China (Zhong et al., 2015). Blended applications with face-to-face and online classes were applied in the middle school context (Bennett et al., 2013; Koh et al., 2011). Face-to-face university classes included a course at Wellesley College in the US and at the Université Laval in Canada (Romero et al., 2017).

Table 8
Assessment strategies for providing scores

Assessment strategy | Reference
Math formula for assessing a score on divergence (originality) | Bennett et al., 2013; Koh et al., 2011
4-point rating scale for assessing scores on product creativity characteristics | Grover et al., 2018; Basu, 2019
Thresholds for rarity, or the complement of the frequency of the solution among all correct solutions, for assessing a score on originality | Hershkovitz et al., 2019; Gal et al., 2017
Labeling as original or unoriginal using machine learning models for assessing originality | Mustafaraj et al., 2017; Turbak et al., 2017
7-star rating scale by experts used as input to machine learning models for assessing a score on product creativity | Manske and Hoppe, 2014
8-point scale for assessing scores on product creativity characteristics | Romero et al., 2017
5-point rating scale for assessing scores on product creativity characteristics | Zhong et al., 2015

Some of the approaches have also been evaluated using projects from repositories that support the sharing of projects among students, from which solutions created by students can be downloaded and analyzed. Here, specifically, the App Inventor Gallery as well as Project Euler (Manske and Hoppe, 2014) were used to obtain thousands of students' projects.
The sample size varies from small samples in the university context and in face-to-face classes (Hershkovitz et al., 2019) to large samples obtained from online project galleries (Manske and Hoppe, 2014). Some values were inferred based on the numbers provided by the authors; for example, Manske and Hoppe indicate that Project Euler (an online gallery) had 4,099,877 solutions in November 2012, yet do not explicitly state whether all solutions were used for the analysis. Therefore, we assumed that the sample size is equal to the number of projects, indicating this as an inferred value in Table 9.

If, and How, Has the Approach Been Evaluated?
Most articles do not present an evaluation of the approach, as this may have been outside the scope of the articles. Furthermore, Hershkovitz et al. (2019) assume the reliability and validity of the psychological test (TTCT) as the test has been widely evaluated beforehand.
Fig. 8. Context of data (application) of the works presented in the selected articles.

Table 9
Sample size variations per educational stage

Context | Subjects | Sample analyzed
Face-to-face K-12 school classes | ~80 to ~214 students | ~80 to ~1332 projects
Face-to-face and online K-12 school classes | ~296 students (inferred from projects analyzed) | ~296 projects
Face-to-face university classes | ~16 to ~120 students | ~120 to 902 projects
Project galleries | ~6000 to ~260717 users (inferred) | ~4099877 projects (inferred)

Exceptions are Basu (2019) for the evaluation of the defined rubric, as well as Manske and Hoppe (2014) and Mustafaraj et al. (2017) for the proposed machine-learning-based approaches. Basu (2019) used Cohen's kappa coefficient to analyze the inter-rater reliability of the proposed rubric. A value of 0.9 was found when scoring discrepancies among the teachers' scores were analyzed. After computing the coefficient, teachers also scored additional projects independently, providing an additional opportunity to refine the rubric based on their feedback.
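As a minimal sketch of such an inter-rater reliability check, assuming two raters scoring the same projects on an ordinal rubric, Cohen's kappa can be computed with scikit-learn; the rating data below is invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (1-4) given by two teachers to ten projects.
rater_a = [4, 3, 2, 4, 1, 3, 3, 2, 4, 1]
rater_b = [4, 3, 2, 3, 1, 3, 2, 2, 4, 1]

# Plain Cohen's kappa; passing weights="quadratic" would credit near-misses
# on ordinal scales instead of treating every disagreement equally.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```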
Manske and Hoppe (2014) performed a reliability evaluation of the agreement of the expert assessments that were used as input to the proposed machine learning model. Yet, the evaluation of the machine learning model did not provide meaningful results due to the lack of agreement between raters. As raters applied individual definitions of creativity inconsistently, this resulted in two different groups of agreement measured via Krippendorff's alpha coefficient, with low overall values (below 0.3) indicating no agreement between raters. However, they found a high agreement within the group of theorists from the educational context (Krippendorff's alpha of 0.729) and a medium agreement within the group of software-engineering-related experts from industry (Krippendorff's alpha of 0.552), indicating that the two groups can be separated in terms of assessing product creativity. Mustafaraj et al. (2017) analyzed the accuracy of the classification of original and unoriginal projects with respect to the Jaccard distance. They found an accuracy of 89% for both classes using a Jaccard distance threshold of 0.4. A smaller or larger threshold diminishes the accuracy for one of the classes, thus mislabeling it; e.g., more than 11% of original projects may be labeled unoriginal if a different distance threshold is used. This result particularly illustrates the importance of determining proper thresholds in classification approaches.
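The following sketch illustrates this kind of threshold-based labeling, assuming programs are represented as sets of code blocks or tokens: a project whose minimum Jaccard distance to previously seen projects exceeds the threshold is labeled original. The block sets are illustrative; Mustafaraj et al. (2017) derive both the representation and the threshold from App Inventor project data.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A intersect B| / |A union B|: 0 for identical sets, 1 for disjoint ones."""
    return 1.0 - len(a & b) / len(a | b)

def label_original(project: set[str], corpus: list[set[str]],
                   threshold: float = 0.4) -> str:
    """Label a project 'original' if it is sufficiently far (in Jaccard
    distance) from every project already in the corpus."""
    nearest = min(jaccard_distance(project, other) for other in corpus)
    return "original" if nearest > threshold else "unoriginal"

if __name__ == "__main__":
    corpus = [{"button", "click", "sound"}, {"button", "click", "label"}]
    remix = {"button", "click", "sound", "label"}
    novel = {"canvas", "sprite", "timer", "sensor"}
    print(label_original(remix, corpus))   # unoriginal (close to corpus)
    print(label_original(novel, corpus))   # original (far from corpus)
```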
Some authors also mention some sort of evaluation of the proposed approaches without providing further details. Grover et al. (2018) report joint discussions about a few projects to establish inter-rater reliability, yet without providing further information. Koh et al. (2011) argued that the divergence calculation used in their approach is supported by other data sources and that the validity of the approach is demonstrated by its uniqueness in three separate learning conditions. Hershkovitz et al. (2019) compared the results of the TTCT test for creative thinking with the assessment of the originality of the students' computer programs for well-defined problems. They found significant correlations between the two types of creativity measures, and in some cases "creativity in programming is positively associated with the broad construct of creativity" (Hershkovitz et al., 2019).

Discussion
Considering the importance of creativity as a 21st-century skill, only very few assessment approaches were encountered in the context of computing education with active learning strategies for assessing the student's creative product. And, although there already exists a considerable number of approaches for assessing computing education in general, these mostly focus exclusively on computational thinking concepts and practices (Moreno-León and Robles, 2015; Gresse von Wangenheim et al., 2018). Only very few of them also include assessment criteria related to creativity on the strand of the product (Basu, 2019), some of them in a superficial way, as one subjective criterion to be judged manually by the instructor or peers. Popular automated tools for assessing outcomes of computing education created with block-based programming languages, such as Dr. Scratch (Moreno-León and Robles, 2015) or CodeMaster (Gresse von Wangenheim et al., 2018), also do not assess creativity.
Analyzing specifically approaches for assessing the product creativity of computer programs created by students as part of computing education, it becomes clear that the definition of creativity strongly influences how approaches assess the product. Considering that the term depends on many variables, it is not sufficient to only define which strand is being analyzed. This issue is further complicated by context-dependency, as the characteristics analyzed can vary and the same terms can have different meanings in different contexts. For example, the usefulness of an app can be understood as whether the app allows the user to perform the desired tasks, while for a game it can be seen as whether the game is pleasurable to play. In this context, it seems that originality is one of the few well-established characteristics in the literature regarding product creativity.
Most of the approaches are not based on well-known product creativity assessment models and simply use the term with its common-sense meaning, without presenting an in-depth analysis of the field to define what comprises the assessment of product creativity. Some of the approaches focus on one product characteristic as a way of assessing creativity, excluding other characteristics typically considered in general product creativity assessment. As we also included originality in the search string, some approaches focus exclusively on originality (not on "creativity" assessment) and aim at assessing originality as a singular construct, which may not provide an in-depth analysis of creativity. Approaches that assess condensation include those that define the term using well-known models as well as others that define the concept only superficially. Terms used for condensation are completeness, elegance, sophistication, and engagement. Appropriateness is defined by only two approaches; again, one of them uses well-known models for its definition, while the other defines it superficially. The two terms related to appropriateness are correctness and usefulness.
As the definition of creativity also depends on the specific context, the existing approaches tailor well-known product creativity assessment characteristics to the context of computing education. Thus, the definition of each characteristic assessed is related to computing concepts. Originality is typically customized by comparing the students' computer programs to identify the frequency of different solutions. Condensation is customized, for example, as the usability of a well-designed interface or as software engineering metrics regarding the software architecture of the computer program (Manske and Hoppe, 2014). Appropriateness is measured by comparing the output of the student's computer program with the desired output for well-defined activities (Manske and Hoppe, 2014). However, for ill-defined activities there are so far only manual approaches that tailor creativity assessment to usefulness based on subjective criteria, such as whether the program runs with many or few errors.
Only a few approaches explicitly define customizations, yet they lack the indication of a theoretical background for several definitions. This shows that it is imperative to move towards a more precise definition of creativity in the context of computing education.
This would provide a shared understanding of the construct as a basis for the design and development of reliable and valid assessments.
Another issue is related to the object being assessed. In the context of product assessments, the approaches encountered are based on the assessment of a single product (in this case, computer programs). However, as typically creating a product includes also creating other intermediate outcomes, it may be important to consider not only the single end result of the learning process for the assessment but also these intermediate outcomes, such as requirements specifications, the interface design and/or test cases in the context of software development.
Half of the approaches encountered propose manual assessments, yet these may be biased and time-intensive to complete. This becomes especially problematic in the context of large classes or Massive Open Online Courses (MOOCs). Even if manual assessments provide a context-tailored result, it may be impossible to provide constant, timely feedback throughout the learning process. And, although such manual assessments can also rely on peers, with results that align with instructor assessments and thus reduce the instructor's assessment effort, they are still subjective and require substantial time and organization. These reasons may limit the utility of manual approaches as the sole assessment alternative.
Yet, as creativity is complex and multi-dimensional and can be expressed in diverse ways, an automated assessment of computer programs alone may not be sufficient to account for all its facets. Product-oriented approaches to the assessment of creativity are sometimes criticized for under-representing the creativity of individuals (Couger and Dangate, 1996). Thus, in order to capture creativity in a more comprehensive way, it may be beneficial to adopt diverse approaches, e.g., complementing an automated objective assessment of the product with a manual subjective assessment by peers and/or instructors. However, none of the approaches we encountered suggests such a strategy. Only one approach compares the results of product assessment with the results of the TTCT test, yet to study the relationship between creativity and computational thinking rather than to provide a holistic assessment of creativity.
In order to properly assess the concept, it is essential to provide a robust assessment model. Considering the complexity of the concept of product creativity and the lack of well-established definitions, a further shortcoming observed is the lack of more scientifically rigorous evaluations concerning the reliability and validity of the proposed assessment approaches. And, although several studies are based on considerable samples, typically using artifacts from product sharing platforms, such as App Inventor Gallery and Project Euler, these may not provide detailed context information. Thus, it is not possible to analyze these approaches for specific educational stages as the data comes from unknown origins. Therefore, the data of these studies may not be representative of the specific educational stage nor the specific target population with whom the approach will be used.
Given that assessors are critical in manual assessments, understanding who they are and what their level of expertise is, is also important, as it has a direct impact on inter-rater agreement and reliability. As the human judgment of creativity remains by nature an intrinsically subjective process, it is necessary to study to what degree the perception of the creativity of computer programs is consistent and not an idiosyncratic result of an assessor's subjective judgment. In this regard, Manske and Hoppe (2014) found that assessors with an educational theoretical background are more likely to provide a consistent assessment of product creativity in the context of computing education. Taking into consideration that teachers formally trained in computer science are currently scarce in K-12, and that self- or peer-assessment is conducted by students still learning computing, this question has to be considered carefully in the design of the assessment instruments. This is particularly important in order to assure consistency regarding the meaning of the assessment criteria and the performance levels among assessors.
In general, the approaches indicate as the result of the assessment only a performance level, typically on an ordinal scale. These scales are developed using Classical Test Theory, representing the creativity of the product as the sum of scores, e.g., in Zhong et al. (2015). Alternatives such as the definition of a scale based on Item Response Theory may be a more appropriate way of creating a construct for assessing product creativity. However, none of the encountered approaches uses Item Response Theory, although Myszkowski and Storme (2019) argue that Item Response Theory-based scoring can lead to a more appropriate and accurate estimation of the latent trait (the creative value of the product), also questioning common practices regarding the aggregation of ratings.
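To make the contrast concrete, a sum score under Classical Test Theory simply adds the ratings, whereas an Item Response Theory formulation, sketched here with a generic graded response model, estimates a latent creativity trait from the ratings; the notation is illustrative rather than taken from any of the reviewed approaches.

```latex
% CTT: the product's creativity X is the sum of the criterion scores x_j
X = \sum_{j=1}^{J} x_j

% IRT (graded response model): the probability that product i reaches at
% least performance level k on criterion j depends on a latent trait theta_i
P(x_{ij} \geq k \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_{jk})}}
```

Here, theta_i represents the latent creative value of product i, a_j the discrimination of criterion j, and b_{jk} the location (difficulty) of reaching level k; estimating the latent trait rather than summing raw scores is the kind of alternative Myszkowski and Storme (2019) advocate.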
The results of our analysis also point out a lack of more comprehensive instructional feedback. Except for a grade, no additional instructional feedback is given to the students based on the assessment results to guide the learning process. Such feedback, given constructively, is important for the student to understand the strengths and weaknesses of the product, and consequently the learning opportunities, as well as for the instructor to improve retention and knowledge transfer, e.g., through a comprehensive explanation of why a student's product is considered unoriginal. Most of the approaches also do not propose how to use these results as part of a summative assessment for grading. For example, Manske and Hoppe (2014) propose to use the score to classify the student's computer program. Differently from other automated approaches in the context of computing education, such as Dr. Scratch (Moreno-León and Robles, 2015) or CodeMaster (Gresse von Wangenheim et al., 2018), none of the approaches uses any kind of ludic representation of the assessment results (such as badges, ninja belts, etc.) to motivate students, especially in K-12.
Another issue that seems not to be considered is to what extent the assessment itself may inhibit creativity in schools, as learning experiences that involve comparisons to others, emphasis on extrinsic features of the task, and the pressure of being evaluated may cause anxiety and impair motivation and the capacity for creativity (Runco, 2003). Thus, these impacts of the assessment also need to be studied and taken into consideration when designing the assessment to minimize their consequences.
Considering that creativity is a central competence of the 21st century, the lack of wider research on the assessment of creativity on the strand of the product as part of computing education is surprising. Although the study of creativity in computing education on the strand of the process seems to have received more attention, this indicates a need for future work on the product strand to effectively and efficiently support the teaching and learning of creativity as part of computing education in practice.
The availability of reliable and valid approaches is also essential to systematically create a body of empirical evidence supporting the assumption that computing education also contributes to the development of creativity, especially on the strand of the product, as systematic research on this issue is still scarce mostly dating back to the 1980s and 1990s (Clements, 1995).
Threats to Validity. Systematic mappings may suffer from the omission of relevant studies. In order to mitigate this risk, we carefully constructed the search string to be as inclusive as possible, considering not only core concepts but also synonyms. We also included originality as one of the characteristics widely present for product creativity to include works analyzing this specific characteristic. We also searched multiple databases indexed by Scopus, which covers the majority of scientific publications in this field. Threats to the study selection and data extraction were mitigated by providing a detailed definition of the inclusion and exclusion criteria. We defined and documented a rigid protocol for the study selection and the selection was conducted by all co-authors together until consensus was achieved. The lack of a clear definition of product creativity in the context of computing education was also mitigated using a set of keywords related to the characteristics of general product creativity. And, although we found only 11 articles describing 10 approaches, the overview presented here shows an in-depth analysis of important aspects regarding the assessment of creativity proposed by the approaches. Data extraction was inferred in some cases, as the relevant information was not always explicitly reported. In these cases, the inference made by the first author and carefully reviewed by the co-authors was indicated throughout the article.

Conclusions
From this review, it becomes evident that despite a current trend towards the teaching and learning of creativity in K-12, approaches for the assessment of product creativity in the context of computing education are just emerging. We only encountered 10 relevant approaches aiming at the assessment of computer programs created mostly with block-based programming languages, such as Scratch and App Inventor, typically used in K-12, as well as a few targeting higher education. These revealed the lack of a commonly accepted definition of product creativity customized to the context of computing education, confirming only originality as one of the well-established characteristics. Furthermore, several approaches seem to lack clearly defined criteria for an effective, efficient and useful creativity assessment, especially in K-12. Diverse techniques are used, including rubrics, mathematical models as well as machine learning, supporting manual as well as automated approaches. However, very few performed an evaluation of the proposed approach, thus not providing results indicating reliability and validity. These results indicate the need for further research to support the assessment of product creativity in the context of computing education in a more effective, efficient way that can easily be adopted in educational practice, as well as the need for more robust (systematically defined and validated) assessment models for creating an empirical basis to study the development of creativity in K-12 educational contexts.