Approaches to Assess Computational Thinking Competences Based on Code Analysis in K-12 Education: A Systematic Mapping Study

As computing has become an integral part of our world, demand for teaching computational thinking in K-12 has increased. One of its basic competences is programming, often taught through learning activities without a predefined solution using block-based visual programming languages. Automatic assessment tools can support teachers with their assessment and grading as well as guide students throughout their learning process. Although such tools are already widely used in higher education, it remains unclear whether such approaches exist for K-12 computing education. Thus, in order to obtain an overview, we performed a systematic mapping study. We identified 14 approaches that focus on the analysis of the code created by the students, inferring computational thinking competencies related to algorithms and programming. However, an evident lack of consensus on the assessment criteria and instructional feedback indicates the need for further research to support a wide application of computing education in K-12 schools.


Introduction
The digital age has transformed the world and workforce, making computing and IT technologies part of our daily lives. In this context, it becomes imperative that citizens have a clear understanding of the principles and practice of computer science (CSTA, 2016). Therefore, several initiatives have emerged around the world to popularize the teaching of computing by including it in K-12 education (Bocconi et al., 2016). Teaching computing in school focuses on computational thinking, which refers to expressing solutions as computational steps or algorithms that can be carried out by a computer (CSTA, 2016). It involves solving problems, designing systems, and understanding human behavior by drawing on the concepts fundamental to computer science (Wing, 2006). Such a competence is valuable well beyond the computing classroom, enabling students to become computationally literate and fluent and to fully engage with the core concepts of computer science (Bocconi et al., 2016; CSTA, 2016).
Computing is typically taught by creating, testing and refining computer programs (Shute et al., 2017; CSTA, 2016; Lye and Koh, 2014; Grover and Pea, 2013). In K-12 education, block-based visual programming languages, such as Scratch (https://scratch.mit.edu), Blockly (https://developers.google.com/blockly/), BYOB/Snap! (http://snap.berkeley.edu) or App Inventor (http://appinventor.mit.edu/explore/) can be used to teach programming (Lye and Koh, 2014). Typically, programming courses include hands-on programming activities to allow students to practice and explore computing concepts as part of the learning process (Wing, 2006; Grover and Pea, 2013; Lye and Koh, 2014). This includes diverse types of programming activities, including closed and open-ended problems for which a correct solution exists (Kindborg and Scholz, 2006). Many computational thinking activities also focus on creating solutions to real-world problems, where solutions are software artifacts, such as games/animations or mobile apps (Monroy-Hernández and Resnick, 2008; Fee and Holland-Minkley, 2010). In such constructionist, problem-based learning environments, student learning centers on complex, ill-structured, open-ended problems for which no single correct answer exists (Gijselaers, 1996; Fortus et al., 2004; Lye and Koh, 2014).
An essential part of the learning process is assessment and feedback (Hattie and Timperley, 2007; Shute, 2008; Black and Wiliam, 1998). Assessment guides student learning and provides feedback to both the student and the teacher (Ihantola et al., 2010; Stegeman et al., 2016). For effective learning, students need to know their level of performance on a task, how their own performance relates to good performance and what to do to close the gap between those (Sadler, 1989). Formative feedback, thus, consists of informing the student with the intention to modify her/his thinking or behavior for the purpose of improving learning (Shute, 2008). Summative assessment aims to provide students with information concerning what they learned and how well they mastered the course concepts (Merrill et al., 1992; Keuning et al., 2016). Assessment also helps teachers to determine the extent to which the learning goals are being met (Ihantola et al., 2010).
Despite the many efforts aimed at dealing with the assessment of computational thinking (Grover and Pea, 2013; Grover et al., 2015), so far there is neither consensus nor standardization on strategies for assessing computational thinking (Brennan and Resnick, 2012; Grover et al., 2014). A clear definition of which assessment type to use seems to be missing, such as standardized multiple-choice tests or performance assessments based on the analysis of the code developed by the students. In this context, performance assessment seems to have a number of advantages over traditional assessments due to its capacity to assess higher-order thinking (Torrance, 1995; Ward and Lee, 2002). Thus, analyzing the student's code with respect to certain qualities may make it possible to assess the student's ability to program and, thus, to infer computational thinking competencies (Liang et al., 2009; Moreno-León et al., 2017). Yet, whereas the assessment of well-structured programming assignments with a single correct answer is straightforward (Funke, 2012), assessing complex, ill-structured problems for which no single correct solution exists is more challenging (Eseryel et al., 2013; Guindon, 1988). In this context, the assessment can be based on the assumption that certain measurable attributes can be extracted from the code, using rubrics to evaluate whether the students' programs show that they have learned what is expected. Rubrics use descriptive measures to separate levels of performance on the achievement of learning outcomes by delineating the various criteria associated with learning activities, and indicators describing each level to rate students' performance (Whittaker et al., 2001; McCauley, 2003). When used to assess programming activities, a rubric typically maps a score to the ability of the student to develop a software artifact, indirectly inferring the achievement of computational thinking competencies (Srikant and Aggarwal, 2013). Grades are determined by converting rubric scores to grades. Thus, the created outcome is assessed and a performance level for each criterion is assigned as well as a grade in order to provide instructional feedback.
Rubrics can be used manually to assess programming activities, yet this is a time-consuming activity requiring considerable effort (Keuning et al., 2016), which may also hinder the scalability of computing education (Eseryel et al., 2013; Romli et al., 2010; Ala-Mutka, 2005). Furthermore, due to a critical shortage of K-12 computing teachers (Grover et al., 2015), many non-computing teachers introduce computing in an interdisciplinary way into their classes, facing challenges also with respect to assessment (DeLuca and Klinger, 2010; Popham, 2009; Cateté et al., 2016; Bocconi et al., 2016). This may further complicate the situation, leaving the manual assessment error-prone due to inconsistency, fatigue, or favoritism (Zen et al., 2011).
In this context, the adoption of automatic assessment approaches can be beneficial, easing the teacher's workload by enabling them to focus on the manual assessment of complex, subjective aspects such as creativity and/or leaving more time for other activities with students (Ala-Mutka and Järvinen, 2004). It can also help to ensure consistency and accuracy of assessment results as well as eliminate bias (Ala-Mutka, 2005; Romli et al., 2010). Automated assessment can also provide instant, real-time instructional feedback in a continuous way throughout the learning activity, allowing students to make progress without a teacher by their side (Douce et al., 2005; Koh et al., 2014a; Wilcox, 2016; Yadav et al., 2015). Students can use this feedback to improve their programs and programming competencies. Thus, automating the assessment can be beneficial for both students and teachers, improving computing education, even more so in the context of online learning and MOOCs (Vujosevic-Janicic et al., 2013).
As a result, automated grading and assessment tools for programming exercises are already in use in many ways in higher education (Ala-Mutka, 2005; Ihantola et al., 2010; Striewe and Goedicke, 2014). They typically involve static and/or dynamic code analysis. Static code analysis examines source code without executing the program. It is used for programming style assessment, detection of syntax and semantic errors, software metrics, structural or non-structural similarity analysis, keyword detection or plagiarism detection, etc. (Truong et al., 2004; Song et al., 1997; Fonte et al., 2013). Dynamic approaches focus on the execution of the program through a set of predefined test cases, comparing the generated output with the expected output (provided by test cases). The main aim of dynamic analysis is to uncover execution errors and to evaluate the correctness of a program (Hollingsworth, 1960; Reek, 1989; Cheang et al., 2003). And, although a variety of reviews and comparisons of automated systems for assessing programs already exist, they are targeted at text-based programming languages (such as Java, C/C++, etc.) in the context of higher education (Ala-Mutka, 2005; Ihantola et al., 2010; Romli et al., 2010; Striewe and Goedicke, 2014; Keuning et al., 2016).
Thus, the question that remains is which approaches exist and what are their characteristics to support the assessment and grading of computational thinking competencies, specifically on the concept of algorithms and programming, based on the analysis of code developed with block-based visual programming languages in K-12 education.

Related Work
Considering the importance of (automated) support for the assessment of practical programming activities in computing education, there exist several reviews of automated assessment approaches. These reviews analyze various aspects, such as feedback generation, implementation aspects as well as the impact of such tools on learning and teaching (Ala-Mutka, 2005; Ihantola et al., 2010; Romli et al., 2010; Caiza and Del Alamo, 2013; Striewe and Goedicke, 2014; Keuning et al., 2016). Ala-Mutka (2005) presents an overview of several automatic assessment techniques and approaches, addressing grading as well as the instructional feedback provided by the tools. Ihantola et al. (2010) present a systematic review of assessment tools for programming activities. The approaches are discussed from both a technical and a pedagogical point of view. Romli et al. (2010) describe how educators teach programming in higher education and indicate preferences for dynamic or static analysis. Caiza and Del Alamo (2013) analyze key features related to the implementation of approaches for automatic grading, including logical architecture, deployment architecture, evaluation metrics that show how an approach can establish a grade, and technologies used by the approaches. Striewe and Goedicke (2014) relate the technical results of automated analysis in object-oriented programming to didactic benefits and the generation of feedback. Keuning et al. (2016) review the generation of automatic feedback. They also analyze the nature of the feedback, the techniques used to generate it, and the adaptability of tools for teachers to create activities and influence feedback, and synthesize findings about the quality and effectiveness of the assessment provided by the tools.
However, these existing reviews focus on approaches to assess code created with text-based programming languages in the context of teaching computing in higher education. In contrast, the objective of this article is to provide an in-depth analysis of existing approaches for the assessment of programming activities with block-based visual programming languages in the context of K-12 education.

Definition and Execution of the Systematic Mapping Study
In order to elicit the state of the art on approaches to assess computer programs developed by students using block-based visual programming languages in K-12 education, we performed a systematic mapping following the procedure defined by Petersen et al. (2008).

Definition of the Review Protocol
Research Question. Which approaches exist to assess (and grade) programming activities based on code created with block-based visual programming languages in the context of K-12 education?
We refined this research question into the following analysis questions:
Program analysis
• AQ1: Which approaches exist and what are their characteristics?
• AQ2: Which programming concepts related to computational thinking are analyzed?
• AQ3: How are these programming concepts related to computational thinking analyzed?
Instructional feedback and assessment
• AQ4: Is a score generated and, if so, how?
• AQ5: Is instructional feedback presented and, if so, in which manner?
Automation of assessment
• AQ6: Has the approach been automated and, if so, how?
Data source. We examined all published English-language articles that are available on Scopus, the largest abstract and citation database of peer-reviewed literature, including publications from ACM, Elsevier, IEEE and Springer, with free access through the Capes Portal.
Inclusion/exclusion criteria. We consider only peer-reviewed English-language articles that present an approach to the assessment of algorithms and programming concepts based on the analysis of the code. We only consider research focusing on approaches for block-based visual programming languages. And, although our primary focus is on K-12 education, we also include approaches that cover concepts commonly addressed in K-12, but which might have been used at other educational stages (such as higher education). We consider articles that have been published during the last 21 years, between January 1997 (the year after the first block-based programming language was created) and August 2018.
We exclude approaches that analyze code written with text-based programming languages, assess algorithms and programming concepts/practices based on other sources than the code developed by the student, such as tests, questionnaires, interviews, etc., or perform code analysis outside an educational context, e.g., software quality assessment.
Quality criteria. We consider only articles that present substantial information on the presented approach in order to enable the extraction of relevant information regarding the analysis questions.

Definition of search string:
In accordance with our research objective, we define the search string by identifying core concepts, considering also synonyms as indicated in Table 1. The term "code analysis" is chosen, as it expresses the main concept to be searched. The terms "static analysis", "grading" and "assessment" are commonly used in the educational context for this kind of code analysis. The term "visual programming" is chosen to restrict the search to visual programming languages. We also include the names of prominent visual programming languages used in K-12 as synonyms.
Using these keywords, the search string has been calibrated and adapted in conformance with the specific syntax of the data source as presented in Table 2.
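As an illustration of how such a search string can be composed, the following minimal sketch joins synonyms with OR and concept groups with AND. The grouping of terms, the selection of language names and the use of the Scopus `TITLE-ABS-KEY` field code are assumptions made for this example; the calibrated string actually used is the one given in Table 2.

```python
# Hypothetical keyword groups; the actual terms and synonyms are those of Table 1.
CORE_TERMS = ["code analysis", "static analysis", "grading", "assessment"]
LANGUAGE_TERMS = ["visual programming", "Scratch", "App Inventor", "Snap!", "Blockly"]


def or_group(terms):
    """Join the synonyms of one concept with OR, quoting each phrase."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_search_string(*groups):
    """Combine the concept groups with AND inside a title/abstract/keyword search."""
    return "TITLE-ABS-KEY(" + " AND ".join(or_group(g) for g in groups) + ")"


query = build_search_string(CORE_TERMS, LANGUAGE_TERMS)
print(query)
```

Keeping each concept in its own list makes it easy to recalibrate the string by adding or removing synonyms without touching the combination logic.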

Execution of the Search
The search was executed in August 2018 by the first author and revised by the co-authors. In the first analysis stage, we quickly reviewed titles, abstracts and keywords to exclude papers that matched the exclusion criteria, leaving 36 potentially relevant articles based on the search results. In the second selection step, we analyzed the complete text of the pre-selected articles in order to check their accordance with the inclusion/exclusion criteria. All authors participated in the selection process and discussed the selection of papers until a consensus was reached. Table 3 presents the number of articles found and selected per stage of the selection process. Many articles encountered in the searches are outside the focus of our research question, aiming at other forms of assessment such as tests (Weintrop and Wilensky, 2015) or other kinds of performance results. Several articles that analyze other issues such as, for example, the way novice students program using visual programming languages (Aivaloglou and Hermans, 2016) or common errors in Scratch programs (Techapalokul, 2017) have also been left out. Other articles have been excluded as they describe approaches to program comprehension (e.g., Zhang et al., 2013; Kechao et al., 2012), rather than the assessment of students' performance. We discovered a large number of articles presenting approaches for the assessment of code created with text-based programming languages (e.g., Kechao et al., 2012), not considering block-based visual programming languages, which, thus, have been excluded. Articles that present an approach for evaluating other topics that do not include algorithms and programming, e.g., joyfulness and innovation of contents, were also excluded (Hwang et al., 2016). Complete articles or work in progress that meet all inclusion criteria, but do not present sufficient information with respect to the analysis questions, have been excluded due to the quality criterion (e.g., Grover et al., 2016; Chen et al., 2017). In total, we identified 23 relevant articles with respect to our research question (Table 4).
All relevant articles encountered in the search were published within the last nine years as shown in Fig. 1.

Data Analysis
To answer the research question, we present our findings with respect to each of the analysis questions.

Which Approaches Exist and what are their Characteristics?
We found 23 relevant articles describing 14 different approaches, as some of them present the same approach just from a different perspective. Most approaches have been developed to assess code created with Scratch. The approaches also differ with respect to the type of programming activity for which they are designed as shown in Table 5. All approaches carry out a code analysis aiming at measuring competence in programming concepts as a way of assessing computational thinking. Performing a product-oriented analysis, analyzing the code itself, the approaches look for indicators of concepts of algorithms and programming related to computational thinking practices.
In accordance with the CSTA K-12 computer science framework (CSTA, 2016), most approaches analyze four of the five subconcepts related to the core concept algorithms and programming: control, algorithms, variables and modularity (Table 6). None of the approaches explores the subconcept program development. Thus, although not directly measuring all the dimensions of computational thinking, these approaches intend to assess computational thinking indirectly by measuring algorithms and programming concepts through the presence of specific algorithms and program commands. Other approaches, such as CTP (Koh et al., 2014b), analyze computational thinking patterns, including generation, collision, transportation, diffusion and hill climbing. Some approaches also outline the manual analysis of elements related to the content of the program developed, such as creativity and aesthetics (Kwon and Sohn, 2016b; Werner et al., 2012). The functionality of the program is analyzed only by dynamic code analysis or manual approaches. Two approaches (Kwon and Sohn, 2016b; Denner et al., 2012) analyze the completeness level by analyzing if the program has several functions or, in the case of games, several levels. Ninja Code Village (Ota et al., 2016) also analyzes if there are general-purpose functions in the code, for example, if, in a game, a "jump" function has been implemented. Three approaches analyze code organization or documentation, e.g., meaningful naming for variables or creating procedures to organize the code (Denner et al., 2012; Gresse von Wangenheim et al., 2018; Funke et al., 2017). Only two (manual) approaches analyze elements related to usability (Denner et al., 2012; Funke et al., 2017).
Some approaches also analyze specific competences regarding the characteristics of the programming language and/or program type, such as computational thinking patterns in games (Koh et al., 2014b). However, just based on the code created, it may not be possible to analyze fundamental practices, such as recognition and definition of computational problems (Brennan and Resnick, 2012). The assessment of other complex aspects, such as creativity, is also difficult to automate, reflected by the fact that no automated approach with respect to this criterion has been encountered. To evaluate these topics, other complementary forms of evaluation should be used, such as artifact-based interviews and design scenarios (Brennan and Resnick, 2012).

How are these Programming Concepts Related to Computational Thinking Analyzed?
The approaches analyze code in different ways, including automated static or dynamic code analysis or manual code analysis. The majority of the encountered approaches use static code analysis (Table 7). This is also related to the fact that the type of analysis depends on the type of programming activity. Only in the case of open-ended well-structured problems with a solution known in advance is it possible to compare the students' program code with representations of correct implementations for the given problem, thus allowing dynamic code analysis.
All approaches that focus on the analysis of activities with open-ended ill-structured problems are based on static code analysis, detecting the presence of command blocks. This allows identifying which commands are used and how often. In order to measure computational thinking competences, static approaches analyze the code in order to detect the presence of specific program commands/constructs, inferring from their presence the learning of algorithms and programming concepts. For example, to measure competence with respect to the subconcept control, which specifies the order in which [...]. This type of approach assumes that the presence of a specific command block indicates a conceptual encounter (Brennan and Resnick, 2012).
Based on the identified quantities of certain commands, further analyses are performed, for example, calculating sums, averages and percentages. The results of the analysis are presented in various forms, including charts with different degrees of detail. For example, the Scrape tool (Wolz et al., 2011) presents general information about the program, the percentage of each command present in the project as well as the exact number of times each command was used per category. Some approaches present the results of the static analysis on a more abstract level beyond the quantification of commands (Moreno-León and Robles, 2015; Ota et al., 2016; Gresse von Wangenheim et al., 2018). An example is the Dr. Scratch tool (Moreno-León and Robles, 2015) that analyzes concepts such as abstraction, logic and parallelism, providing a score for each concept based on a rubric.
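The counting step described above can be sketched as follows. This is a minimal illustration, not the implementation of Scrape or any other reviewed tool: the flattened list of (category, block) pairs and the block names are hypothetical, and a real tool would first extract them from the project file.

```python
from collections import Counter

# Hypothetical flattened project: one (category, block name) pair per command block.
project_blocks = [
    ("control", "repeat"),
    ("control", "if"),
    ("control", "repeat"),
    ("variables", "set variable"),
    ("motion", "move steps"),
    ("motion", "move steps"),
    ("motion", "turn"),
    ("looks", "say"),
]


def block_statistics(blocks):
    """Count each block and each category, and compute category percentages."""
    per_block = Counter(name for _, name in blocks)
    per_category = Counter(category for category, _ in blocks)
    total = len(blocks)
    percentages = {cat: 100.0 * n / total for cat, n in per_category.items()}
    return per_block, per_category, percentages


per_block, per_category, percentages = block_statistics(project_blocks)
print(per_block["repeat"])     # number of times the "repeat" block is used
print(percentages["control"])  # share of control blocks in the project
```

Note that such counts only show that a block is present; as stated above, treating presence as evidence of learning is an assumption of this type of approach.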
Programming activities with open-ended well-structured problems can also be assessed adopting static code analysis, typically by comparing the students' code with the representation of the correct solution pre-defined by the instructor. In this case, the analysis is carried out by checking if a certain set of commands is present in the student's program (e.g., Koh et al., 2014a). Yet, this approach requires that for each programming exercise the teacher or the instructional designer programs a model solution beforehand.
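A check of this kind can be sketched as a set comparison between the commands required by the model solution and those found in the student's program. The block names and the helper `missing_commands` are hypothetical and serve only to illustrate the idea.

```python
def missing_commands(student_blocks, required_blocks):
    """Return the required commands that do not occur in the student's program."""
    return sorted(set(required_blocks) - set(student_blocks))


# Hypothetical set of commands derived from the instructor's model solution.
required = {"when flag clicked", "repeat", "move steps"}

# Hypothetical student program that repeats a command instead of using a loop.
student = ["when flag clicked", "move steps", "move steps", "move steps"]

missing = missing_commands(student, required)
print(missing)  # lists "repeat" as the missing command
```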
Some approaches adopt a dynamic code analysis (e.g., Maiorana et al., 2015). In this case, tests are run in order to determine if the solution of the student is correct based on the output produced by the program. However, adopting this approach requires at least the predefinition of the requirements and/or test cases for the programming activity. Another disadvantage of this kind of black-box testing is that it examines the functionality of a program without analyzing its internal structure. Thus, a solution that generates a correct result may be considered correct, even when not using the intended programming constructs, e.g., repeating the same command several times instead of using a loop construct. In addition to these basic types of analysis, ITCH (Johnson, 2016) adopts a hybrid approach combining dynamic analysis (through custom tests) and static analysis for open-ended well-structured problems with a correct solution known in advance.
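The black-box limitation can be made concrete with a small sketch. Two hypothetical student solutions for drawing a square are modeled as Python functions returning the sequence of executed commands; this modeling and the test cases are assumptions for illustration and do not reflect how any reviewed tool executes block-based programs.

```python
def square_with_loop(side):
    """Intended solution: uses a loop construct."""
    path = []
    for _ in range(4):
        path.append(("move", side))
        path.append(("turn", 90))
    return path


def square_without_loop(side):
    """Same output, but the commands are repeated instead of looped."""
    return [
        ("move", side), ("turn", 90),
        ("move", side), ("turn", 90),
        ("move", side), ("turn", 90),
        ("move", side), ("turn", 90),
    ]


# Predefined test cases: (input, expected output).
TEST_CASES = [
    (10, [("move", 10), ("turn", 90)] * 4),
    (5, [("move", 5), ("turn", 90)] * 4),
]


def passes_all_tests(program, test_cases):
    """Dynamic analysis: run the program and compare its output with the expected one."""
    return all(program(arg) == expected for arg, expected in test_cases)


print(passes_all_tests(square_with_loop, TEST_CASES))
print(passes_all_tests(square_without_loop, TEST_CASES))
```

Both programs pass, although only the first uses the intended loop construct, which is why a hybrid approach additionally inspects the code statically.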
Several approaches rely on manual code analysis for either type of activity (with or without a solution known in advance), typically using rubrics. An example is the Fairy Assessment approach (Werner et al., 2012), which uses a rubric to assess the code for open-ended well-structured problems. The PECT approach presents a rubric to perform manual analysis for open-ended ill-structured problems (Seiter and Foreman, 2013).

How do the Approaches Provide Instructional Feedback?
The assessment of the students' computational thinking competence is done by using different forms of grading. Some approaches use a dichotomous scoring attribute, assessing the correctness of a program as a whole, e.g., indicating if it is right or wrong.
An example is Quizly (Maiorana et al., 2015), which tests the program of the student and then shows a message indicating if the program is correct or incorrect as well as the error that occurred.
Several approaches divide the program or competencies into areas and assign a polytomous score for each area. Therefore, a single program can receive different scores for each area (Boe et al., 2013). An example is Hairball (Boe et al., 2013), which labels each area as (i) correct, when the concept was implemented correctly, (ii) semantically incorrect, when the concept was implemented in a way that does not always work as intended, (iii) incorrect, when it was implemented incorrectly, or (iv) incomplete, when only a subset of the blocks needed for a concept was used.
Some approaches provide a composite score based on each of these polytomous scores. An example is Dr. Scratch (Moreno-León and Robles, 2015) that analyzes seven areas and assigns a polytomous score to each area. In this case, it is assumed that the use of blocks of greater complexity, such as "if then, else", implies higher performance levels than using blocks of less complexity such as "if then". A final score is assigned to the student's program based on the sum of the polytomous scores. This final composite score indicates a general assessment on a 3-point ordinal scale (basic, developing, master). The composite score is also represented through a mascot, adopting gamification elements in the assessment (Moreno-León and Robles, 2015). The tool also creates a customized certificate that can be saved and printed by the student. Similarly, CodeMaster (Gresse von Wangenheim et al., 2018) assigns a polytomous score based on either its App Inventor or Snap! rubric. A total score is calculated through the sum of the partial scores, and based on the total score a numerical grade and a ninja badge are assigned.
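A composite score of this kind can be sketched as a sum of per-area scores mapped onto an ordinal scale. In the sketch below, the area names, the 0 to 3 range per area and the cut-offs at one third and two thirds of the maximum are assumptions made for illustration; the source only states that seven areas are summed and mapped to a 3-point scale.

```python
# Hypothetical polytomous scores (0 to 3) for seven areas of one student program.
AREA_SCORES = {
    "abstraction": 2,
    "logic": 1,
    "parallelism": 3,
    "synchronization": 1,
    "flow control": 2,
    "user interactivity": 2,
    "data representation": 1,
}
MAX_PER_AREA = 3  # assumed maximum score per area


def composite_level(scores, max_per_area=MAX_PER_AREA):
    """Sum the polytomous scores and map the total to a 3-point ordinal scale."""
    total = sum(scores.values())
    maximum = max_per_area * len(scores)
    # Assumed cut-offs: lower third, middle third, upper third of the maximum.
    if total <= maximum / 3:
        level = "basic"
    elif total <= 2 * maximum / 3:
        level = "developing"
    else:
        level = "master"
    return total, level


total, level = composite_level(AREA_SCORES)
print(total, level)
```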
Another way of assigning a composite score is based on a weighted sum of the individual scores for each area or considering different variables. The approach presented by Kwon and Sohn (2016a) evaluates several criteria for distinct areas, each one with different weights. The CTP approach (Koh et al., 2014b) assigns a total score to the program based on primitives related to computational thinking patterns.
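A weighted composite score can be sketched as a weighted average of the individual area scores. The criteria, scores and weights below are hypothetical and are not those used by Kwon and Sohn (2016a).

```python
def weighted_score(scores, weights):
    """Weighted average of the area scores; weights need not sum to one."""
    total_weight = sum(weights.values())
    return sum(scores[area] * weights[area] for area in weights) / total_weight


# Hypothetical criteria with scores on a common scale and integer weights.
scores = {"functionality": 4, "code organization": 2, "creativity": 3}
weights = {"functionality": 5, "code organization": 3, "creativity": 2}

result = weighted_score(scores, weights)
print(result)  # (4*5 + 2*3 + 3*2) / 10
```

Dividing by the sum of the weights keeps the result on the same scale as the individual scores, so weights can be adjusted without rescaling the grade.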
The assessment of the approaches is intended to be used in a summative and/or formative way. Few approaches provide feedback by explicitly giving suggestions or tips on how to improve the code (Table 9).
Feedback is customized according to the results of the code analysis and involves suggestions on good practices or modifications that can be made in order to achieve a higher score and to close the gap between the current performance and what is considered good performance. None of the articles report in detail how this feedback is generated. However, it can be inferred that approaches that perform a static code analysis create this feedback based on the results of the code analysis. On the other hand, feedback given by approaches using dynamic code analysis is based on the response obtained by the execution of the program. Some approaches also provide tips that can be consulted at any time. Dr. Scratch (Moreno-León and Robles, 2015) provides a generic explanation on how to achieve the highest score for each of the evaluated areas. Similarly, CodeMaster (Gresse von Wangenheim et al., 2018) presents the rubric used to assess the program. Quizly (Maiorana et al., 2015), along with the task description, provides a link to a tutorial on how to solve exercises.
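Since the reviewed articles do not report how feedback is generated, the following is only one conceivable sketch of deriving tips from static analysis results: every area scored below the maximum triggers a tip. The areas, the tip texts and the maximum score of 3 are hypothetical.

```python
# Hypothetical tips per area; not taken from any of the reviewed tools.
TIPS = {
    "logic": "Try combining conditions with logical operators.",
    "abstraction": "Consider defining your own blocks for repeated scripts.",
}
DEFAULT_TIP = "See the rubric for what the next performance level requires."


def generate_feedback(scores, max_score=3):
    """Return one tip for each area scored below the assumed maximum."""
    return [
        f"{area}: {TIPS.get(area, DEFAULT_TIP)}"
        for area, score in scores.items()
        if score < max_score
    ]


feedback = generate_feedback({"logic": 1, "abstraction": 3, "parallelism": 2})
for line in feedback:
    print(line)
```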

Have the Approaches been Automated?
Only a few approaches are automated through software tools (Table 10). Most automated approaches perform a static code analysis. Few approaches use dynamic code analysis to assess the student's solution (Johnson, 2016; Maiorana et al., 2015; Ball, 2017). These software tools seem to be typically targeted at teachers and/or students, with few exceptions also providing features for administrators or institutional representatives (Maiorana et al., 2015).
In general, details about the implementation of the tools are not presented in the encountered literature, with few exceptions. Hairball (Boe et al., 2013) was developed as a set of scripts in Python using the object-oriented paradigm, so that it can be extended and adapted to evaluate specific tasks. Dr. Scratch (Moreno-León and Robles, 2015) was implemented in Python as a set of plug-ins built on Hairball. CodeMaster separates the analysis/assessment and presentation into different modules. The backend system was implemented in Java 8, running on an Apache Tomcat 8 application server using a MySQL 5.7 database. The front-end component was implemented in JavaScript using the Bootstrap library with an additional custom layout (Gresse von Wangenheim et al., 2018). The tools are accessed in the form of scripts, web applications, or desktop applications. As a web application, Dr. Scratch, for example, allows the user to simply enter the Scratch project's URL or to upload the exported Scratch file to run the assessment.
Although it was not possible to get detailed license information on all tools, we were able to access a free implementation of several tools (Table 10). Most tools are available in English only, with few exceptions providing internationalization and localization for several languages, such as Japanese, Spanish, Portuguese, etc., thus facilitating a widespread adoption in different countries.

Discussion
Despite the importance of automated support for the assessment of programming activities in order to widely implement the teaching of computing in K-12 education, only very few approaches exist so far. Most of the approaches focus on analyzing Scratch code, Scratch being currently one of the most widely used block-based visual programming languages, popular in several countries. For other block-based visual programming languages, very few solutions have been encountered. The approaches are intended to be used for formative and/or summative assessment in computing education. We encountered approaches for different types of programming activities, including open-ended well-structured problems with a pre-defined correct or best solution as well as open-ended ill-structured problems in problem-based learning contexts.
Although the majority of the approaches aim at supporting the assessment and grading process of the teacher, some tools are also intended to be used by the students directly to monitor and guide their learning progress. Examples are Dr. Scratch (Moreno-León and Robles, 2015), CodeMaster (Gresse von Wangenheim et al., 2018) and CTP (Koh et al., 2014a), the last of which even provides real-time feedback during the programming activity. The approaches typically provide feedback in the form of a score based on the analysis of the code, including dichotomous or polytomous scores for single areas/concepts as well as composite scores providing a general result. Few approaches provide suggestions or tips on how to improve the code (in addition to a single score) in order to guide the learning process. Only two approaches (Moreno-León and Robles, 2015; Gresse von Wangenheim et al., 2018) use a kind of gamification by presenting the level of experience in a playful way with mascots.
For the assessment, the approaches use static, dynamic or manual code analysis to analyze the code created by the student. The advantage of static code analysis approaches, which measure certain code qualities, is that they do not require a pre-defined correct or best solution, making them an alternative for the assessment of ill-structured activities in problem-based learning contexts. However, the absence of a pre-defined solution for such ill-structured activities limits the analysis to certain code qualities and does not allow the correctness of the code to be validated. On the other hand, ill-structured problems are important in computing education in order to stimulate the development of higher-order thinking. Dynamic code analysis approaches can be applied to the assessment of open-ended well-structured problems for which a pre-defined solution exists. However, because they do not consider the internal code structure, they have the disadvantage that they may consider a program correct (when it generates the expected output) even when the expected programming commands were not used.
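The trade-off between static and dynamic analysis can be made concrete with a small Python sketch. The toy block language, the helper names and the sample task (output "hi" three times using a loop) are assumptions made for illustration; they do not reproduce Scratch's project format or the implementation of any reviewed tool.

```python
from collections import Counter
from typing import List, Tuple

# Toy block language (an assumption for illustration, not Scratch's format):
#   ("say", text)            appends text to the output
#   ("repeat", n, [blocks])  runs the nested blocks n times
Block = Tuple
Program = List[Block]


def run(program: Program) -> List[str]:
    """Dynamic analysis helper: execute the program and collect its output."""
    output: List[str] = []
    for block in program:
        if block[0] == "say":
            output.append(block[1])
        elif block[0] == "repeat":
            for _ in range(block[1]):
                output.extend(run(block[2]))
    return output


def count_blocks(program: Program) -> Counter:
    """Static analysis helper: count block types without executing anything."""
    counts: Counter = Counter()
    for block in program:
        counts[block[0]] += 1
        if block[0] == "repeat":
            counts.update(count_blocks(block[2]))
    return counts


def dynamic_check(program: Program, expected: List[str]) -> bool:
    """Pass if the program produces the expected output."""
    return run(program) == expected


def static_check(program: Program, required_block: str) -> bool:
    """Pass if the required block type appears anywhere in the code."""
    return count_blocks(program)[required_block] > 0


expected_output = ["hi", "hi", "hi"]
with_loop = [("repeat", 3, [("say", "hi")])]
without_loop = [("say", "hi"), ("say", "hi"), ("say", "hi")]
```

Both sample programs pass `dynamic_check`, since they produce the same output, but only `with_loop` passes `static_check` for the `repeat` block. Conversely, the static check alone says nothing about whether the output is correct, which mirrors the limitation of static analysis for validating correctness.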
By focusing on performance-based assessments based on the analysis of the code created by the students, the approaches infer an assessment of computational thinking concepts and practices, specifically related to the concept of algorithms and programming. This explains the strong emphasis on programming-related competences: the majority of the approaches assess mostly algorithm and programming sub-concepts, such as algorithms, variables, control and modularity, as part of computational thinking competences. Additional elements such as usability, code organization, documentation, aesthetics or creativity are assessed only by manual approaches. This limitation of assessment based exclusively on the analysis of the code also indicates the need for alternative assessment methods, such as observation or interviews, in order to provide a more comprehensive assessment, especially with regard to computational thinking practices and perspectives (Brennan and Resnick, 2012). In this respect, approaches based on code analysis should be considered one means for the assessment of computational thinking that, especially when automated, frees the teacher to focus on complementary assessment methods, rather than the only way of assessment.
We also observed that there does not seem to be a consensus on the concrete criteria, rating scales, scores and levels of performance among the encountered approaches. Few articles indicate the explicit use of rubrics as a basis for the assessment (Seiter and Foreman, 2013; Werner et al., 2012; Moreno-León and Robles, 2015; Ota et al., 2016; Gresse von Wangenheim et al., 2018). This confirms the findings by Grover and Pea (2013) and Grover et al. (2015) that, despite the many efforts aimed at dealing with the issue of computational thinking assessment, there is so far no consensus on strategies for assessing computational thinking concepts.
The approaches are designed for the context of teaching programming to novice students, mostly focusing on K-12. Some approaches are also being applied in different contexts, including K-12 and higher education. However, with respect to K-12 education, none of the approaches specifies more exactly the educational stage for which it has been developed. Yet, taking into consideration the large differences in child learning development stages in K-12 and, consequently, the need for different learning objectives for different stages, as for example refined by the CSTA curriculum framework (CSTA, 2016), it seems essential to specify more clearly which educational stage is addressed and/or to provide differentiated solutions for different stages.
Only some approaches are automated. Yet, some of them do not provide user-friendly access, showing results only in a terminal that runs scripts, which may hinder their adoption. Another factor that may hinder the widespread application of the tools in practice is their availability in English only, with only a few exceptions available in a few other languages. Another shortcoming we observed is that these tools are provided as stand-alone tools that are not integrated into programming environments and/or course management systems, an integration that would ease their adoption in existing educational contexts.
These results show that, although some isolated solutions exist, there is still a considerable gap not only in automated assessment tools, but also in the conceptual definition of computational thinking assessment strategies with respect to the concept of algorithms and programming. As a result of the mapping study, several research opportunities in this area can be identified, including the definition of well-defined assessment criteria for a wide variety of block-based visual programming languages, especially for the assessment of open-ended ill-defined activities aiming at the development of diverse types of applications (games, animations, apps, etc.). Given the predominant focus on the assessment of programming concepts, the consideration of other important criteria such as creativity could also be important, as computing education is expected not only to teach programming concepts, but also to contribute to the learning of 21st-century skills in general. Another improvement opportunity seems to be the provision of better instructional feedback in a constructive way that effectively guides the student to improve learning. We also observed a lack of differentiation between the stages of K-12 education, which ranges from kindergarten to 12th grade with significant changes in learning needs at different educational stages. Thus, another research opportunity would be the study of these changing needs and characteristics with respect to different child learning development stages. Considering also practical restrictions and a trend towards MOOCs, the development and improvement of automated solutions that allow easy, real-time assessment and feedback to the students could further improve the guidance of the students as well as reduce the workload of the teachers and, thus, help to scale up the application of computing education in schools.
Threats to Validity. Systematic mappings may suffer from the common bias that positive outcomes are more likely to be published than negative ones. However, we consider that the findings of the articles have only a minor influence on this systematic mapping since we sought to characterize the approaches rather than to analyze their impact on learning. Another risk is the omission of relevant studies. In order to mitigate this risk, we carefully constructed the search string to be as inclusive as possible, considering not only core concepts but also synonyms. The risk was further mitigated by the use of multiple databases (indexed by Scopus) that cover the majority of scientific publications in the field. Threats to study selection and data extraction were mitigated by providing a detailed definition of the inclusion/exclusion criteria. We defined and documented a rigid protocol for the study selection, and the selection was conducted by all co-authors together until consensus was achieved. Data extraction was hindered in some cases as the relevant information has not always been reported explicitly and, thus, in some cases had to be inferred. In these cases, the inference was made by the first author and carefully reviewed by the co-authors.

Conclusions
In this article, we present the state of the art on approaches to assess computer programs developed by students using block-based visual programming languages in K-12 education. We identified 23 relevant articles, describing 14 different approaches. The majority of the approaches focus on the assessment of Scratch, Snap! or App Inventor programs, with only singular solutions for other block-based programming languages. By focusing on performance-based assessments based on the analysis of the code created by the students, the approaches infer computational thinking competencies, specifically related to the concept of algorithms and programming, using static, dynamic or manual code analysis. Most approaches analyze concepts directly related to algorithms and programming, while some approaches also analyze other topics such as design and creativity. Eight approaches have been automated in order to support the teacher, while some also provide feedback directly to the students. The approaches typically provide feedback in the form of a score based on the analysis of the code, including dichotomous or polytomous scores for single areas/concepts as well as composite scores providing a general result. Only a few approaches explicitly provide suggestions or tips on how to improve the code and/or use gamification elements, such as badges. As a result of the analysis, a lack of consensus on the assessment criteria and instructional feedback has become evident, as well as the need to extend such support to a wider variety of block-based programming languages. We also observed a lack of contextualization of these approaches within the educational setting, for example an indication of how the approaches can be complemented by alternative assessment methods such as observations or interviews in order to provide more comprehensive feedback, covering also concepts and practices that may be difficult to assess automatically. These results indicate the need for further research in order to support a wide application of computing education in K-12 schools.

Fig. 1. Number of publications per year presenting approaches on code analysis of visual programming languages in the educational context.

Table 3
Number of articles per selection stage

Table 4
Relevant articles found in the search
ITCH Individual testing of computer homework for scratch assignments
14 (Seiter and Foreman, 2013) Modeling the Learning Progressions of Computational Thinking of Primary Grade Students
(Ball and Garcia, 2016) code village for scratch - Function samples function analyser and automatic assessment of computational thinking concepts
16 (Werner et al., 2012) The Fairy Performance Assessment Measuring
17 (Denner et al., 2012) Computer games created by middle school girls: Can they be used to measure understanding of computer science concepts?
18 (Wolz et al., 2011) Scrape: A tool for visualizing the code of scratch programs
19 (Maiorana et al., 2015) Quizly - A live coding assessment platform for App Inventor
20 (Ball and Garcia, 2016) Autograding and Feedback for Snap!: A Visual Programming Language.

Table 5
Overview of the characteristics of the approaches encountered

Table 6
Overview on the analyzed elements
[...] they check if the student used a loop command in the program.

Table 9
Overview on types of assessment and instructional feedback