A Proposal for Performance-based Assessment of the Learning of Machine Learning Concepts and Practices in K-12

Although Machine Learning (ML) is already used in our daily lives, few are familiar with the technology. This poses new challenges for students to understand ML, its potential, and limitations as well as to empower them to become creators of intelligent solutions. To effectively guide the learning of ML, this article proposes a scoring rubric for the performance-based assessment of the learning of concepts and practices regarding image classification with artificial neural networks in K-12. The assessment is based on the examination of student-created artifacts as a part of open-ended applications on the use stage of the Use-Modify-Create cycle. An initial evaluation of the scoring rubric through an expert panel demonstrates its internal consistency as well as its correctness and relevance. Providing a first step for the assessment of concepts on image recognition, the results may support the progress of learning ML by providing feedback to students and teachers.


Introduction
Machine Learning (ML) has become part of our everyday life deeply impacting our society. Different from Artificial Intelligence, focusing on theory and development of computer systems able to perform tasks that normally require human intelligence, Machine Learning focuses on the development of systems that learn and improve from experience on their own without having to be explicitly programmed. Currently, ML is one of the most rapidly growing areas within artificial intelligence (Holzinger et al., 2018). Recent progress in ML has been specifically achieved by deep learning approaches using neural networks, dramatically improving the state-of-the-art in image recognition, object detection, and speech recognition in many domains (Jordan and Mitchell, 2015;LeCun et al., 2015).
Yet, most people do not understand the technology behind it, which can make ML mysterious or even scary, overshadowing its potential positive impact on society (Evangelista et al., 2018; Ho and Scadding, 2019). Thus, to demystify what ML is, how it works, and what its impact and limitations are, there is a growing need for public understanding of ML (House of Lords, 2018; Tuomi, 2018). Therefore, it becomes important to introduce basic concepts and practices already in school, empowering students to become not just consumers, but also creators of intelligent solutions (Kandlhofer et al., 2016; Royal Society, 2017; Touretzky et al., 2019). Knowledge about ML concepts, the ability to use and create ML models, together with the ability to critically analyze benefits, social and ethical aspects of AI, are becoming key skills of the 21st century to educate the next generation as responsible citizens (Steinbauer et al., 2021; Touretzky et al., 2019).
And, although being a complex knowledge area, studies have shown that children are able to learn ML concepts from a relatively young age (Hitron et al., 2019). The introduction to this kind of complex knowledge also has the potential to improve children's everyday skills as well as to better prepare them to deal with challenges that arise as a result of the use of ML (Kahn et al., 2020).
It may also encourage more students to choose computing careers and provide adequate preparation for higher education, taking into consideration a major shift in the labor market with a fast-growing need for ML-literate workers (Tuomi, 2018; Touretzky et al., 2019). Thus, teaching ML at K-12 not only helps young people to understand this emerging technology and how it works but can also inspire future ML users and creators to get acquainted with the world, to understand it, and to change it (Pedró et al., 2019; Webb et al., 2021).
As indicated by the curricular guidelines for teaching Artificial Intelligence (Touretzky et al., 2019), teaching AI in K-12 should also include Machine learning represented by Big Idea #3 -Learning (Fig. 1). Following these guidelines, teaching ML on this educational stage should include an understanding of basic ML concepts, such as learning algorithms and fundamentals of neural networks, as well as limitations and ethical concerns related to ML.
As ML is a complex knowledge area, it is important to carefully define the sequence of learning goals to be achieved with sufficient scaffolds for novices to start to create ML models with little instruction in the beginning to keep students engaged (low threshold) while also being able to support sophisticated programs with the learning progression (high ceiling). In this context, active learning that stresses action and direct experience is crucial to make ML transparent and enable students to build correct mental models (Wong et al., 2020). As a part of the human-centric development of an ML model, students can explore several tasks from preparing a dataset, selecting an appropriate learning algorithm, training the ML model, and evaluating its performance (Lwakatare et al., 2019;Ramos et al., 2020). Representing a complex area, the best approach is to start with lower-level competencies and then progress upwards. In order to guide the learning progression focusing on the application of ML concepts and practices, often the Use-Modify-Create cycle (Lytle et al., 2019) is also applied for ML education. Following this cycle, students are introduced to ML topics by using and analyzing a provided ML artifact as well as learning how to develop a predefined ML model, then modifying one, until creating their own ones.
Traditionally, Machine Learning has been taught mostly in higher education (McGovern et al., 2011; Torrey, 2012). And, although there are many programs today that focus on coding and robotics, K-12 education still needs to embrace the teaching of Artificial Intelligence, including ML (Hubwieser et al., 2015). However, various initiatives promoting ML education in K-12 have lately emerged, including several countries such as China introducing artificial intelligence and ML into curricula in primary and secondary schools (Yang, 2019).
These instructional units teach competencies varying from presenting what ML is to specific ML techniques, with an emphasis on artificial neural networks as well as the impacts of ML. Because of the complexity, several instructional units address only the most accessible processes, such as data management, while others cover the complete ML process in a simplified way, black-boxing some of the underlying ML processes to different degrees. Typically, visual tools that do not require any programming, such as Google Teachable Machine (Google, 2020) or customized solutions such as LearningML (Rodríguez García et al., 2020) or PIC (Tang et al., 2019), are adopted at this educational stage. This allows students to execute an ML process in an interactive way using a train-feedback-correct cycle, enabling them to evaluate the current state of the model and take appropriate actions (Gresse von Wangenheim et al., 2021). Most of these tools are available for free online and use resources in the cloud to train the ML models, enabling their adoption in schools with common computer labs and internet connections (Gresse von Wangenheim et al., 2021). These tools also allow the easy deployment of the created ML models into popular block-based environments, such as App Inventor, Scratch, or Snap!, which are used to teach computing in K-12.
As a part of the learning process, it is important to assess the students' learning by providing feedback to both the student and the teacher (Hattie & Timperley, 2007). For effective learning, students need to know their level of performance on a task, how their own performance relates to good performance, and what to do to close the gap between those (Sadler, 1989). Despite the many efforts to address the assessment of computing education in K-12 settings, more emphasis has been placed on computational thinking, algorithms & programming, and modeling & simulation (Lye & Koh, 2014; Tang et al., 2019; Yasar et al., 2016), while most instructional units on ML currently do not propose rigorous assessment solutions. A few ML courses include rather simple quiz-based assessments, while performance-based assessments are basically nonexistent. As one of the few existing studies, Sakulkueakulsuk et al. (2018) propose an assessment based on the performance of the ML model created by the students, while AI Family Challenge (Technovation Families, 2019a) and Exploring Computer Science (2019) assess the outcome or students' presentation through rubrics. However, no further information on their design or evaluation has been encountered, thus their effectiveness and evidence for validity remain questionable.
Therefore, this research aims to initiate the development of a scoring rubric for assessing the learning of ML concepts and practices focusing on image recognition with supervised learning. The rubric is defined as part of a performance-based assessment based on ML artifacts created by students as an outcome of the use stage in K-12. In this line, the following research questions were addressed:
(1) What is the evidence of internal consistency of the performance-based assessment scoring rubric?
(2) What is the evidence of content validity of the performance-based assessment scoring rubric?

Research Methodology
The development of the performance-based assessment is based on the method proposed by Moskal and Leydens (2000) and evidence-centered design (Mislevy et al., 2003), including the following phases:
Content domain analysis. The content domain was analyzed through a systematic literature review on the definition of ML concepts and practices as well as learning objectives and evidence of these in outcomes created by middle- and high-school students.
Definition of the scale for assessment. As a part of an initial proposal of a scale, a scoring rubric has been defined to identify criteria with which the students' learning outcome is measured. It represents a descriptive scoring scheme (Brookhart, 1999;Moskal, 2000) for performance-based assessments of ML artifacts created as learning outcomes (Kandlhofer et al., 2016). Therefore, we identified the characteristics that are to be evidenced in a student's work to indicate proficient performance in relation to the respective learning objectives (Allen & Knight, 2009;Brookhart, 1999;Moskal, 2000). Then, for each criterion, performance levels were defined as descriptions of the different score levels.

Evaluation of internal consistency and content validity. To evaluate the initial proposal of the scale, we conducted an expert panel, in which the participants assessed exemplary learning outcomes using the scoring rubric and afterward provided feedback through a questionnaire. The expert panel consisted of 16 professionals from relevant fields including machine learning and/or computing education and related areas including mathematics, computer graphics, and psychology. We evaluated internal consistency by analyzing inter-rater reliability, which relates to the issue that a student's score may differ among different raters. We used Fleiss' kappa coefficient based on the scores given by the participants concerning two ML models created as exemplary learning outcomes using the developed rubric (Fleiss et al., 2003; Moskal & Leydens, 2000). Content validity was evaluated based on the questionnaire responses by analyzing correctness, completeness, clarity, and relevance, evaluating the extent to which criteria reflect the variables of the construct, and determining whether the measure is well-constructed (Moskal & Leydens, 2000; Rubio et al., 2003). Content validity was analyzed through descriptive statistics and the content validity ratio proposed by Lawshe (1975). The results have been interpreted and discussed in the respective educational context.

State of the Art
To review the state of the art and practice on how ML concepts and practices are being assessed, we performed a systematic mapping study following the approach proposed by Petersen et al. (2008). Searching digital libraries in this field including ACM Digital Library, IEEE Xplore, Scopus, arXiv, SocArXiv, and Google Scholar/Google to minimize the risk of omitting instructional units that may not have been published as scientific articles, we considered any instructional unit (e.g., course, activity, tutorial) that covers teaching ML in elementary to high school. As we observed that, in several cases, courses do not focus exclusively on ML but rather cover this topic as a part of a wider course on Artificial Intelligence (AI), we also searched for AI courses in order to minimize the risk of omission. Yet, instructional units on AI that do not cover ML topics were excluded, as were instructional units targeting other educational stages. As a result, a total of 14 instructional units were identified that also adopted some kind of assessment (Table 1) (Salvador et al., 2021). Most focus on the assessment of basic ML concepts with some also covering neural networks and/or the impact of ML. The majority assess learning on the remembering and understanding level following Bloom's taxonomy (Bloom et al., 1956). Nine of the instructional units approach the assessment of learning the application of ML. Most of the assessments are quite simple, in some cases consisting of single-question quizzes at the end of learning units or only monitoring task completion. An exception is Elements of AI (Elements of AI, 2019), which also assesses the answers to exercises. Two courses teaching ML with MIT App Inventor (2019a, 2019b) propose tests composed of three multiple-choice questions for the assessment of the students at the end of the course.
Considering that currently most of these ML courses are offered as extracurricular activities, such lightweight assessment approaches may be adequate to prevent the demotivation of the students. Yet, the lack of a more rigorous assessment may impede better support for their learning and the improvement of these courses. Very few adopt performance-based assessment defining rubrics for the assessment of presentations (Exploring Computer Science, 2019), learning results (Technovation Families, 2019a), or based on performance measures of ML models created by the students (Sakulkueakulsuk et al., 2018; Gresse von Wangenheim et al., 2021). Due to the recentness of ML courses in K-12 education settings, most focus on the assessment of results from the use stage, with only Apps for Good (2019), Technovation Families (2019a), Rodríguez García et al. (2020), and Van Brummelen et al. (2020) adopting a computational action (Tissenbaum et al., 2019) strategy that allows students to develop their own custom ML models that provide an impact on their lives and communities. To date, these assessments are to be performed manually by the instructors or judges (Technovation Families, 2019a). Only some of the quiz-based assessments are automated as a part of online courses. Instructional feedback to the student is typically limited to indicating whether the question(s) have been answered correctly. However, in general, the proposed assessments seem to be just emerging, lacking further information on how they have been designed or evaluated, especially when comparing them to research on assessment in computing education in K-12 in general. As a consequence, there seems to be no information on the reliability and validity of such assessments available.

Definition of the Performance-Based Assessment
Focusing on an active learning strategy taking students to create ML models with artificial neural networks, authentic assessment based on the created outcomes is an appropriate means allowing the openness of student responses, as opposed to, for example, multiple-choice assessments (Messick, 1996;Torrance, 1995). The assessment is based on the assumption that certain measurable attributes can be extracted from the artifacts created by the students during the learning process, evaluating whether the artifacts show that they have learned what they were expected to.
For performance-based assessment, typically scoring rubrics are adopted that define descriptive measures to separate levels of performance on a given task by delineating the criteria associated with learning activities (Moskal, 2000; McCauley, 2003; Whittaker et al., 2001). By converting rubric scores, grades are determined in order to provide instructional feedback.
Here, we aim to develop a scale for assessing the proficiency of students in basic ML concepts. As a part of the scale, we define a scoring rubric establishing criteria used for scoring the created ML artifacts from the point of view of the instructor in the context of K-12 education, primarily middle and high school. The scoring rubric describes how observable variables summarize a student's performance in the task of developing an ML model for image recognition from the work products that are produced by the student during this task.
As currently, almost every student is a novice to ML, we focus on the use stage of the learning cycle on which students start to develop pre-defined ML models, for example, by following a step-by-step tutorial. The assessment is defined in conformity with the K-12 Guidelines for Artificial Intelligence (Touretzky et al., 2019) referring to Big Idea 3 -Learning, AI literacy as defined by Long and Magerko (2020), covering general computing topics as proposed by the Computer Science Teachers Association (CSTA, 2017). Here, we focus exclusively on learning objectives related to the development of ML models using a supervised learning approach for image recognition enabling students to become creators of intelligent solutions (Kandlhofer et al., 2016;Long & Magerko, 2020;Sulmont et al., 2019;Touretzky et al., 2019).
Building an ML application in a human-centric manner is an iterative process that requires students to complete a sequence of phases on the use stage (Amershi et al., 2019; Mathewson, 2019) with the help of visual ML tools such as Google Teachable Machine (Carney et al., 2020; Gresse von Wangenheim et al., 2021; Gresse von Wangenheim et al., 2020):
Data management: During this step, data is either collected or pre-assembled datasets are provided that may be low-dimensional to facilitate understanding or be messy on purpose to demonstrate issues of bias (D'Ignazio, 2017; Sulmont et al., 2019). The data is cleaned by excluding messy images and balanced so that it includes the same number of images for each category. For supervised learning, the datasets also need to be labeled. The data set is typically split into a training set to train the model and a test set to perform an unbiased performance evaluation of the model on unseen data.

Model learning: An ML model is typically built upon pre-trained models that have been proven effective in comparable situations by training the model with the data and using a specific learning algorithm. Training parameters, for example, the learning rate, epochs, and batch size, are specified to improve performance. After the transfer learning step, the performance of the model can also be improved by hyperparameter tuning.
Model evaluation: The model can be tested with new images that have not been used for training. In addition, performance metrics (e.g., accuracy) can be analyzed and interpreted, identifying possible improvement opportunities. The performance results can also be visualized as a confusion matrix, a table that in each row presents the number of examples of predicted categories while each column represents the number of examples of actual classes, facilitating the identification of data that is not classified correctly.
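The model evaluation step can be illustrated with a minimal Python sketch. It is not part of the proposed assessment or of any tool mentioned here; the category labels, the test results, and the function names are hypothetical and chosen only for illustration. Following the description above, rows hold the predicted categories and columns hold the actual classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Build a confusion matrix as a list of rows.

    Rows are predicted categories and columns are actual classes,
    matching the orientation described in the text.
    """
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]


def accuracy(actual, predicted):
    """Fraction of test images whose predicted category equals the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def per_class_accuracy(matrix, labels):
    """Fraction of each actual class that was classified correctly (column-wise)."""
    result = {}
    for j, label in enumerate(labels):
        column_total = sum(row[j] for row in matrix)
        result[label] = matrix[j][j] / column_total if column_total else 0.0
    return result


# Hypothetical test results: nine unseen images, three per category.
labels = ["cat", "dog", "bird"]
actual = ["cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird"]
predicted = ["cat", "cat", "dog", "dog", "dog", "dog", "bird", "cat", "cat"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"predicted {label:>4}: {row}")
print("accuracy:", round(accuracy(actual, predicted), 3))
print("per-class accuracy:", per_class_accuracy(matrix, labels))
```

In this invented example, the per-class values point to the category with the most misclassified images, which is the kind of improvement opportunity a student would be expected to identify and interpret.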
Considering the application of this ML process in the use stage, other phases of the human-centric ML process are not considered, as requirements are typically pre-defined and the model deployment and monitoring phase may represent additional content in combination with other computing/programming courses. And, adopting deep learning, feature design is shifted to the underlying learning system along with classification learning. Furthermore, to support students in their first steps to start to understand ML, certain fine-grained details of the neural network structure may be concealed as black boxes to lessen cognitive load (Resnick et al., 2000). Based on this domain analysis, the learning objectives are defined as presented in Table 2.

Table 2. Learning objectives
LO2 Train an ML model (Touretzky et al., 2019; Long and Magerko, 2020)
LO3 Evaluate the performance of the ML model
Based on the human-centric ML process (Amershi et al., 2019; Long and Magerko, 2020)

Concerning these learning objectives, we defined a performance-based assessment based on the artifacts developed by the students as outcomes of the learning process. Adopting the visual tool Google Teachable Machine (Google, 2020) for teaching the creation of ML models at this educational stage (Carney et al., 2020), evidence for the achievement of these learning objectives can be obtained based on the ML artifacts developed by the students, including the prepared dataset, model training parameters, and evaluation results (Fig. 2). Therefore, we define an initial scale defining the items to be measured to assess the ability to develop an ML model, indirectly inferring the achievement of ML competencies. The criteria to be used in scoring the artifacts created by the students during the development of an ML model are defined as a rubric (Table 3). These scores can be used to provide instructional feedback guiding the students' learning as well as indicating improvement opportunities concerning the instructional unit.
Performance levels are defined on a 3-point ordinal scale ranging from poor to good based on typical learning outcomes expected at this educational stage.
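As an illustration of how ratings on such a 3-point scale could be aggregated, the following Python sketch sums the points per criterion and lists the criteria rated below the highest level. The point values follow the performance levels used in the evaluation (0, 1, and 2 points); the criterion identifiers, the example ratings, the function names, and the conversion to a fraction of the maximum are assumptions made for this sketch and are not prescribed by the rubric.

```python
# Points per performance level on the 3-point ordinal scale.
LEVEL_POINTS = {"poor": 0, "acceptable": 1, "good": 2}


def score_artifact(ratings):
    """Return (total points, maximum points, fraction achieved) for one artifact.

    `ratings` maps a criterion identifier to a performance level name.
    """
    if not ratings:
        raise ValueError("at least one rated criterion is required")
    total = sum(LEVEL_POINTS[level] for level in ratings.values())
    maximum = max(LEVEL_POINTS.values()) * len(ratings)
    return total, maximum, total / maximum


def improvement_opportunities(ratings):
    """List the criteria rated below the highest level, as hints for feedback."""
    return sorted(c for c, level in ratings.items() if level != "good")


# Hypothetical ratings for one student artifact (placeholder criterion ids).
example_ratings = {
    "C1": "good",
    "C2": "acceptable",
    "C3": "good",
    "C4": "poor",
}

total, maximum, fraction = score_artifact(example_ratings)
print(f"{total}/{maximum} points")
print("Below highest level:", improvement_opportunities(example_ratings))
```

Keeping the per-criterion ratings next to the total preserves the information needed for instructional feedback, which a single aggregated grade would hide.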

Evaluation of the Scoring Rubric
In order to analyze the quality of the scoring rubric, in terms of internal consistency and content validity, we conducted a series of scientific procedures including an expert panel review and statistical analysis to determine primarily psychometric properties of the rubric.

Definition of the Evaluation
As a part of the evaluation, the experts performed assessments of two ML models created as exemplary learning outcomes using the developed rubric. The artifacts created as learning outcomes include the prepared dataset, the training parameters, as well as an evaluation report documenting the tests run and the analysis and interpretation of the evaluation results (Fig. 2). On purpose, we prepared one weak learning outcome (e.g., few images in the dataset, few test runs) and one strong one that satisfies almost all criteria at the highest performance level. Internal consistency was evaluated regarding the inter-rater reliability of the assessments by the experts. Once experienced with the application of the rubric, the experts provided further feedback concerning content validity with respect to correctness, completeness, clarity, and relevance (Lawshe, 1975; Moskal and Leydens, 2000; Rubio et al., 2003). Each question was rated dichotomously by the experts, who suggested changes when necessary. The priority of each of the assessment criteria was rated on a 3-point ordinal scale ranging from not relevant to essential.

Execution of the Evaluation
We systematically selected participants from the Computing in School initiative at the Federal University of Santa Catarina and external participants, who are recognized experts with academic and/or practical experience in the subject matter. The participants were invited via email explaining the objective of the evaluation and assuring confidentiality. Participation was voluntary. Instructions and data collection forms were made available online. We invited 20 experts and obtained a response rate of 80% (n = 16). The majority of the participants have experience and knowledge in machine learning and/or computing education, while also including experts from related areas such as mathematics, computer graphics, and psychology enabling the collection of feedback from different points of view (Fig. 3). Although most participants are researchers, four participants are K-12 teachers representing directly the target audience.

What is the Evidence of Internal Consistency of the Performance-based Assessment Scoring Rubric?
In order to evaluate whether the design of the scoring rubric allows a reliable assessment (Moskal & Leydens, 2000), we analyzed inter-rater reliability among the experts' assessments of the two examples of ML learning outcomes. For the analysis, we used Fleiss' kappa (Fleiss et al., 2003), a measure that extends Cohen's kappa to assess the level of agreement between two or more assessors, in our case 16. Commonly, values below 0 indicate less than chance agreement, values between 0.01-0.20 indicate slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-0.99 almost perfect agreement (Landis & Koch, 1977).
Inter-rater reliability. Analyzing the 11 items of the scoring rubric using the assessment from 16 experts, we obtained a value of Fleiss kappa = 0.617 indicating substantial agreement. This is confirmed by the p-value (p < 0.0001), indicating that the kappa value is significantly different from zero. Individual kappa values for each of the performance levels separately were also computed and compared to all other categories together (Table 4).
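For readers unfamiliar with the statistic, the following Python sketch shows the standard computation of Fleiss' kappa together with the agreement bands of Landis & Koch (1977) listed above. It is not the analysis script used in this study; the rating counts are invented (four items, three raters, three performance levels), and the function names are hypothetical.

```python
def fleiss_kappa(counts):
    """Compute Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands of Landis & Koch (1977)."""
    if kappa < 0:
        return "less than chance"
    # A value of exactly 0 is grouped with "slight" in this sketch.
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented counts: columns are the levels poor, acceptable, good.
example_counts = [
    [2, 1, 0],
    [0, 3, 0],
    [1, 0, 2],
    [0, 0, 3],
]
kappa = fleiss_kappa(example_counts)
print(round(kappa, 3), interpret_kappa(kappa))
```

In the setting of this study, each row would correspond to one rubric criterion of one exemplary learning outcome, and each row would sum to the 16 experts.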
Fig. 3. Number of participants per expertise knowledge area, including "Practical experience in teaching computing subjects in K-12" and "Experience in creating a Machine Learning model" (as some experts have expertise in more than one knowledge area, the sum of participants in different areas is bigger than the number of participants).
A substantial agreement between assessors can be observed on the lowest (i.e., poor) and highest (i.e., good) performance level, while, on the other hand, only a moderate agreement can be observed on the intermediate performance level (i.e., acceptable) (Table 4). This points out that it seems easier to recognize very good or very poor performance, rather than an intermediate performance that may be classified as either good or poor respectively by some assessors.
We also computed individual kappa values for each of the performance levels separately for the exemplary weak and strong learning outcomes assessed by the experts.
Here, we can observe substantial agreement between assessors on the lower performance levels, while they demonstrated only a slight agreement on the highest performance level (Table 5). It seems rather surprising that even for a learning outcome that has been constructed as weak on purpose, some assessors still rated some criteria on the highest performance level. While in one case such a higher assessment may be related to the specific perception of some assessors, other criteria such as C2 and C4 demonstrated a variance across all three performance levels, indicating that the assessment of the relevance and categorization of the images can be difficult to judge.
Different from these results, higher performance levels are more consistently rated than the lowest level concerning the exemplary strong learning outcome (Table 6).
While criteria C2 and C4 have been rated much more uniformly in this case, criteria C5, C8, and C10 presented variances across all performance levels. The divergence regarding C5 may be due to the low quality of the images presented to the experts, making the identification of messy images difficult. The disagreement concerning the interpretation criteria may be due to different ML knowledge levels of the assessors, indicating a need for well-trained K-12 teachers and/or automated support for the assessment.

Table 5. Computed kappa for the assessment of the exemplary weak learning outcome
Performance levels    Poor - 0 pt.    Acceptable - 1 pt.    Good - 2 pt.
Fleiss' kappa         0.606           0.545                 0.180

Table 6. Computed kappa for the assessment of the exemplary strong learning outcome
Performance levels    Poor - 0 pt.    Acceptable - 1 pt.    Good - 2 pt.
Fleiss' kappa         0.455           0.541                 0.526

Yet, for most items, the majority of assessors agreed on the same rating. Initial results, thus, demonstrate that in general, a substantial agreement can be achieved. However, larger-scale studies are required to study the differences observed on a broader variety of learning outcomes.

What is the Evidence of Content Validity of the Performance-based Assessment Scoring Rubric?
Most participants considered the criteria and performance levels in general as correct (88%), complete (75%), and clear (63%).
Correctness: Concerning correctness, three experts observed that the criteria related to the interpretation of the confusion matrix may not admit the possibility that object classification errors are not identified correctly, while the interpretation of the model is correct. As criterion C10 combines these two aspects, either additional performance levels have to be added to comprehensively represent all combinations or the criterion needs to be split into two separate ones. Another suggestion is also related to a more detailed refinement of performance levels, e.g., by dividing the highest performance level of criterion C7 (Tests with new objects) into several levels.
Relevance: All items of the rubric have been considered essential by most experts on a 3-point ordinal scale ranging from irrelevant to essential, with few experts considering some criteria as only desirable (Fig. 4). None of the criteria has been considered to be irrelevant. Analyzing the content validity ratio defined as CVR = (Ne - N/2) / (N/2), in which Ne is the number of experts marking a criterion as essential and N is the total number of experts, the adequacy of the rubric was also confirmed. Only criterion C3 (Distribution of the dataset) has a CVR = 0.38, below the threshold of 0.49 (Lawshe, 1975), as in this case four experts considered the criterion only desirable but not essential. Consequently, this criterion could be excluded to minimize the assessment effort.
Completeness: Some participants suggested new criteria, i.e., checking whether the student did not reuse images of the training set to test the model and/or whether the tests included at least one test for each image category. Other suggestions, requiring the inclusion of further learning content and/or reporting by the student, include the assessment of the student's reflections on whether his/her tests are sufficient, and further data pre-processing activities.
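The content validity ratio used in the relevance analysis can be computed directly from its definition. The Python sketch below is only an illustration: the criterion names and the numbers of "essential" votes are invented and do not reproduce the study data, and the function names are hypothetical. The panel size of 16 and the threshold of 0.49 are the values reported in the text; in general, the critical value depends on the panel size and should be taken from Lawshe (1975).

```python
def content_validity_ratio(n_essential, n_experts):
    """CVR = (Ne - N/2) / (N/2), following Lawshe (1975)."""
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("need 0 <= n_essential <= n_experts and n_experts > 0")
    half = n_experts / 2
    return (n_essential - half) / half


def below_threshold(essential_votes, n_experts, threshold):
    """Return the criteria whose CVR falls below the given critical value."""
    return sorted(
        criterion
        for criterion, n_essential in essential_votes.items()
        if content_validity_ratio(n_essential, n_experts) < threshold
    )


# Invented votes for three placeholder criteria, with a panel of 16 experts.
votes = {"criterion A": 16, "criterion B": 14, "criterion C": 10}
for name, n_essential in votes.items():
    print(name, content_validity_ratio(n_essential, 16))
print("candidates for removal:", below_threshold(votes, 16, threshold=0.49))
```

The ratio ranges from -1 (no expert rates the criterion as essential) to 1 (all experts do) and equals 0 when exactly half of the panel rates it as essential.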
Clarity: Some items are quite subjective compared to the other ones, which can cause uncertainty and inaccuracies from the point of view of the assessor. This may also be the reason for a lack of higher inter-rater agreement. Yet, the disagreement observed even on quantifiable criteria, such as C1 (Quantity of images), may also indicate other factors. Some participants suggested revising the wording to be less technical in order to be more easily understood by non-computing K-12 teachers. Furthermore, as some criteria depend on the specific ML model developed by the students, e.g., C9 and C10 aiming at assessing if the student correctly identified categories with low accuracy and correctly interpreted the evaluation results, substantial ML knowledge and effort from the assessor is required. As this may not be given currently in the context of K-12 education on a larger scale, a possible solution would be to automate the assessment, refining the criteria through fine-grained rules and/or adopting Machine Learning techniques to assess subjective criteria.
All experts considered the rubric applicable in K-12 education, taking into consideration a careful definition of the specific learning objectives and strategies, as well as its complementation by other types of assessments to obtain a more comprehensive understanding of the students' learning performance.

Discussion
Considering the importance of innovative educational opportunities for young people to gain a better understanding of ML concepts and practices to succeed in the 21st century, we aim to support the teaching of ML, primarily in middle and high school, by proposing a scoring rubric for the performance-based assessment of ML learning outcomes.
In this regard, the presented research aims at advancing the current state of the art like, different from most other approaches using quizzes or tests, proposes a scoring rubric based on the ML artifacts created by the students. The few other scoring rubrics for this kind of assessment encountered during the literature review either aim at the assessment of the end result in a more abstract way or on the presentation of the result. For example, the rubric to be used by judges in the AI Family Challenge (Technovation Families, 2019a) focuses on a general assessment of the end result in the challenge taking students to create their own intelligent solutions. It includes criteria on the ideation such as "How well does the team's invention solve the problem in their community?", project development, pitch and communication, and overall expression (i.e., How much does the submission stand out from others?). Regarding project development, the rubric only includes three criteria (How well does the invention use AI or other technologies?; How well thought out is the team's prototype or plan to create a prototype?; Does the invention solve the problem in a unique way?) not covering in a more detailed way the ML concepts to be learned. Another example of a scoring rubric is given as part of the Alternate Curriculum Unit: AI (Exploring Computer Science, 2019). This rubric is used for the assessment of the presentation of the final results of the student including only criteria related to presentations such as content quality, presentation quality, image and video presentation, use of English conventions. In this way, the proposed scoring rubric represents the first step for a more detailed assessment of the ML artifacts created as learning outcomes.
Furthermore, to date, we have not encountered any rigorous information on the evaluation of the reliability and validity of the assessment approaches proposed in the literature; thus, our research also stands out by conducting an initial evaluation through an expert panel. Results of this initial evaluation mostly confirmed the definition of the established criteria and performance level descriptors as a part of the proposed scale. Using the assessment from 16 experts, a value of Fleiss' kappa = 0.617 showed that a substantial inter-rater agreement can be obtained using the scale. It seems, however, easier to recognize very good or very poor performance than an intermediate performance, which may be classified as either good or poor depending on the assessor. Observed inconsistencies between the assessments of different assessors may also be related to the specific perception of some assessors and their proficiency level in Machine Learning, pointing out the need for well-trained K-12 teachers to enable reliable assessments.
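To make the reported agreement statistic more concrete, the following minimal Python sketch computes Fleiss' kappa from a table of rating counts using the standard textbook formula. The function name and the input layout (one row per rated item, one column per performance level, each cell holding the number of assessors who chose that level) are assumptions made for illustration; the numbers in the usage example are hypothetical and are not the data of this study.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one rated item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    n_categories = len(counts[0])

    # Observed agreement per item, averaged over all items.
    per_item: List[float] = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance from the overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical example: 4 items, 5 raters, 3 performance levels.
    table = [
        [5, 0, 0],
        [0, 4, 1],
        [1, 1, 3],
        [0, 0, 5],
    ]
    print(round(fleiss_kappa(table), 3))
```

Values between 0.61 and 0.80 are commonly read as substantial agreement under the bands of Landis and Koch, which is consistent with the interpretation of the value reported above.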
Some criteria, such as the one related to the inclusion of messy images, may also be difficult to assess manually, indicating an opportunity to automate this kind of performance assessment in order to achieve more consistent results while also reducing the assessment effort in practice.
Regarding content validity, all experts considered the scoring rubric applicable in K-12 education, taking into consideration a careful definition of the specific learning objectives and strategies, as well as its complementation by other assessments to obtain a more comprehensive understanding of the students' learning performance. This is also confirmed by the results of the analysis of the content validity ratio. The only exception is criterion C3. Distribution of the dataset, which, with a CVR = 0.38, remained below the expected threshold, as four experts considered the criterion desirable but not essential. Yet, as it is more probable to achieve acceptable performance of an ML model using a well-balanced dataset with more or less the same quantity of images for each of the categories, we still consider this an important assessment criterion to be revised in further studies.
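A simple way to quantify how well-balanced a dataset is, in the sense discussed for criterion C3, is the ratio between the smallest and the largest number of images per category. The following Python sketch is only an assumption of how such a check could look; the function name and the example counts are hypothetical, and no cut-off values of the rubric are implied.

```python
from typing import Mapping


def balance_ratio(images_per_category: Mapping[str, int]) -> float:
    """Ratio of the smallest to the largest category size (1.0 = balanced)."""
    if not images_per_category:
        raise ValueError("at least one category is required")
    sizes = list(images_per_category.values())
    if any(size < 0 for size in sizes):
        raise ValueError("category sizes must not be negative")
    largest = max(sizes)
    if largest == 0:
        raise ValueError("the dataset contains no images")
    return min(sizes) / largest


if __name__ == "__main__":
    # Hypothetical student dataset with three image categories.
    dataset = {"paper": 40, "plastic": 38, "glass": 12}
    print(round(balance_ratio(dataset), 2))
```

A value close to 1 indicates roughly the same quantity of images in each category, whereas a value close to 0 indicates that at least one category is strongly under-represented.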
All experts considered the criteria correct, with two exceptions: one criterion related to the interpretation of the confusion matrix seems to combine two different criteria and should therefore be separated, and the criterion related to the number of tests performed with new objects was suggested to be refined into more performance levels.
Regarding completeness, some new criteria have been suggested, such as checking whether the student reused images from training for testing and/or whether the tests included at least one test for each image category. Concerning the clarity of the definition of the criteria and performance level descriptors, some participants suggested revising the wording to be less technical so that it is more easily understood by non-computing K-12 teachers, again pointing out the need for teachers to be well-trained in Machine Learning. Therefore, some criteria might need to be further refined and additional criteria added for a more comprehensive assessment.
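The two additional criteria suggested by the participants lend themselves to a rule-based check. The following Python sketch illustrates one possible form, assuming that each image can be identified by some stable identifier (for example, a file name or a content hash) and that each test is recorded together with the category of the tested image. All names and example data are hypothetical.

```python
from typing import Iterable, List


def reused_training_images(
    training_ids: Iterable[str], test_ids: Iterable[str]
) -> List[str]:
    """Return the identifiers of test images that also occur in the training set."""
    return sorted(set(training_ids) & set(test_ids))


def untested_categories(
    categories: Iterable[str], tested_categories: Iterable[str]
) -> List[str]:
    """Return the categories for which no test was performed."""
    return sorted(set(categories) - set(tested_categories))


if __name__ == "__main__":
    # Hypothetical identifiers of the images used for training and testing.
    training = ["img_001", "img_002", "img_003", "img_004"]
    tests = ["img_003", "img_101", "img_102"]
    print(reused_training_images(training, tests))

    # Hypothetical categories of the model and of the tested images.
    model_categories = ["paper", "plastic", "glass"]
    tested = ["paper", "paper", "glass"]
    print(untested_categories(model_categories, tested))
```

An empty result from both functions would indicate that the tests used only new images and covered every image category at least once.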
Threats to validity. To mitigate threats related to the research design, we systematically developed the scoring rubric based on an analysis of the educational context, adopting methods for rubric definition, and conducted an initial evaluation in the form of an expert panel. Another threat is related to the diversity and sample size of the participants in the evaluation. Although primarily including experts from the computing area, the panel does cover diverse areas of interest, representing diverse points of view, as well as the target audience including K-12 teachers. In terms of size, there is also evidence that 16 experts are sufficient to draw conclusions (Lawshe, 1975). To reduce threats associated with the data analysis, we conducted a statistical evaluation following Lawshe (1975), Moskal and Leydens (2000), and Rubio et al. (2003). We followed the methodology proposed by Lawshe (1975) to minimize any impact of bias due to the subjectiveness of the experts' feedback. As this represents only an initial evaluation, larger-scale studies are necessary to confirm the results and analyze open issues.

Conclusion
Based on the domain analysis and modeling regarding the teaching and learning of basic ML concepts, we propose a scoring rubric as part of a scale for the performance-based assessment of ML learning outcomes in the context of K-12 computing education. Results of an initial evaluation mostly confirmed the definition of the established criteria and performance level descriptors as a part of the proposed scale, indicating a substantial inter-rater agreement as well as content validity in terms of correctness, relevance, completeness, and clarity.
Thus, based on first positive feedback, the proposed scoring rubric presents a first step for the assessment of open-ended ML learning activities regarding image recognition with supervised learning. It can be used by instructional designers and researchers to evolve support for the assessment in the context of teaching Machine Learning in K-12, as well as by instructors to assess the outcomes of students in this educational context. Although our focus in this article is on the proposal of a performance-based assessment based on learning outcomes, in educational practice this kind of assessment should be complemented by other types of assessment such as observations or interviews.
Based on these results, we are currently revising the initial scale. We are also planning further studies to collect data based on learning outcomes created by students aiming at the development of a measurement model using Item Response Theory as a part of the evidence model that gives information about the connection between the student model variables and observable variables.