Recent Advances in Research on the Xu-Argument and Future Directions
Exploring the Reliability and Validity of AI Scoring in the Continuation Writing Task
Jie Zhang, Chunhua Ma*
Chinese Journal of Applied Linguistics · February 25, 2026 · Vol. 49, No. 1, 2026 · DOI: 10.1515/CJAL-2026-2026-0105

AI Summary

1. Introduction

This chapter introduced the continuation writing task as an integrative reading and writing assessment increasingly used in high-stakes exams like the Gaokao. It highlighted the task’s dual demands on students to produce original, creative continuations that remain coherent with the source text’s plot, tone, and emotional logic. These demands elevate core constructs such as content creativity, textual integration, narrative coherence, and emotional depth beyond traditional emphasis on linguistic accuracy, thereby complicating consistent and fair scoring. The chapter discussed the challenges human raters face in reliably evaluating these constructs despite training, emphasizing the difficulty of assessing creativity and alignment. It then described recent advances in Generative AI tools, contrasting them with conventional Automated Essay Scoring systems by their ability to holistically assess narrative and affective elements and adapt scoring approaches across formative and summative contexts. The potential of AI to address scoring subjectivity and construct complexity in continuation writing was explored. The study’s aim to compare AI-generated ratings and feedback with those of experienced human raters was outlined, focusing on reliability, validity, scoring severity, focus differences, and effective AI integration into assessment. The chapter positioned this research as contributing valuable insights for educators seeking to leverage AI for improved teaching and evaluation in continuation writing.

2. Literature Review

This chapter discussed the assessment of continuation writing tasks, highlighting their introduction into the Gaokao English test, where test takers extend a narrative by writing two paragraphs that align with a given opening sentence. This task integrates reading comprehension and writing production, fostering interactive alignment in vocabulary, syntax, and discourse between the learner’s text and the source narrative. Introduced provincially in 2016 and later adopted nationwide, the task discourages formulaic responses, prompting teachers to engage students in interactive reading and collaborative writing, with students focusing more on vocabulary and cohesion. Recent research has primarily examined reliability and validity, construct structure, and task design, with studies confirming the task’s ability to assess global language proficiency and defining its construct as involving reading, writing, logical thinking, and learning ability, with an emphasis on content generation conditioned on integration with the source text. The task differs from traditional essay prompts in requiring novel yet constrained content that aligns coherently and logically with the preceding text. Despite advances in construct definition, little is known about how such competencies are operationalized in scoring, which is critical for teaching and high-stakes testing. Studies eliciting teachers’ evaluative criteria identified content creativity, language use, coherence, and writing conventions as the dominant dimensions, with variation in how raters weight originality and textual integration. Rater training has been shown to enhance scorer consistency but faces challenges, especially in scoring mid-level continuations, due to content-language trade-offs. Prompt explicitness has been found to influence rater agreement: opening-sentence-only prompts achieve the highest inter-rater consistency, while absent or overly specific prompts increase variability. Thus, the meaning of a score emerges from the interaction of textual features, prompt design, and rater norms, highlighting the need to stabilize this intersection for fairness. The chapter identified uncertainty around balancing content and language evaluation in continuation writing as an obstacle both to positive instructional washback and to reliable, valid assessment outcomes in large-scale exams.

The discussion of technology-assisted writing assessment traced its origins to the 1960s and the development of automated essay scoring (AES) systems like E-rater and IntelliMetric, which use statistical and machine learning techniques to approximate human scoring with high reliability and validity in high-stakes contexts. AES validity is typically supported by correlations with human scores and rubric alignment through expert review. However, these systems tend to rely on fixed rules and statistical models that limit capturing essay semantics, logical structure, coherence, and rhetorical effects, especially for non-standard or low proficiency writing. AES also struggles with cultural and subject-area language variations. Furthermore, student acceptance of AES feedback is generally low, particularly among advanced learners who find it mechanical and lacking detailed, actionable explanations. The advent of AI, especially generative AI tools like ChatGPT, has introduced new potential for writing assessment. ChatGPT has demonstrated notable consistency with human raters and can provide feedback encompassing grammar, vocabulary, content, organization, and logic. Studies indicate that ChatGPT’s feedback can enhance learners’ understanding and writing skills more comprehensively than human teachers’ feedback. Current research continues to apply dual validation strategies—examining quantitative score agreement and rubric-aligned commentary—to assess AI reliability and validity in scoring. Despite these advancements, specific research on using generative AI for continuation writing assessment remains scarce. Given the challenges faced by teachers in scoring continuation writing tasks, this study aims to explore the feasibility of employing GenAI tools in this context by investigating the extent to which AI-generated scores align with experienced teacher ratings and whether AI commentary reflects teacher evaluative criteria, thereby addressing both reliability and validity in AI-assisted scoring of continuation writing.

3. Methods

This chapter described the continuation writing task modeled on the Gaokao format, requiring participants to complete a partially presented narrative by writing two paragraphs guided by given sentence starters. The reading text told a story about reuniting elderly people through the return of a lost wallet, with moderate difficulty indicated by a Flesch-Kincaid Grade Level of 5.2. Data collection involved administering the task to 63 Grade 11 students from a high school in southeast China, with 21 writings sampled for analysis after exclusions and transcription. AI scoring was performed using ChatGPT 4.0, which was prompted with task details, Gaokao criteria, and learner profiles to assign holistic scores on a 0–15 scale across the dimensions of content, language, and organization. The AI scored each text three times at separate intervals in new chat sessions and also adopted two different rater roles to assess scoring adaptability. Eight experienced high school English teachers independently rated the same 21 texts following Gaokao rubrics, providing holistic scores and qualitative comments on key text features. The data analysis focused on two main types of evidence: reliability and validity. AI intra-rater reliability was evaluated across the three rounds and two rater roles using mean scores, variance, and Spearman correlations, while inter-rater reliability compared AI and teacher scores with similar statistical measures. A Many-Facet Rasch Measurement (MFRM) analysis using FACETS software examined severity, consistency, and bias across all raters and scoring conditions, treating raters and continuations as facets, with stricter scoring represented by higher severity values. Rater comments were systematically coded by segmenting long sentences into meaning units aligned with the three main Gaokao dimensions (content creation, language use, and text organization), and the coding framework was refined through iterative top-down and bottom-up development. Two researchers independently coded all comments with 92% agreement, resolving discrepancies to finalize the coding scheme. Quantitative counts of meaning units under secondary codes were tabulated, complemented by a qualitative thematic analysis comparing the focus areas in AI and teacher feedback.
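To make the reliability statistics concrete, the sketch below computes round-by-round descriptives, pairwise Spearman correlations, and the share of texts whose scores stay within two points across rounds. The scores are randomly generated placeholders, not the study’s data; only the 21-text sample size and the 0–15 scale come from the summary above.

```python
# A minimal sketch of the intra-rater reliability checks described above,
# assuming 21 continuations scored on the 0-15 Gaokao scale across three
# AI scoring rounds. All scores are illustrative placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical scores: rows = 21 continuations, columns = 3 scoring rounds.
ai_rounds = np.clip(np.round(rng.normal(10, 2, size=(21, 3))), 0, 15)

# Descriptives per round: mean and standard deviation.
print("means:", ai_rounds.mean(axis=0))
print("SDs:  ", ai_rounds.std(axis=0, ddof=1))

# Pairwise Spearman rank correlations between rounds.
for i in range(3):
    for j in range(i + 1, 3):
        rho, p = spearmanr(ai_rounds[:, i], ai_rounds[:, j])
        print(f"round {i + 1} vs round {j + 1}: rho={rho:.2f}, p={p:.3f}")

# Share of continuations whose three scores span at most two points.
within_two = np.ptp(ai_rounds, axis=1) <= 2
print("within 2 points across rounds:", f"{within_two.mean():.0%}")
```

The same pattern extends to the inter-rater comparisons by substituting a teacher’s score vector for one of the AI rounds.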

4. Results

This chapter described the reliability and validity of AI scoring in the continuation writing task through multiple analyses. The reliability of AI scoring was first examined by comparing three rounds of AI ratings taken at different times. The mean scores increased slightly across rounds, and the third round had the smallest standard deviation, indicating more consistent scoring. Spearman correlation coefficients between rounds showed moderate to strong significant correlations, and score differences for most continuations were within two points, demonstrating reasonable intra-rater consistency over time.

Next, AI scoring consistency was analyzed between two roles: acting as a teacher versus a high-stakes test rater. Although mean scores differed significantly—with stricter scoring observed in the high-stakes role—the rank ordering correlation was high, indicating that the AI maintained consistent ranking despite changes in scoring severity. Score differences between roles revealed the AI’s ability to modulate strictness based on context, reflecting distinct criteria for classroom versus high-stakes environments.
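The role-consistency check can be illustrated with a paired comparison: a test of the severity shift between roles plus a Spearman correlation for rank stability. The Wilcoxon signed-rank test is an assumption here; the summary does not name the significance test the study used, and all scores are hypothetical.

```python
# Sketch of the teacher-role vs. high-stakes-role comparison on the same
# 21 texts. Scores are hypothetical; the Wilcoxon signed-rank test is an
# assumed choice, not necessarily the study's test.
import numpy as np
from scipy.stats import spearmanr, wilcoxon

rng = np.random.default_rng(1)
teacher_role = np.clip(np.round(rng.normal(11, 2, 21)), 0, 15)
# High-stakes role: same rank order, shifted down by a small random amount.
highstakes_role = np.clip(teacher_role - rng.integers(0, 3, 21), 0, 15)

stat, p_diff = wilcoxon(teacher_role, highstakes_role)   # severity shift
rho, p_rank = spearmanr(teacher_role, highstakes_role)   # rank stability
print(f"severity shift: W={stat:.1f}, p={p_diff:.3f}")
print(f"rank correlation: rho={rho:.2f}, p={p_rank:.3f}")
```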

Consistency between AI raters and human teacher raters was also explored. Eight teachers showed varying mean scores and standard deviations, with most teachers’ ratings significantly correlated in rank order except for a few outliers. Differences among teachers were comparable to those found across AI scoring rounds, indicating that human raters experience similar variability. Teachers’ mean scores generally fell between the AI’s two role-based scores. The correlation between the average teacher scores and AI scores was low and not significant, suggesting discrepancies in evaluation criteria and resulting in divergent ranking of story continuations. Further analysis showed that teachers tended to give higher scores than AI for high-level continuations and lower scores for lower-level ones, possibly due to heightened response to salient features in the text.

Severity and self-consistency of raters were assessed using the MFRM analysis, which provided severity measures and fit statistics. Most raters, human and AI alike, demonstrated severity values within an acceptable range, indicating moderate variance in scoring strictness. The AI clearly distinguished severity between its two roles. Infit statistics suggested that most raters displayed good consistency with the model’s expectations; however, one teacher showed signs of inconsistency, and two others showed overfitting tendencies. AI ratings were mostly well aligned with expected consistency, with only one round showing slight overfitting, indicating reliable internal consistency.
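For readers unfamiliar with the terminology, the rating-scale form of the many-facet Rasch model below shows where the severity and fit terms come from; this is a standard two-facet illustration, since the summary does not report the exact FACETS specification used in the study.

```latex
% Rating-scale form of the many-facet Rasch model for a two-facet design
% (continuation n, rater j, score category k); an assumed illustration.
\[
  \log \frac{P_{njk}}{P_{nj(k-1)}} = B_n - C_j - F_k
\]
% B_n : latent quality of continuation n
% C_j : severity of rater j (higher values = stricter scoring)
% F_k : difficulty of awarding category k rather than k-1
```

Infit is an information-weighted mean-square fit statistic: values near 1 indicate scoring consistent with model expectations, values well above 1 flag erratic (misfitting) scoring, and values well below 1 flag overfit, i.e., less variation than the model expects, which matches how the teacher and AI patterns are characterized above.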

The validity of AI ratings was evaluated by comparing the areas of focus and rationales cited by AI and teachers. Frequencies of coded categories in rater comments were adjusted to allow comparison despite different numbers of raters. The average frequency of references to various aspects in the 21 continuations highlighted similarities and differences in what teachers and AI emphasized during scoring. This analysis aimed to understand whether AI and human raters shared common evaluation criteria or differed in their focus when justifying scores.
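The frequency adjustment can be sketched as a simple per-rater normalization: raw counts of coded meaning units are divided by the number of raters contributing comments in each group, so the eight teachers and the AI conditions become comparable. The category counts and the assumption of three AI conditions below are illustrative.

```python
# Sketch of the per-rater frequency adjustment; counts are illustrative,
# and treating the AI as three raters (one per scoring round) is an assumption.
raw_counts = {
    "content creation":  {"teachers": 96,  "ai": 30},
    "language use":      {"teachers": 120, "ai": 24},
    "text organization": {"teachers": 40,  "ai": 18},
}
n_raters = {"teachers": 8, "ai": 3}

for category, counts in raw_counts.items():
    adjusted = {group: counts[group] / n_raters[group] for group in counts}
    print(category, adjusted)   # average meaning units per rater
```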

5. Discussion

This chapter discussed the reliability of AI scoring in continuation writing tasks, revealing that AI scores show only minor fluctuations over time when scoring identical prompts and continuations. The rank order correlations across multiple scoring rounds were high and statistically significant, indicating internal consistency. The study also showed the AI’s ability to adjust scoring severity and differentiation based on assigned roles, such as teacher or high-stakes examiner, with high correlations between these role-based scores. The AI was generally more lenient than teachers except when acting as a high-stakes rater, and it exhibited slightly lower score variability. The low and statistically non-significant correlation between AI and teacher scores highlighted differences in scoring approaches. These results imply that AI scoring functions probabilistically under prompt guidance rather than following fixed rules, providing stable, self-consistent assessments. Human raters displayed notable individual differences despite training, contributing substantially to score variability; specific cases included one stricter teacher and one poorly fitting scorer, accentuating human inconsistency. The chapter pointed out that while AI lacks human interpretability, its high intra-rater consistency positions it as a reliable tool for formative assessment, capable of delivering timely feedback that tracks students’ progress in narrative coherence and emotional expression, aspects often missed in traditional evaluations.

The validity of AI scoring was examined, showing the AI’s capacity to differentiate high- and low-quality continuations and to apply scoring criteria consistently in line with task demands. The AI adapted its scoring to the roles defined in prompts, demonstrating purposeful variation between teacher-like encouragement and exam-like differentiation. In comparison with teachers, the AI exhibited higher internal consistency across score ranges and concentrated on the principal rubric dimensions, particularly emphasizing the quality of content creation unique to continuation tasks. The AI’s evaluation leaned more heavily on overall effect, prioritizing emotional depth in content creation, vocabulary appropriateness over richness, and genre-specific narrative elements such as plot and character development. This provided a more comprehensive and integrative assessment than that of teachers, who tended to focus on discrete features such as lexical sophistication, grammatical accuracy, and cohesive devices. Teachers’ stricter, differentiation-focused scoring reflected their exam-oriented background, especially concerning Gaokao requirements, suggesting an internalized high-stakes perspective. The AI’s adaptability across scoring purposes and its holistic focus contrast with teachers’ operationalized priorities tailored to teaching goals. The findings accord with prior research indicating that generative AI feedback offers a broader evaluation scope than that of human raters. The AI’s strengths in assessing narrative coherence and emotional resonance addressed persistent challenges faced by human evaluators in judging content creativity and alignment with prompts. Consequently, AI scoring complements traditional assessments focused on linguistic accuracy by providing holistic feedback to support teachers in fostering richer story development. The divergence between AI and human scoring foci underscores the need for further research to fine-tune AI prompts and better harmonize AI evaluation with human standards, especially in high-stakes environments where alignment is critical.

Implications for teaching and assessing continuation writing emphasized leveraging AI’s complementary role alongside human raters in pedagogy and assessment practices. AI’s leniency on minor language errors and emphasis on emotional and narrative cohesion make its real-time feedback a useful tool for short revision cues, which research suggests can yield rapid text improvements. This emphasis redirects instructional focus from superficial lexical complexity to the alignment of new text with original narrative structure. To enhance student comprehension of AI feedback, guided worksheets translating abstract AI terminology into concrete writing moves have proven effective. The stable consistency of AI scoring across drafts enables teachers to monitor progress and identify persistent weaknesses, facilitating targeted micro-lessons aligned with assessment-for-learning principles. In formal assessments, AI’s reliability positions it as a potential second rater rather than a substitute for human judgment. It can triage papers by initially screening all submissions, allowing human raters to focus on borderline cases and expert adjudicators to resolve significant discrepancies, thereby reducing cost and rater disagreement. This hybrid approach preserves teacher authority while enhancing fairness and transparency in assessment. The chapter also cautioned that despite AI’s consistency, its scores remain probabilistic outputs influenced by input and prompt design, requiring ongoing refinement for optimal educational integration.
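The triage workflow described above can be made concrete with a small routing sketch. All cutoffs and the discrepancy threshold are hypothetical; the summary specifies the roles (AI screener, human rater, expert adjudicator) but not the numbers.

```python
# Sketch of a hybrid AI-human triage workflow on a 0-15 scale. The AI scores
# every script first; borderline scores get a full human rating, clear cases
# get a light spot check, and large AI-human gaps go to an adjudicator.
# All thresholds are hypothetical.
def route(ai_score: float, cut_low: float = 6.0, cut_high: float = 12.0) -> str:
    """Decide the next step after the AI's first-pass score."""
    if cut_low <= ai_score <= cut_high:
        return "human rater"   # borderline band: full second rating
    return "spot check"        # clearly high or low: light human check

def adjudicate(ai_score: float, human_score: float, max_gap: float = 3.0) -> str:
    """Escalate only when AI and human scores diverge substantially."""
    if abs(ai_score - human_score) > max_gap:
        return "expert adjudicator"
    return "final score"

print(route(10.5))             # -> human rater (borderline band)
print(adjudicate(10.5, 14.0))  # -> expert adjudicator (gap > 3)
```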

6. Conclusion

This chapter discussed extending AI-supported writing assessment from argumentative essays to continuation writing, which demands reading comprehension, creativity, and affective alignment. It reported a reliability and validity audit comparing AI scoring, under two scoring personas, with the ratings of experienced teachers, showing the AI’s ability to measure emotional depth and narrative stance consistently. The AI offered a diagnostic perspective that complements human judgment. Future research was suggested to explore task invariance across genres, improve prompt design to minimize human-machine ranking differences, and integrate AI-human hybrid scoring into large-scale testing to evaluate efficiency, rater consistency, and instructional impact for sustainable, ethical application.

* The above content was generated automatically by AI and is for reference only. This website assumes no commercial or legal liability for any consequences arising from the use of it.
