The background of the research project on the automated analysis of Chinese lexical richness is introduced. The project was supported by the Sino-Foreign Language Exchange and Cooperation Center of the Ministry of Education, under its special and major projects for international Chinese education research. Professor Yang Lijiao and students Xu Huidan and Qiu Danyang are acknowledged for contributing data resources, and the anonymous reviewers and editors are thanked for their valuable feedback.
The importance of lexical knowledge in language teaching and acquisition research is discussed. Receptive knowledge is distinguished from productive knowledge, and lexical richness is presented as a key measure of productive vocabulary proficiency. Lexical richness has been studied extensively in English-language research, whereas studies in Chinese linguistics remain relatively scarce and lack a systematic, comprehensive measurement framework. This study aims to construct dimensions, metrics, and tools for measuring lexical richness in Chinese texts. The work includes developing a lexical knowledge base, designing 145 measurement indicators, and extracting and analyzing them automatically with natural language processing techniques. Through essay scoring and text classification tasks, over 60 validated indicators were systematically organized, and the Chinese Lexical Richness Analyzer (CLRA) tool was created to support Chinese language teaching and research.
This section outlines the foundational design of Chinese lexical richness metrics, incorporating the construction of a large-scale lexical knowledge base and multifaceted feature considerations. The vocabulary list was adjusted based on the International Chinese Education Chinese Proficiency Grading Standards, addressing homographs and multifunctional words to ensure accurate level classification. High-frequency words were collected from the Dynamic Corpus of International Chinese Education and HSK test samples, with manual annotation and level assignment based on morpheme forms and meanings. The knowledge base also incorporated compulsory education level information. For word usage frequency, metrics were derived from word frequency and distribution scope, integrating data from the National Language Committee's Modern Chinese Corpus and tailored frequency statistics for pedagogical needs. Semantic-cognitive features included polysemy, hypernym-hyponym relations, semantic abstractness, transparency, and word reaction time, with attributes sourced from resources such as Tongyici Cilin (a synonym lexicon), open-source hypernym-hyponym datasets, Xu and Yang's lexical tables for abstractness, Qiu Danyang's dataset for semantic transparency, and Zhang et al.'s repository for reaction time metrics.
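A lookup of this kind can be sketched as follows. This is only an illustration under stated assumptions: the `LexEntry` fields, the toy entries in `KB`, and both helper functions are hypothetical and do not reproduce the project's actual knowledge base or its values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    """One hypothetical knowledge-base record for a word."""
    level: int        # assumed proficiency level of the word
    frequency: float  # assumed corpus frequency (made-up values below)
    senses: int       # assumed number of senses (polysemy)


# Toy entries with invented values, for illustration only.
KB = {
    "我": LexEntry(level=1, frequency=9000.0, senses=1),
    "喜欢": LexEntry(level=1, frequency=800.0, senses=1),
    "研究": LexEntry(level=3, frequency=300.0, senses=2),
    "词汇": LexEntry(level=4, frequency=40.0, senses=1),
}


def level_proportions(tokens, kb):
    """Share of tokens at each level; words missing from kb map to None."""
    if not tokens:
        return {}
    counts = Counter(kb[t].level if t in kb else None for t in tokens)
    return {level: n / len(tokens) for level, n in counts.items()}


def mean_frequency(tokens, kb):
    """Average frequency over the tokens that are found in kb."""
    freqs = [kb[t].frequency for t in tokens if t in kb]
    return sum(freqs) / len(freqs) if freqs else 0.0
```

Out-of-vocabulary words are kept under the `None` key rather than dropped, so the proportions still sum to one over the whole text.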
This section presents the automated analysis of Chinese lexical richness, proposing a multidimensional framework that covers lexical complexity, diversity, density, and length. For lexical complexity, the design comprises 78 vocabulary-level indicators, 30 compulsory-education vocabulary-level indicators, average word frequency, distribution scope metrics, and semantic-cognitive features. Lexical diversity is measured through 10 indicators, including TTR and Root TTR, with priority given to those less affected by text length. The lexical density and length metrics are the proportions of words by part of speech and by word length, respectively. In total, 145 indicators were designed. They are extracted automatically using natural language processing and the knowledge base, and the results are written to Excel reports.
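A few of these measures can be sketched on text that has already been segmented and tagged. TTR is the number of distinct words divided by the number of tokens, and Root TTR divides the number of distinct words by the square root of the number of tokens. The tag set and the choice of content-word tags in `CONTENT_POS` are assumptions made for this illustration, not the study's definitions.

```python
import math
from collections import Counter

# Assumed content-word tags: noun, verb, adjective.
CONTENT_POS = {"n", "v", "a"}


def ttr(tokens):
    """Type-token ratio: distinct words / total tokens."""
    return len(set(tokens)) / len(tokens)


def root_ttr(tokens):
    """Root TTR: distinct words / sqrt(total tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def lexical_density(tagged):
    """Share of tokens whose tag is in CONTENT_POS."""
    content = sum(1 for _, pos in tagged if pos in CONTENT_POS)
    return content / len(tagged)


def length_proportions(tokens):
    """Share of tokens for each word length, in characters."""
    counts = Counter(len(t) for t in tokens)
    return {length: n / len(tokens) for length, n in counts.items()}


# Pre-segmented toy sentence with hypothetical POS tags.
tagged = [("我", "r"), ("喜欢", "v"), ("研究", "v"), ("汉语", "n"),
          ("词汇", "n"), ("我", "r"), ("也", "d"), ("喜欢", "v"),
          ("语法", "n")]
tokens = [word for word, _ in tagged]
```

For this toy input there are 9 tokens and 7 distinct words, so TTR is 7/9 while Root TTR is 7/3. Doubling a text tends to lower TTR sharply, which is why length-corrected variants are preferred.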
This section details the development of the lexical richness indicator system in the automated analysis research. From an initial pool of 145 indicators, over 60 were validated and organized into four core dimensions. A test corpus comprising L2 essays, textbooks, and native-level materials was constructed. Based on stability (minimal text-length influence), predictive power, and independence, 76 robust indicators were selected. Further analysis revealed 75 indicators significantly correlated with essay scores or textbook levels, with 63 retained after addressing multicollinearity. These indicators spanned lexical complexity, diversity, density, and length, with 51 applicable to L2 essay evaluation and 53 to L2 textbook difficulty grading. Stepwise linear regression confirmed the system's validity, with R² values indicating large effect sizes, demonstrating its predictive efficacy for essay scores and text levels.
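The screening logic can be illustrated with a simplified sketch. It is not the study's procedure: it swaps in a plain Pearson-correlation filter for predictive power and a greedy pairwise check for multicollinearity, and it does not perform stepwise regression. The thresholds `min_r` and `max_inter_r`, the indicator names, and the data are all invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def screen_indicators(features, scores, min_r=0.3, max_inter_r=0.8):
    """Keep indicators related to the scores, dropping near-duplicates.

    Indicators are visited from strongest to weakest correlation with
    the scores; one is kept only if it is not highly correlated with
    any indicator already kept. Both thresholds are assumed values.
    """
    ranked = sorted(
        ((abs(pearson(values, scores)), name)
         for name, values in features.items()),
        reverse=True,
    )
    kept = []
    for r, name in ranked:
        if r < min_r:
            continue  # too weakly related to the target
        if all(abs(pearson(features[name], features[k])) < max_inter_r
               for k in kept):
            kept.append(name)
    return kept


# Invented indicator values for five texts and their scores.
scores = [1, 2, 3, 4, 5]
features = {
    "level_ratio": [10, 20, 30, 40, 50],
    "level_ratio_dup": [10, 20, 30, 40, 55],  # nearly redundant
    "density": [1, 3, 2, 2, 4],
    "noise": [3, 1, 3, 1, 3],                 # unrelated to scores
}
selected = screen_indicators(features, scores)
```

With these toy values, the near-duplicate and the unrelated indicator are dropped, leaving `level_ratio` and `density`. A real analysis would more likely use variance inflation factors and a regression library rather than this pairwise check.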
This section introduces CLRA, an automated lexical richness analysis tool built with Python and Tkinter. Through a GUI, it offers text annotation, vocabulary list generation, and metric analysis. Application guidelines are provided for different research subjects, including L2 learner essays, L2 textbooks, and native-level texts. The findings suggest that lexical richness metrics stabilize once a text exceeds 300 characters, so texts of at least that length are recommended for reliable analysis.
This study designed quantitative metrics for Chinese lexical richness across complexity, diversity, density, and length, constructing a knowledge base and enabling automated extraction. A refined indicator system was validated in practical applications, supported by the CLRA tool. Future work will optimize the metric system and software, explore native-speaker lexical proficiency assessment, investigate spoken-written modality differences, and leverage large language models for text generation and simplification, advancing digital resources for international Chinese education.