Abstract
Text-as-data methods have become increasingly important in political science, giving researchers an empirical basis for analyzing the positions and behaviors of political figures. A common approach is to read the content of texts and assign labels suited to the research question. This approach, however, typically depends on human labeling, which is expensive and time-consuming. Numerous studies have explored automated labeling, but most have focused on Germanic languages such as English and German, and research on Chinese-language corpora, particularly in political science, remains scant. To bridge this gap, we fine-tuned a pre-trained language model on a human-annotated dataset of oral questions posed by the councilors of Kaohsiung City. The dataset contains 9,904 text samples labeled according to the scheme of Maricut-Akbik (2021), which sorts the content of councilors’ questions into four levels of intensity: requesting information, requesting justifications, requesting changes in policy, and sanctioning. Our model achieves an overall accuracy of 76% on the test set, and three of the four labels achieve F1 scores above 80%. We believe these results will be useful to scholars interested in analyzing legislative or political speeches in Chinese.
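For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how overall accuracy and per-label F1 can be computed for a four-label task. It is not the evaluation code used in this work. The function names `accuracy` and `per_label_f1` and the eight toy samples are assumptions made purely for illustration; only the four label names come from the abstract.

```python
from collections import Counter

# The four intensity levels named in the abstract.
LABELS = [
    "requesting information",
    "requesting justifications",
    "requesting changes in policy",
    "sanctioning",
]


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_f1(y_true, y_pred, labels=LABELS):
    """F1 for each label, treating that label as the positive class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label was wrong: false positive for it
            fn[t] += 1  # gold label was missed: false negative for it
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical toy data (NOT drawn from the Kaohsiung dataset).
gold = [LABELS[0], LABELS[0], LABELS[1], LABELS[1],
        LABELS[2], LABELS[2], LABELS[3], LABELS[3]]
pred = [LABELS[0], LABELS[0], LABELS[1], LABELS[2],
        LABELS[2], LABELS[2], LABELS[3], LABELS[0]]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(gold, pred):.2f}")
    for label, score in per_label_f1(gold, pred).items():
        print(f"{label}: F1 = {score:.2f}")
```

Because F1 is computed separately for each label, a single weak label can lower overall accuracy while the remaining labels still score well.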
Technical Skills: Python (PyTorch, Hugging Face), LaTeX
Our model is available for exploration and testing via the link provided below.