Introduction
The CAP Babel Project uses the Comparative Agendas Project policy codebook’s major topics to identify the policy areas of texts. (Please let us know if you have suitable training data for minor topics and we will be happy to add that task to the service as well.)
The codes assigned are unequivocal, mutually exclusive, and cover all potential policy issues. The codebook distinguishes 21 major policy areas: macroeconomics, civil rights, health, agriculture, labour, education, environment, energy, immigration, transportation, law and crime, social welfare, housing, domestic commerce, defence, technology, foreign trade, international affairs, government operations, public lands, and culture.
The 11 CAP domains we are using:
- Media
- Social media
- Parliamentary speech (oral questions, interpellations, bill debates, other plenary speeches, urgent questions, pre-agenda speeches, post-agenda speeches)
- Legislative documents (bills, laws, motions, legislative decrees, hearings, resolutions, other)
- Executive speech
- Executive orders
- Party programs
- Judiciary
- Budget
- Public opinion
- Local government (agenda items, regulations)
Differences compared to the CAP website domains:
- We added "Social media".
- We separated "Parliamentary & Legislative" into two domains: Parliamentary speech and Legislative documents.
- We separated "Prime Minister & Executive" into two domains: Executive speech and Executive orders.
Data per domain as of 30 August 2023:

| Domain | Count | Domain | Count |
|---|---|---|---|
| budget | 54829 | execorder | 51436 |
| execspeech | 167548 | judiciary | 7334 |
| legislative | 655777 | media | 299578 |
| parlspeech | 741638 | party | 131967 |
| publicopinion | 2343 | social | 13954 |
Most of the language models that the CAP Babel Machine uses were fine-tuned on training data containing the label 'None' in addition to the 21 CAP major policy topics, indicating that the given text contains no relevant policy content. We use the label 999 for these cases. Note that some of the models (e.g., Danish legislative, Dutch media) do not recognize this category and thus cannot predict whether a row has no policy content.
We have language-specific models for the following languages: Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Slovak, and Spanish. We nevertheless encourage you to submit datasets in languages not on this list, as the multilingual nature of large language models means the results may still be useful.
You can upload your datasets here for automated CAP-coding. If you wish to submit multiple datasets one after another, please wait 5-10 minutes between submissions. There are two upload options: pre-coded datasets and non-coded datasets. An explanation of the form and the dataset requirements is available here.
The upload requires filling in the following form with metadata about the dataset. We kindly ask you to upload your dataset and, in the case of a pre-coded dataset, to attach the codebook you used alongside it, if available.
Non-coded datasets must contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables beyond the compulsory ones in the columns following them.
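As a minimal sketch of what such a file could look like (the file name, texts, and extra column below are purely illustrative, not a prescribed schema), a non-coded dataset can be assembled with pandas:

```python
import pandas as pd

# Minimal non-coded dataset: only 'id' and 'text' are compulsory;
# any supplementary columns may follow them.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "The parliament debated the new education budget.",
        "The minister announced stricter emission limits for power plants.",
        "Weekend weather forecast for the capital.",
    ],
    "source": ["plenary", "plenary", "media"],  # optional supplementary variable
})

# Writing without the index keeps the column names in row 1, as required.
df.to_csv("my_noncoded_dataset.csv", index=False)
```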
Pre-coded datasets must contain the following columns: id, year, major_topic, text. The column names must be in row 1. Uploading a pre-coded sample is optional, but it helps us calculate performance metrics and fine-tune the language models behind the CAP Babel Machine. The detailed validation rules are available here. The mandatory data format of major_topic is numeric: all textual CAP categories must be converted to the appropriate numeric code before uploading, and records with no policy content should be coded with 999. You are free to add supplementary variables beyond the compulsory ones in the columns following them. Automatic processing requires following these rules.
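For illustration, a pre-coded dataset could be prepared along the following lines. The textual-to-numeric mapping uses the standard CAP master codebook numbering, but please verify it against the codebook you actually coded with; the file names are hypothetical.

```python
import pandas as pd

# Standard CAP major topic codes (verify against your own codebook);
# 999 marks records with no policy content.
CAP_CODES = {
    "Macroeconomics": 1, "Civil Rights": 2, "Health": 3, "Agriculture": 4,
    "Labor": 5, "Education": 6, "Environment": 7, "Energy": 8,
    "Immigration": 9, "Transportation": 10, "Law and Crime": 12,
    "Social Welfare": 13, "Housing": 14, "Domestic Commerce": 15,
    "Defense": 16, "Technology": 17, "Foreign Trade": 18,
    "International Affairs": 19, "Government Operations": 20,
    "Public Lands": 21, "Culture": 23, "None": 999,
}

df = pd.read_csv("my_precoded_dataset.csv")  # hypothetical input file

# Convert textual categories to the required numeric codes.
if df["major_topic"].dtype == object:
    df["major_topic"] = df["major_topic"].map(CAP_CODES)

# Basic pre-upload checks: mandatory columns present, all codes known.
assert {"id", "year", "major_topic", "text"}.issubset(df.columns)
assert df["major_topic"].isin(CAP_CODES.values()).all()

df.to_csv("my_precoded_dataset_numeric.csv", index=False)
```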
After you upload your dataset and your file is successfully processed, you will receive the CAP-coded dataset and a file (both in CSV format) that includes the three highest-probability category predictions by the CAP Babel model and the corresponding probability (softmax) scores assigned to each label. Please be aware that interpreting softmax scores as absolute model confidences could lead to false assumptions about model performance. The model prediction results are deterministic, so metrics such as the reliability coefficient do not apply.
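As an illustration of working with the returned predictions, the snippet below assumes columns such as label_1/score_1 for the top prediction and label_2/score_2 for the runner-up; these names are our assumption, so check the header of the file you actually receive.

```python
import pandas as pd

pred = pd.read_csv("my_dataset_predictions.csv")  # hypothetical output file

# Keep the highest-probability label, treating the softmax score as a
# relative ranking signal rather than an absolute confidence.
top = pred[["id", "label_1", "score_1"]].rename(
    columns={"label_1": "predicted_major_topic", "score_1": "softmax_score"}
)

# Flag rows where the top label barely outranks the runner-up; these may
# deserve manual review before further analysis.
close_calls = pred[(pred["score_1"] - pred["score_2"]) < 0.10]
print(len(close_calls), "rows with a narrow margin between the top two labels")
```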
If the files you would like to upload are larger than 1 GB, we suggest that you split your dataset into multiple parts.
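One simple way to do this is to read the file in chunks and write each chunk as a separate part; the chunk size below is arbitrary, so adjust it until each part stays under 1 GB.

```python
import pandas as pd

# Stream the large CSV in chunks and write each chunk as its own part,
# repeating the header row (column names in row 1) in every output file.
for i, chunk in enumerate(pd.read_csv("large_dataset.csv", chunksize=500_000), start=1):
    chunk.to_csv(f"large_dataset_part{i:02d}.csv", index=False)
```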
If you have any questions or feedback regarding the CAP Babel Machine, please let us know using our contact form. Please keep in mind that we can only get back to you on Hungarian business days.
Submit a dataset:
The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 programme under grant agreement no. 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).
HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:
Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434