Introduction
The CAP Babel Project uses the Comparative Agendas Project policy codebook’s major topics to identify the policy areas of texts. (Please let us know if you have suitable training data for minor topics and we will be happy to add that task to the service as well.)
The codes assigned are unequivocal, mutually exclusive, and cover all potential policy issues. The codebook distinguishes 21 major policy areas: macroeconomics, civil rights, health, agriculture, labour, education, environment, energy, immigration, transportation, law and crime, social welfare, housing, domestic commerce, defence, technology, foreign trade, international affairs, government operations, public lands, and culture.
The 11 CAP domains we are using:
- Media
- Social media
- Parliamentary speech (oral questions, interpellations, bill debates, other plenary speeches, urgent questions, pre-agenda speeches, post-agenda speeches)
- Legislative documents (bills, laws, motions, legislative decrees, hearings, resolutions, other)
- Executive speech
- Executive orders
- Party programs
- Judiciary
- Budget
- Public opinion
- Local government (agenda items, regulations)
Differences compared to the CAP website domains:
- We added "Social media".
- We separated "Parliamentary & Legislative" into two domains: Parliamentary speech and Legislative documents.
- We separated "Prime Minister & Executive" into two domains: Executive speech and Executive orders.
Data per domain as of 30 August 2023:

| Domain | Count | Domain | Count |
|---|---|---|---|
| budget | 54829 | execorder | 51436 |
| execspeech | 167548 | judiciary | 7334 |
| legislative | 655777 | media | 299578 |
| parlspeech | 741638 | party | 131967 |
| publicopinion | 2343 | social | 13954 |
Most of the language models that the CAP Babel Machine uses were fine-tuned on training data containing the label 'None' in addition to the 21 CAP major policy topics, indicating that the given text contains no relevant policy content. We use the label 999 for these cases. Note that some of the models (e.g., Danish legislative, Dutch media) do not recognize this category and thus cannot predict whether a row has no policy content.
We have language-specific models for the following languages: Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Slovak, and Spanish. We nevertheless encourage you to submit datasets in languages not on this list, as the multilingual nature of large language models means the results may still be useful.
You can upload your datasets here for automated CAP-coding. If you wish to submit multiple datasets one after another, please wait 5-10 minutes between submissions. There are two upload options: pre-coded datasets and non-coded datasets. An explanation of the form and the dataset requirements is available here.
The upload requires filling in the following form with metadata about the dataset. We kindly ask you to upload your dataset and, in the case of a pre-coded dataset, to attach the codebook you used alongside it, if available.
Non-coded datasets must contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables beyond the compulsory ones in the columns following them.
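As a minimal sketch of what such a file could look like (the file name, texts, and extra column below are purely illustrative, not a prescribed schema), a non-coded dataset can be assembled with pandas:

```python
import pandas as pd

# Minimal non-coded dataset: only 'id' and 'text' are compulsory;
# any supplementary columns may follow them.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "The parliament debated the new education budget.",
        "The minister announced stricter emission limits for power plants.",
        "Weekend weather forecast for the capital.",
    ],
    "source": ["plenary", "plenary", "media"],  # optional supplementary variable
})

# Writing without the index keeps the column names in row 1, as required.
df.to_csv("my_noncoded_dataset.csv", index=False)
```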
Pre-coded datasets must contain the following columns: id, year, major_topic, text. The column names must be in row 1. Uploading a pre-coded sample is optional, but it helps us calculate performance metrics and fine-tune the language models behind the CAP Babel Machine. The detailed validation rules are available here. The mandatory data format of major_topic is numeric: all textual CAP categories must be converted to the appropriate numeric code before uploading, and records with no policy content should be coded with 999. You are free to add supplementary variables beyond the compulsory ones in the columns following them. Automatic processing requires following these rules.
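For illustration, a pre-coded dataset could be prepared along the following lines. The textual-to-numeric mapping uses the standard CAP master codebook numbering, but please verify it against the codebook you actually coded with; the file names are hypothetical.

```python
import pandas as pd

# Standard CAP major topic codes (verify against your own codebook);
# 999 marks records with no policy content.
CAP_CODES = {
    "Macroeconomics": 1, "Civil Rights": 2, "Health": 3, "Agriculture": 4,
    "Labor": 5, "Education": 6, "Environment": 7, "Energy": 8,
    "Immigration": 9, "Transportation": 10, "Law and Crime": 12,
    "Social Welfare": 13, "Housing": 14, "Domestic Commerce": 15,
    "Defense": 16, "Technology": 17, "Foreign Trade": 18,
    "International Affairs": 19, "Government Operations": 20,
    "Public Lands": 21, "Culture": 23, "None": 999,
}

df = pd.read_csv("my_precoded_dataset.csv")  # hypothetical input file

# Convert textual categories to the required numeric codes.
if df["major_topic"].dtype == object:
    df["major_topic"] = df["major_topic"].map(CAP_CODES)

# Basic pre-upload checks: mandatory columns present, all codes known.
assert {"id", "year", "major_topic", "text"}.issubset(df.columns)
assert df["major_topic"].isin(CAP_CODES.values()).all()

df.to_csv("my_precoded_dataset_numeric.csv", index=False)
```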
After you upload your dataset and your file is successfully processed, you will receive the CAP-coded dataset and a file (both in CSV format) that includes the three highest-probability category predictions by the CAP Babel model and the corresponding probability (softmax) scores assigned to each label. Please be aware that interpreting softmax scores as absolute model confidences could lead to false assumptions about model performance. The model prediction results are deterministic, so metrics such as the reliability coefficient do not apply.
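As an illustration of working with the returned predictions, the snippet below assumes columns such as label_1/score_1 for the top prediction and label_2/score_2 for the runner-up; these names are our assumption, so check the header of the file you actually receive.

```python
import pandas as pd

pred = pd.read_csv("my_dataset_predictions.csv")  # hypothetical output file

# Keep the highest-probability label, treating the softmax score as a
# relative ranking signal rather than an absolute confidence.
top = pred[["id", "label_1", "score_1"]].rename(
    columns={"label_1": "predicted_major_topic", "score_1": "softmax_score"}
)

# Flag rows where the top label barely outranks the runner-up; these may
# deserve manual review before further analysis.
close_calls = pred[(pred["score_1"] - pred["score_2"]) < 0.10]
print(len(close_calls), "rows with a narrow margin between the top two labels")
```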
If the files you would like to upload are larger than 1 GB, we suggest that you split your dataset into multiple parts.
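One simple way to do this is to read the file in chunks and write each chunk as a separate part; the chunk size below is arbitrary, so adjust it until each part stays under 1 GB.

```python
import pandas as pd

# Stream the large CSV in chunks and write each chunk as its own part,
# repeating the header row (column names in row 1) in every output file.
for i, chunk in enumerate(pd.read_csv("large_dataset.csv", chunksize=500_000), start=1):
    chunk.to_csv(f"large_dataset_part{i:02d}.csv", index=False)
```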
If you have any questions or feedback regarding the CAP Babel Machine, please let us know using our contact form. Please keep in mind that we can only get back to you on Hungarian business days.
Submit a dataset:
The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 programme under grant agreement no. 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).
HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:
Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434