
Guardrails

This page displays the guardrails currently in place to restrict the use of this web tool. When a request is submitted, the treatment is sent to a large language model using the prompts below, and the model's response is used to decide whether to block the request. We welcome feedback on whether these guardrails are too permissive or too restrictive.

Treatment guardrail prompt

I will tell you a post written by a user, and I want to know whether it should be flagged for moderation.

Examples of reasons to flag posts are:
- Posts with socially-harmful misleading content. For example, any post which implies that climate change is not a real concern, that vaccines are unsafe/ineffective, or which may reduce trust in elections, should all be flagged for moderation.
- Posts that incite violence or exclusionary attitudes. For example, any post which encourages discrimination by race, gender, sexuality, which vilifies some group of people such as immigrants, or which otherwise would increase political polarisation / support for partisan violence/etc, should all be flagged for moderation.
- Posts that encourage support for any political parties or candidates

Below is a post. Please answer 0 if the post is okay, or 1 if it should be flagged for moderation:

{TREATMENT_TEXT}
(The request is blocked if the model returns "1")
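
The flow described above (fill the placeholder, send the prompt to the model, block on "1") can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the names `build_prompt`, `is_blocked` and `ask_model` are hypothetical, the template is abbreviated to its final instruction, and stripping whitespace from the model's response is an assumption.

```python
from typing import Callable

# Abbreviated stand-in for the full prompt shown above; only the final
# instruction and the {TREATMENT_TEXT} placeholder are reproduced here.
GUARDRAIL_TEMPLATE = (
    "Below is a post. Please answer 0 if the post is okay, "
    "or 1 if it should be flagged for moderation:\n\n"
    "{TREATMENT_TEXT}"
)


def build_prompt(treatment_text: str) -> str:
    """Substitute the submitted treatment into the template."""
    # str.replace is used so that braces inside the treatment are left alone.
    return GUARDRAIL_TEMPLATE.replace("{TREATMENT_TEXT}", treatment_text)


def is_blocked(treatment_text: str, ask_model: Callable[[str], str]) -> bool:
    """Return True if the model's answer is "1".

    `ask_model` is a hypothetical callable that sends a prompt to a large
    language model and returns its text response. Trimming whitespace
    before comparing is an assumption, not documented behaviour.
    """
    response = ask_model(build_prompt(treatment_text))
    return response.strip() == "1"


if __name__ == "__main__":
    # Stub model for demonstration only: flags any post mentioning "vote for".
    def stub_model(prompt: str) -> str:
        return "1" if "vote for" in prompt.lower() else "0"

    print(is_blocked("Please vote for candidate X", stub_model))
    print(is_blocked("Remember to recycle your bottles", stub_model))
```

In this sketch any response other than "1" lets the request through; how the real tool handles unexpected model output is not stated on this page.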