CoPE-A: The COntent Policy Evaluator Model

Model Overview

CoPE-A-9B is a special-purpose content policy evaluation model that labels content according to developer-customizable criteria. Built by trust & safety veterans, CoPE-A offers steerable, precise policy interpretation in a small, efficient model.

Key Features

  • Policy-adaptive content evaluation
  • High-accuracy binary classification
  • Steerable (no fixed taxonomy/definitions)
  • Validated across multiple harm categories

Getting Started

For Developers

  • Download the google/gemma-2-9b base model
  • Merge it with the CoPE adapter (i.e., this model)
  • Perform inference using the prompt template below
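
The three steps above can be sketched with the `transformers` and `peft` libraries. This is a minimal sketch, assuming that this repository (`zentropi-ai/cope-a-9b`) loads as a standard PEFT LoRA adapter and that you are authenticated for both gated repositories. The function names `load_cope_model` and `classify`, the bfloat16 dtype, and the `device_map="auto"` setting (which needs `accelerate`) are illustrative choices, not documented settings.

```python
BASE_MODEL_ID = "google/gemma-2-9b"
ADAPTER_ID = "zentropi-ai/cope-a-9b"  # assumption: PEFT-format LoRA adapter


def load_cope_model(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model, apply the adapter, and merge the LoRA weights."""
    # Imported lazily so this module imports without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
    model.eval()
    return tokenizer, model


def classify(tokenizer, model, prompt: str) -> str:
    """Greedy-decode a single new token and return it as text."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

With a prompt that follows the documented input format, `classify` is expected to return the string "0" or "1".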

Try the Demo

  • Test out a live version of the model in this HF Space

Technical Specifications

Model Architecture

CoPE-A-9B is a ~9B-parameter model built on Google's Gemma-2-9b using LoRA fine-tuning. This architecture enables efficient deployment while maintaining high accuracy in content policy evaluation tasks.

Input Format

The model accepts input in a structured format:

INSTRUCTIONS
============

Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.


POLICY
======

# Policy

## Overview

[Short statement of what subject matter the policy is meant to classify]

## Definition of Terms

- [Definition Term 1]: Precise definition of a specific word or phrase that will be used repeatedly in the policy.
- Etc.

## Interpretation of Language

- [Instructions on how to handle ambiguous language for the purposes of this policy]
- [Other relevant guidance for how to go about applying this specific policy]
- Etc.

## Definition of Labels

### (HS): Hate Speech

#### Includes

- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to qualify for this label. These can, but do not need to, include caveats specific to this criterion.
- Etc.

#### Excludes

- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to NOT qualify for this label. These can, but do not need to, include caveats specific to this criterion.
- Etc.


CONTENT
=======

{content_text}


ANSWER
======

See our demo for an example of a full policy.
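
The template above can be assembled programmatically. The following is a minimal sketch: the helper names `build_prompt` and `_section` are hypothetical, and the exact whitespace after the final `ANSWER` header is an assumption, since the template does not show what follows it.

```python
INSTRUCTIONS = (
    "Examine the given POLICY and determine if the given CONTENT meets the "
    'criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.'
)


def _section(title: str, body: str) -> str:
    """Render a header underlined with '=' of matching length, then the body."""
    return f"{title}\n{'=' * len(title)}\n\n{body}"


def build_prompt(policy: str, content: str) -> str:
    """Fill the documented template with a policy (markdown text) and content."""
    parts = [
        _section("INSTRUCTIONS", INSTRUCTIONS),
        _section("POLICY", policy.strip()),
        _section("CONTENT", content.strip()),
    ]
    # Sections are separated by two blank lines, as in the template.
    # Assumption: the prompt ends with the ANSWER header and one blank line.
    return "\n\n\n".join(parts) + "\n\n\nANSWER\n======\n\n"


if __name__ == "__main__":
    demo_policy = "# Policy\n\n## Overview\n\n[Short statement of the subject matter]"
    print(build_prompt(demo_policy, "Example content to classify."))
```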

Output Format

CoPE-A provides binary classification outputs:

  • 0: None of the policy labels apply
  • 1: One or more policy labels apply
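
Downstream code usually wants a boolean rather than a raw string. A minimal sketch of a parser follows; the name `parse_label` is hypothetical, and raising an error on anything other than "0" or "1" is a design choice of this example, not documented model behavior.

```python
def parse_label(raw: str) -> bool:
    """Map the model's text output to True (a label applies) or False (none apply)."""
    text = raw.strip()
    if text.startswith("1"):
        return True
    if text.startswith("0"):
        return False
    # Fail loudly instead of guessing when the output is malformed.
    raise ValueError(f"Unexpected classifier output: {raw!r}")
```

Failing loudly on unexpected output keeps malformed generations from being silently counted as "no violation".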

System Requirements

  • Compatible with commodity GPU hardware
  • Sub-second inference time without any optimization
  • Deployment requirements match Gemma-2-9b base model specifications
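
As a rough guide to what "commodity GPU hardware" means here, the memory needed for the weights alone can be estimated from the parameter count. This is back-of-the-envelope arithmetic, assuming the nominal 9 billion parameters quoted above. It ignores activations, the KV cache, and framework overhead, so real requirements are higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 9e9  # assumption: the "~9B" figure, not an exact count

if __name__ == "__main__":
    for dtype, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
        print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, nbytes):.1f} GiB")
```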

Training Details

Training Methodology

The model draws upon research into designing more pluralistic policy interpreters. It employs a novel training methodology (see research talk) that moves beyond simple policy memorization and achieves true policy interpretation capabilities by:

  • Training across conflicting policy formulations
  • Focusing on generalizable policy understanding
  • Emphasizing interpretation consistency

Training Data

CoPE-A's training dataset was carefully curated to ensure robust policy interpretation:

  • Approximately 60,000 labels across unique policy / content pairs
  • Policy texts created by the CoPE team
  • Content data sourced from publicly-accessible internet forums
  • Combined automated and manual processes to produce golden labels
  • Coverage across multiple harm areas, including but not limited to:
    • Hate speech
    • Sexual content
    • Self-harm
    • Harassment
    • Toxicity

Performance Evaluation

Methodology

Evaluating policy interpreters is difficult because most benchmark datasets do not publish the exact policy criteria given to labelers. As such, with the exception of the Ethos hate speech benchmark, we validated our model largely on internal datasets. To make our analyses more rigorous, we evaluated only on unique policies and unique content samples that were held out from our training corpus.

It was also difficult to evaluate certain peer models when the harm areas in question were outside those models’ taxonomies. We have accordingly dropped them from the relevant data tables as they performed very poorly (as expected). CoPE, by contrast, has no fixed taxonomy and can accept any arbitrary content policy, similar to foundation models like GPT-4o, which we were able to evaluate in all areas.

Benchmark Results

Hate Speech Classification

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 89% | 93% | 91% |
| GPT-4o | 97% | 78% | 87% |
| Llama-3.1-8B | 59% | 96% | 73% |
| LlamaGuard3-8B | 88% | 64% | 74% |
| ShieldGemma-9B | 68% | 98% | 80% |

Hate Speech - Ethos Benchmark (External Set)

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 80% | 88% | 84% |
| GPT-4o | 91% | 78% | 84% |
| Llama-3.1-8B | 62% | 92% | 74% |
| LlamaGuard3-8B | 87% | 79% | 83% |
| ShieldGemma-9B | 82% | 82% | 82% |

Toxic Speech Classification

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 93% | 87% | 90% |
| GPT-4o | 64% | 89% | 75% |
| Llama-3.1-8B | 33% | 96% | 49% |

Note: Toxicity is outside the ShieldGemma & LlamaGuard taxonomy

Sexual Content Classification

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 96% | 83% | 89% |
| GPT-4o | 95% | 72% | 82% |
| Llama-3.1-8B | 48% | 95% | 64% |
| LlamaGuard3-8B | 100% | 43% | 60% |
| ShieldGemma-9B | 96% | 76% | 85% |

Self-Harm Content Classification

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 83% | 93% | 88% |
| GPT-4o | 84% | 93% | 88% |
| Llama-3.1-8B | 56% | 96% | 70% |
| LlamaGuard3-8B | 65% | 84% | 73% |
| ShieldGemma-9B | 69% | 89% | 78% |

Harassment Classification

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CoPE-A-9B | 69% | 78% | 73% |
| GPT-4o | 100% | 17% | 30% |
| Llama-3.1-8B | 35% | 87% | 50% |
| ShieldGemma-9B | 49% | 55% | 52% |

Note: Harassment is outside the LlamaGuard taxonomy
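
For readers cross-checking the tables, F1 is the harmonic mean of precision and recall. The short sketch below recomputes F1 for the CoPE-A-9B rows from the published precision and recall. Because those inputs are already rounded to whole percentages, recomputed values for some other rows may differ from the published F1 by a point.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for CoPE-A-9B, copied from the tables above.
COPE_A_ROWS = {
    "Hate Speech": (0.89, 0.93),
    "Hate Speech (Ethos)": (0.80, 0.88),
    "Toxic Speech": (0.93, 0.87),
    "Sexual Content": (0.96, 0.83),
    "Self-Harm": (0.83, 0.93),
    "Harassment": (0.69, 0.78),
}

if __name__ == "__main__":
    for area, (p, r) in COPE_A_ROWS.items():
        print(f"{area}: F1 = {f1_score(p, r):.0%}")
```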

Performance Analysis

In the benchmark results above, CoPE-A achieves the highest or tied-highest F1 score in every evaluated harm area, including hate speech, which is typically one of the most subjective and difficult. Its precision and recall are also relatively balanced, making it suitable for production deployment in at-scale content moderation and content classification systems.

Intended Applications

Primary Use Cases

  1. Content Labeling
    • Real-time content moderation
    • Batch processing of content
  2. LLM Guardrails
    • Input prompt risk assessment
    • Output answer risk assessment
    • NB: Not yet optimized for chat format
  3. Content Scoring
    • Feature generation for social feed ranking
    • Content quality assessment & measurement
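
The LLM guardrail use case (assessing both the input prompt and the output answer) can be sketched as a thin wrapper. This is an illustration only: `guarded_reply`, `generate`, `violates`, and the refusal text are hypothetical names. In practice `violates` would wrap a call to CoPE-A with your policy, and `generate` would call the LLM being guarded.

```python
from typing import Callable


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],   # hypothetical: the LLM being guarded
    violates: Callable[[str], bool],  # hypothetical: True if a policy label applies
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Check the prompt before generation and the answer after it."""
    if violates(prompt):
        return refusal            # input prompt risk assessment
    answer = generate(prompt)
    if violates(answer):
        return refusal            # output answer risk assessment
    return answer
```

Checking both sides means a benign prompt that elicits a violating answer is still caught.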

Prohibited Uses

  • Surveillance applications
  • Applications requiring external fact verification
  • Use cases beyond stated technical limitations

Limitations and Constraints

Current Limitations

  • Text Processing: Limited to 8K tokens (combined policy and content)
  • Language Support: Currently optimized for US English only. Performance will degrade for other languages and locales.
  • Knowledge Constraints: Cannot make classifications requiring external verification (e.g., misinformation) unless explicitly defined in the provided policy
  • Scope: Binary classification only (i.e., presence/absence of matching labels)
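
Because the limit applies to the policy and content combined, it can help to check the assembled prompt before inference. A minimal sketch, assuming "8K" means 8,192 tokens and reserving one token for the answer; `fits_budget` and its parameters are hypothetical names, and `tokenize` can be any function that returns a token sequence.

```python
from typing import Callable, Sequence

MAX_TOKENS = 8192  # assumption: "8K" interpreted as 8,192 tokens


def fits_budget(
    prompt: str,
    tokenize: Callable[[str], Sequence],
    max_tokens: int = MAX_TOKENS,
    reserve: int = 1,  # room for the single answer token
) -> bool:
    """Return True if the full prompt plus the reserved answer tokens fit."""
    return len(tokenize(prompt)) + reserve <= max_tokens
```

With a Hugging Face tokenizer, `tokenize` could be `lambda s: tokenizer(s)["input_ids"]`. Prompts that fail the check need a shorter policy or truncated content.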

Ethical Considerations

Bias and Fairness

While comprehensive bias evaluation is still ongoing, users should:

  • Implement careful policy design to mitigate potential biases
  • Monitor classification patterns across different demographic groups
  • Contribute problematic examples to our bias assessment efforts
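
One simple way to monitor classification patterns across groups is to compare the share of content flagged per group. The sketch below assumes you already have a group annotation for each classified item, which CoPE-A does not provide; `flag_rate_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def flag_rate_by_group(records: Iterable[Tuple[Hashable, bool]]) -> Dict[Hashable, float]:
    """Compute the fraction of items flagged (label 1) within each group.

    `records` yields (group, flagged) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for group, flagged in records:
        counts[group][0] += int(flagged)
        counts[group][1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}
```

A large gap in flag rates between groups does not prove bias on its own, but it is a useful signal for deciding which samples to send for human review.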

Safety Measures

The model's binary classification nature inherently limits certain risks, but users should:

  • Maintain appropriate human oversight
  • Regularly audit classification decisions
  • Implement robust observability systems

Maintenance and Updates

Update Schedule

  • Quarterly releases planned
  • Regular performance improvements
  • Community-driven feature enhancements

Future Roadmap Focus

  1. Expansion to new harm areas
  2. Language and locale support
  3. Multi-modality (e.g., images)
  4. Performance optimizations
  5. Novel evaluation benchmarks

Community and Support

General Resources

For any technical questions or comments, please join our HuggingFace community forum. You can share your feedback, suggest new areas, or pick our brains about anything. If you’d prefer a more private discussion, you can also submit your feedback via this form.

Pilot Partner Program

We are currently accepting a limited number of pilot partners to test real-world deployments of CoPE. Partners receive early access to the model weights as well as technical training and support with custom policy development and fine-tuning. If you are interested in joining our trusted partner program, contact us at ([email protected]).

About the Developer

CoPE-A is developed and maintained by Zentropi, an AI Trust & Safety company focused on making content classification simple. The project represents a collaborative effort between industry experts and academic researchers to advance the state of the art in content labeling technology.


Last Updated: December 19, 2024
