Lighteval documentation

Tasks

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.7.0).

LightevalTask

LightevalTaskConfig

class lighteval.tasks.lighteval_task.LightevalTaskConfig

( name: str prompt_function: typing.Callable[[dict, str], lighteval.tasks.requests.Doc | None] hf_repo: str hf_subset: str metric: list[lighteval.metrics.utils.metric_utils.Metric | lighteval.metrics.metrics.Metrics] | tuple[lighteval.metrics.utils.metric_utils.Metric | lighteval.metrics.metrics.Metrics, ...] hf_revision: typing.Optional[str] = None hf_filter: typing.Optional[typing.Callable[[dict], bool]] = None hf_avail_splits: typing.Union[list[str], tuple[str, ...], NoneType] = <factory> trust_dataset: bool = False evaluation_splits: list[str] | tuple[str, ...] = <factory> few_shots_split: typing.Optional[str] = None few_shots_select: typing.Optional[str] = None generation_size: typing.Optional[int] = None generation_grammar: typing.Optional[huggingface_hub.inference._generated.types.text_generation.TextGenerationInputGrammarType] = None stop_sequence: typing.Union[list[str], tuple[str, ...], NoneType] = None num_samples: typing.Optional[list[int]] = None suite: list[str] | tuple[str, ...] = <factory> original_num_docs: int = -1 effective_num_docs: int = -1 must_remove_duplicate_docs: bool = False version: int = 0 )

Parameters

  • name (str) — Short name of the evaluation task.
  • suite (list[str]) — Evaluation suites to which the task belongs.
  • prompt_function (Callable[[dict, str], Doc | None]) — Function used to create a Doc sample from each line of the evaluation dataset.
  • hf_repo (str) — Path of the Hub dataset repository containing the evaluation information.
  • hf_subset (str) — Subset used for the current task. Falls back to the default subset if none is selected.
  • hf_avail_splits (list[str]) — All the available splits in the evaluation dataset.
  • evaluation_splits (list[str]) — List of the splits actually used for this evaluation.
  • few_shots_split (str) — Name of the split from which to sample few-shot examples.
  • few_shots_select (str) — Method with which to sample few-shot examples.
  • generation_size (int) — Maximum allowed size of the generation.
  • generation_grammar (TextGenerationInputGrammarType) — The grammar to generate completion according to. Currently only available for TGI and Inference Endpoint models.
  • metric (list[Metric | Metrics]) — List of all the metrics for the current task.
  • stop_sequence (list[str]) — Stop sequences that interrupt the generation for generative metrics.
  • original_num_docs (int) — Number of documents in the task.
  • effective_num_docs (int) — Number of documents used in a specific evaluation.
  • truncated_num_docs (bool) — Whether fewer than the total number of documents were used.
  • trust_dataset (bool) — Whether to trust the dataset at execution or not.
  • version (int) — The version of the task. Defaults to 0. Can be increased if the underlying dataset or the prompt changes.

Stored configuration of a given LightevalTask.
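
As a rough illustration of the prompt_function contract listed above (a callable that receives one dataset line and the task name, and returns a Doc or None), the sketch below uses a stand-in SketchDoc dataclass instead of lighteval's real Doc. The field names (query, choices, gold_index) and the dataset column names ("question", "answer") are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchDoc:
    """Hypothetical stand-in for lighteval's Doc; field names are assumptions."""

    task_name: str
    query: str
    choices: list
    gold_index: int


def yes_no_prompt(line: dict, task_name: str) -> Optional[SketchDoc]:
    """Turn one dataset row into a document, or skip the row by returning None."""
    # Assumed column names: "question" (str) and "answer" (bool).
    if not line.get("question"):
        return None
    return SketchDoc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[" no", " yes"],
        gold_index=int(line["answer"]),
    )


doc = yes_no_prompt({"question": "Is the sun hot?", "answer": True}, "boolq")
```

Returning None for unusable rows mirrors the `Doc | None` return type in the signature; how such rows are handled downstream is not covered by this sketch.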

LightevalTask

class lighteval.tasks.lighteval_task.LightevalTask

( name: str cfg: LightevalTaskConfig cache_dir: typing.Optional[str] = None )

aggregation

( )

Returns a dict mapping each metric name to its aggregation function, for all metrics of the task.
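
To make the shape of such a mapping concrete, here is a minimal sketch in plain Python. The metric names and per-sample scores are invented for the example, and the reduction step is only one plausible way a caller might use the returned dict.

```python
from statistics import mean

# Hypothetical metric names, each mapped to the function that reduces
# per-sample scores to a single corpus-level number.
aggregation = {"exact_match": mean, "max_score": max}

# Invented per-sample scores, keyed by the same metric names.
per_sample = {"exact_match": [1, 0, 1, 1], "max_score": [0.2, 0.9, 0.4]}

results = {name: aggregate(per_sample[name]) for name, aggregate in aggregation.items()}
```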

construct_requests

( formatted_doc: Doc context: str document_id_seed: str current_task_name: str ) → dict[RequestType, List[Request]]

Parameters

  • formatted_doc (Doc) — Formatted document almost straight from the dataset.
  • context (str) — Context, which is the few-shot examples followed by the query.
  • document_id_seed (str) — Index of the document in the task appended with the seed used for the few shot sampling.
  • current_task_name (str) — Name of the current task.

Returns

dict[RequestType, List[Request]]

List of requests.

Constructs a list of requests from the task based on the given parameters.

eval_docs

( ) → list[Doc]

Returns

list[Doc]

Evaluation documents.

Returns the evaluation documents.

fewshot_docs

( ) → list[Doc]

Returns

list[Doc]

Documents that will be used for few shot examples. One document = one few shot example.

Returns the few shot documents. If the few shot documents are not available, it gets them from the few shot split or the evaluation split.

get_first_possible_fewshot_splits

( available_splits: list[str] | tuple[str, ...] number_of_splits: int = 1 ) → list[str]

Parameters

  • number_of_splits (int, optional) — Number of splits to return. Defaults to 1.

Returns

list[str]

List of the first available fewshot splits.

Checks the possible few-shot split keys in order (train keys first, then validation keys), matches them against the available splits, and returns the first matches.
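
The ordering described above can be sketched as a standalone function. This is not the library's implementation: the exact key list, the substring matching, and the function name are assumptions chosen to illustrate the "train first, then validation" preference.

```python
def first_possible_fewshot_splits(available_splits, number_of_splits=1):
    """Return up to `number_of_splits` split names, preferring train over validation."""
    # Assumed preference order; the real list of keys may differ.
    preferred_keys = ["train", "validation"]
    selected = []
    for key in preferred_keys:
        for split in available_splits:
            # Substring match, so a split such as "train_small" counts as a train split.
            if key in split and split not in selected:
                selected.append(split)
    return selected[:number_of_splits]


one_split = first_possible_fewshot_splits(["test", "validation", "train"])
two_splits = first_possible_fewshot_splits(["test", "validation", "train"], 2)
```

With only a "test" split available, this sketch returns an empty list; what the real method does in that situation is not stated here.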

load_datasets

( tasks: list dataset_loading_processes: int = 1 )

Parameters

  • tasks (list) — A list of tasks.
  • dataset_loading_processes (int, optional) — Number of processes to use for dataset loading. Defaults to 1.

Load datasets from the Hugging Face Hub for the given tasks.

PromptManager

class lighteval.tasks.prompt_manager.PromptManager

( task: LightevalTask lm: LightevalModel )

doc_to_fewshot_sorting_class

( formatted_doc: Doc ) → str

Parameters

  • formatted_doc (Doc) — Formatted document.

Returns

str

Class of the fewshot document

In some cases, when selecting few-shot samples, we want to use specific document classes that need to be specified separately from the target. For example, a document whose gold is a JSON object might use only one of the JSON keys to define sorting classes for few-shot samples. Otherwise, the gold is used.
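
A minimal sketch of that fallback, assuming a stand-in document type: the SketchDoc class and its fewshot_sorting_class field name are hypothetical and only mirror the behaviour described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchDoc:
    """Hypothetical stand-in document; field names are assumptions."""

    choices: list
    gold_index: int
    fewshot_sorting_class: Optional[str] = None


def sorting_class(doc: SketchDoc) -> str:
    """Use the explicit sorting class when present, otherwise the gold answer."""
    if doc.fewshot_sorting_class is not None:
        return doc.fewshot_sorting_class
    return doc.choices[doc.gold_index]


# The gold is a JSON string, so a single key's value serves as the class.
json_doc = SketchDoc(['{"label": "positive", "score": 0.9}'], 0, "positive")
plain_doc = SketchDoc(["no", "yes"], 1)
```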

doc_to_target

( formatted_doc: Doc ) → str

Parameters

  • formatted_doc (Doc) — Formatted document.

Returns

str

Target of the document, which is the correct answer for a document.

Returns the target of the given document.

doc_to_text

( doc: Doc return_instructions: bool = False ) → str

Parameters

  • doc (Doc) — Document class, containing the query and the instructions.

Returns

str

Query of the document without the instructions.

Returns the query of the document without the instructions. If the document has instructions, it removes them from the query.
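
One way to picture the instruction removal is the sketch below. It assumes the instruction is a prefix of the query, which is an assumption of this example; the helper name is hypothetical and the real method may locate and strip instructions differently.

```python
def strip_instruction(query, instruction=None):
    """Return the query with a leading instruction removed, if there is one."""
    # Assumption: the instruction, when present, sits at the start of the query.
    if instruction and query.startswith(instruction):
        return query[len(instruction):]
    return query


with_instruction = strip_instruction(
    "Answer briefly.\nIs the sun hot?", "Answer briefly.\n"
)
without_instruction = strip_instruction("Is the sun hot?")
```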

Registry

class lighteval.tasks.registry.Registry

( cache_dir: typing.Optional[str] = None custom_tasks: typing.Union[str, pathlib.Path, module, NoneType] = None )

The Registry class is used to manage the task registry and get task classes.

expand_task_definition

( task_definition: str ) → list[str]

Parameters

  • task_definition (str) — Task definition to expand. In format:
    • suite|task
    • suite|task_superset (e.g., lighteval|mmlu, which runs all the mmlu subtasks)

Returns

list[str]

List of task names (suite|task)
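
The expansion can be pictured as a lookup from a superset name to its member tasks. In this sketch the supersets table, the subtask names, and the function name are all illustrative assumptions; the real Registry builds this information from its own task table.

```python
def expand_definition_sketch(task_definition, supersets):
    """Expand a superset into its subtasks; leave a plain task untouched."""
    if task_definition in supersets:
        return list(supersets[task_definition])
    return [task_definition]


# Hypothetical superset table with made-up subtask names.
supersets = {"lighteval|mmlu": ["lighteval|mmlu:anatomy", "lighteval|mmlu:astronomy"]}

expanded = expand_definition_sketch("lighteval|mmlu", supersets)
single = expand_definition_sketch("lighteval|boolq", supersets)
```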

get_task_dict

( task_names: list ) → Dict[str, LightevalTask]

Parameters

  • task_names (List[str]) — A list of task names (suite|task).

Returns

Dict[str, LightevalTask]

A dictionary containing the tasks.

Get a dictionary of tasks based on the task name list (suite|task).

Notes:

  • Each task in task_names will be instantiated with the corresponding task class.

get_task_instance

( task_name: str ) → LightevalTask

Parameters

  • task_name (str) — Name of the task (suite|task).

Returns

LightevalTask

Task class.

Raises

ValueError

  • ValueError — If the task is not found in the task registry or custom task registry.

Get the task class based on the task name (suite|task).

print_all_tasks

( )

Print all the tasks in the task registry.

Requests

class lighteval.tasks.requests.Request

( task_name: str sample_index: int request_index: int context: str metric_categories: list )

Parameters

  • task_name (str) — The name of the task.
  • sample_index (int) — The index of the example.
  • request_index (int) — The index of the request.
  • context (str) — The context for the request.
  • metric_categories (list[MetricCategory]) — All the metric categories which concern this request

Represents a request for a specific task, example, and request within that example in the evaluation process. For example, in the task “boolq”, one example is “Is the sun hot?” and the requests for that example are “Is the sun hot? Yes” and “Is the sun hot? No”.
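
The boolq example above can be sketched with a stand-in dataclass. SketchRequest and build_requests are hypothetical names; only the field names task_name, sample_index, request_index, and context come from the signature, while choice is borrowed from LoglikelihoodRequest.

```python
from dataclasses import dataclass


@dataclass
class SketchRequest:
    """Hypothetical stand-in mirroring a few Request fields."""

    task_name: str
    sample_index: int
    request_index: int
    context: str
    choice: str


def build_requests(task_name, sample_index, context, choices):
    """Create one request per candidate continuation of the same example."""
    return [
        SketchRequest(task_name, sample_index, request_index, context, choice)
        for request_index, choice in enumerate(choices)
    ]


requests = build_requests("boolq", 0, "Is the sun hot?", [" Yes", " No"])
```

Both requests share the same sample_index and differ only in request_index and choice, which is the relationship the description above draws between an example and its requests.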

class lighteval.tasks.requests.LoglikelihoodRequest

( task_name: str sample_index: int request_index: int context: str metric_categories: list choice: str tokenized_context: list = None tokenized_continuation: list = None )

Parameters

  • choice (str) — The choice to evaluate the log-likelihood for.
  • request_type (RequestType) — The type of the request (LOGLIKELIHOOD).

Represents a request for log-likelihood evaluation.

class lighteval.tasks.requests.LoglikelihoodSingleTokenRequest

( task_name: str sample_index: int request_index: int context: str metric_categories: list choices: list tokenized_context: list = None tokenized_continuation: list = None )

Parameters

  • choices (list[str]) — The list of token choices.
  • request_type (RequestType) — The type of the request.

Represents a request for calculating the log-likelihood of a single token. Faster because we can get all the loglikelihoods in one pass.
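
Why a single pass suffices can be seen in a small sketch: when every choice is one token, one set of next-token logits already contains the score of each choice. The logits and token ids below are invented, and the helper names are not part of lighteval.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    highest = max(logits)
    log_norm = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return [x - log_norm for x in logits]


def single_token_loglikelihoods(next_token_logits, choice_token_ids):
    """Read the log-probability of every single-token choice from one forward pass."""
    log_probs = log_softmax(next_token_logits)
    return [log_probs[token_id] for token_id in choice_token_ids]


# Invented logits over a 4-token vocabulary; the choices are token ids 0 and 2.
scores = single_token_loglikelihoods([2.0, 0.5, -1.0, 0.0], [0, 2])
best_choice = max(range(len(scores)), key=scores.__getitem__)
```

A multi-token choice would instead need one scored continuation per choice, which is the case LoglikelihoodRequest covers.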

class lighteval.tasks.requests.LoglikelihoodRollingRequest

( task_name: str sample_index: int request_index: int context: str metric_categories: list tokenized_context: list = None tokenized_continuation: list = None )

Represents a request for log-likelihood rolling evaluation.

Inherits from the base Request class.
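
As a hedged illustration of what a rolling log-likelihood usually means (scoring a whole text token by token, each token conditioned on the ones before it), the sketch below sums conditional log-probabilities under a toy uniform model. The function names and the model are assumptions, and context-window handling is left out.

```python
import math


def rolling_loglikelihood(token_ids, cond_log_prob):
    """Sum log p(token | preceding tokens) over the whole sequence."""
    total = 0.0
    for position, token in enumerate(token_ids):
        total += cond_log_prob(token_ids[:position], token)
    return total


def uniform_model(prefix, token, vocab_size=4):
    """Toy model assigning equal probability to each of `vocab_size` tokens."""
    return -math.log(vocab_size)


total = rolling_loglikelihood([3, 1, 2], uniform_model)
```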

class lighteval.tasks.requests.GreedyUntilRequest

( task_name: str sample_index: int request_index: int context: str metric_categories: list stop_sequence: typing.Union[str, tuple[str], list[str]] generation_size: typing.Optional[int] generation_grammar: typing.Optional[huggingface_hub.inference._generated.types.text_generation.TextGenerationInputGrammarType] = None tokenized_context: list = None num_samples: int = None do_sample: bool = False use_logits: bool = False )

Parameters

  • stop_sequence (str | tuple[str] | list[str]) — The sequence or sequences of tokens that indicate when to stop generating text.
  • generation_size (int) — The maximum number of tokens to generate.
  • generation_grammar (TextGenerationInputGrammarType) — The grammar to generate completion according to. Currently only available for TGI models.
  • request_type (RequestType) — The type of the request, set to RequestType.GREEDY_UNTIL.

Represents a request for generating text using the Greedy-Until algorithm.
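
The idea behind greedy-until generation can be sketched as a loop that keeps appending the model's top token until a stop sequence appears or generation_size tokens have been produced. The scripted toy model and both function names are hypothetical, and trimming the stop sequence from the output is a choice made for this sketch.

```python
def greedy_until(next_token, context, stop_sequences, generation_size):
    """Append the model's top token until a stop sequence or the size limit is hit."""
    generated = ""
    for _ in range(generation_size):
        generated += next_token(context + generated)
        for stop in stop_sequences:
            cut = generated.find(stop)
            if cut != -1:
                # Trimming the stop sequence is a choice made for this sketch.
                return generated[:cut]
    return generated


def scripted_model(tokens):
    """Hypothetical toy model that replays a fixed list of tokens."""
    remaining = iter(tokens)
    return lambda _text: next(remaining)


answer = greedy_until(scripted_model(["4", ".", "\n", "Q"]), "2+2=", ["\n"], 10)
capped = greedy_until(scripted_model(["4", ".", "\n", "Q"]), "2+2=", ["\n"], 1)
```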

class lighteval.tasks.requests.GreedyUntilMultiTurnRequest

( task_name: str sample_index: int request_index: int context: str metric_categories: list stop_sequence: str generation_size: int use_logits: bool = False )

Parameters

  • stop_sequence (str) — The sequence of tokens that indicates when to stop generating text.
  • generation_size (int) — The maximum number of tokens to generate.
  • request_type (RequestType) — The type of the request, set to RequestType.GREEDY_UNTIL.

Represents a request for generating text using the Greedy-Until algorithm.

Datasets

class lighteval.data.DynamicBatchDataset

( requests: list num_dataset_splits: int )

get_original_order

( new_arr: list ) → list

Parameters

  • new_arr (list) — Array containing any kind of data that needs to be reset in the original order.

Returns

list

new_arr in the original order.

Get the original order of the data.
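
Restoring the original order only requires remembering where each sorted item came from. The sketch below shows that bookkeeping with hypothetical helper names; sorting by descending length is an assumption chosen for the example.

```python
def sort_with_original_positions(items, key):
    """Sort items and also return, for each sorted slot, its original index."""
    order = sorted(range(len(items)), key=lambda i: key(items[i]))
    return [items[i] for i in order], order


def restore_original_order(new_arr, order):
    """Place each element of new_arr back at the index it had before sorting."""
    original = [None] * len(new_arr)
    for sorted_position, original_position in enumerate(order):
        original[original_position] = new_arr[sorted_position]
    return original


requests = ["bb", "a", "cccc"]
sorted_requests, order = sort_with_original_positions(requests, key=lambda r: -len(r))
# Pretend the "results" computed on the sorted data are the item lengths.
results_sorted = [len(r) for r in sorted_requests]
results = restore_original_order(results_sorted, order)
```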

get_split_start_end

( split_id: int ) → tuple

Parameters

  • split_id (int) — The ID of the split.

Returns

tuple

A tuple containing the start and end indices of the split.

Get the start and end indices of a dataset split.

splits_start_end_iterator

( ) → tuple

Yields

tuple

Iterator that yields the start and end indices of each dataset split. Also updates the starting batch size for each split (trying to double the batch every time we move to a new split).
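
A minimal sketch of how start and end indices for evenly sized splits could be computed is shown below. The ceiling-based split size and the function name are assumptions of the sketch, and the batch-size doubling mentioned above is not modelled.

```python
import math


def split_limits(total_size, num_dataset_splits):
    """Return (start, end) index pairs covering `total_size` items."""
    if total_size == 0:
        return []
    # Assumed policy: equal-sized splits, with a shorter final split if needed.
    split_size = math.ceil(total_size / num_dataset_splits)
    return [
        (start, min(start + split_size, total_size))
        for start in range(0, total_size, split_size)
    ]


limits = split_limits(10, 4)
```

Each pair is half-open, so a split covers the items at indices start up to, but not including, end.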

class lighteval.data.LoglikelihoodDataset

( requests: list num_dataset_splits: int )

class lighteval.data.LoglikelihoodSingleTokenDataset

( requests: list num_dataset_splits: int )

class lighteval.data.GenerativeTaskDataset

( requests: list num_dataset_splits: int )

init_split_limits

( num_dataset_splits ) → type

Parameters

  • num_dataset_splits (int) — Number of splits to divide the dataset into.

Returns

type

description

Initialises the split limits based on generation parameters. The splits are used to estimate time remaining when evaluating, and in the case of generative evaluations, to group similar samples together.

For generative tasks, self._sorting_criteria outputs:

  • a boolean (whether the generation task uses logits)
  • a list (the stop sequences)
  • the item length (the actual size sorting factor).

In the current function, we create evaluation groups by generation parameters (logits and eos), so that samples with similar properties get batched together afterwards. The samples will then be further organised by length in each split.
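
The grouping described above can be sketched with plain dictionaries standing in for requests. The dictionary keys, the descending-length ordering, and the helper names are assumptions of this example, not the library's internals.

```python
from itertools import groupby


def sorting_criteria(request):
    """Return (uses logits, stop sequences, length factor) for one request."""
    # Negative length so that longer contexts come first within a group.
    return (
        request["use_logits"],
        tuple(request["stop_sequence"]),
        -len(request["context"]),
    )


def group_into_splits(requests):
    """Group requests sharing generation parameters, sorted by length inside each group."""
    ordered = sorted(requests, key=sorting_criteria)
    return [
        list(members)
        for _params, members in groupby(ordered, key=lambda r: sorting_criteria(r)[:2])
    ]


requests = [
    {"use_logits": False, "stop_sequence": ["\n"], "context": "aa"},
    {"use_logits": True, "stop_sequence": ["\n"], "context": "a"},
    {"use_logits": False, "stop_sequence": ["."], "context": "aaa"},
    {"use_logits": False, "stop_sequence": ["\n"], "context": "aaaa"},
]
splits = group_into_splits(requests)
```

Here each group becomes its own split, so requests that share a logits setting and stop sequences are batched together.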

class lighteval.data.GenerativeTaskDatasetNanotron

( requests: list num_dataset_splits: int )

class lighteval.data.GenDistributedSampler

( dataset: Dataset num_replicas: typing.Optional[int] = None rank: typing.Optional[int] = None shuffle: bool = True seed: int = 0 drop_last: bool = False )

A distributed sampler that copies the last element only when drop_last is False, so that only a small amount of padding is added to the batches, since our samples are sorted by length.
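
The padding strategy can be sketched without any distributed setup: pad the index list by repeating the last index until it divides evenly across replicas, then give each rank a strided slice. The function name is hypothetical and shuffling is omitted.

```python
import math


def replica_indices(dataset_size, num_replicas, rank, drop_last=False):
    """Return the dataset indices one replica would process."""
    indices = list(range(dataset_size))
    if drop_last:
        # Drop the tail so that every replica gets the same number of samples.
        per_replica = dataset_size // num_replicas
        indices = indices[: per_replica * num_replicas]
    else:
        # Pad by repeating the last index instead of wrapping to the start.
        per_replica = math.ceil(dataset_size / num_replicas)
        padding = per_replica * num_replicas - dataset_size
        indices += indices[-1:] * padding
    return indices[rank::num_replicas]


rank0 = replica_indices(5, num_replicas=2, rank=0)
rank1 = replica_indices(5, num_replicas=2, rank=1)
```

Repeating the last element keeps the padded sample close in length to its neighbours when the data is sorted by length, whereas wrapping around to the first element would pair a sample from one end of the ordering with a batch from the other end.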
