---
language:
- en
---
# TalkBank Batchalign CHATUtterance
CHATUtterance is a series of BERT-derived models for utterance segmentation, released by the TalkBank project. The models are trained on the utterance diarization samples from [The Michigan Corpus of Academic Spoken English](https://ca.talkbank.org/access/MICASE.html).
## Usage
The models can be used directly as BERT-class token classification models by following the [instructions from Hugging Face](https://huggingface.co/docs/transformers/tasks/token_classification). Feel free to inspect [this file](https://github.com/TalkBank/batchalign/blob/73ec04761ed3ee2eba04ba0cf14dc898f88b72f7/baln/utokengine.py#L85-L94) for a sense of what the classes mean. Alternatively, to get the full analysis possible with the model, combine it with the TalkBank Batchalign suite of analysis software, [available here](https://github.com/talkbank/batchalign2), using `transcribe` mode.
Target labels:
- `0`: regular form
- `1`: start of utterance/capitalized word
- `2`: end of declarative utterance (end this utterance with a `.`)
- `3`: end of interrogative utterance (end this utterance with a `?`)
- `4`: end of exclamatory utterance (end this utterance with a `!`)
- `5`: break in the utterance; depending on orthography one can insert a `,`
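
As a rough illustration of how the labels above might be applied, here is a minimal sketch in Python. It is not the Batchalign implementation: the function names `decode_labels` and `predict_labels` are hypothetical, and taking the prediction of each word's first sub-token as the word's label is an assumption. `decode_labels` is pure Python; `predict_labels` needs `transformers` and `torch` installed and expects you to pass the model's repository ID yourself.

```python
# Sketch only: helper names and the first-sub-token rule are assumptions,
# not part of Batchalign or of this model card.

# Labels 2, 3, and 4 close an utterance with the given punctuation mark.
END_PUNCT = {2: ".", 3: "?", 4: "!"}


def decode_labels(words, labels):
    """Turn per-word labels (0-5) into a list of punctuated utterances."""
    utterances, current = [], []
    for word, label in zip(words, labels):
        if label == 1:
            # Start of utterance / capitalized word.
            word = word[:1].upper() + word[1:]
        elif label == 5:
            # Break in the utterance: insert a comma.
            word = word + ","
        current.append(word)
        if label in END_PUNCT:
            utterances.append(" ".join(current) + END_PUNCT[label])
            current = []
    if current:
        # Trailing words with no end-of-utterance label are kept as-is.
        utterances.append(" ".join(current))
    return utterances


def predict_labels(words, model_id):
    """Predict one label per word with a token classification checkpoint.

    Assumption: each word takes the prediction of its first sub-token.
    Long inputs are not chunked or truncated here.
    """
    # Imported lazily so decode_labels works without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    encoded = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoded).logits
    predictions = logits.argmax(dim=-1)[0].tolist()

    labels = [0] * len(words)
    seen = set()
    # word_ids() maps each sub-token position to its source word index.
    for position, word_id in enumerate(encoded.word_ids(0)):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            labels[word_id] = predictions[position]
    return labels


words = ["hello", "there", "how", "are", "you", "well", "i", "am", "fine"]
labels = [1, 2, 1, 0, 3, 5, 0, 0, 2]
print(decode_labels(words, labels))
# ['Hello there.', 'How are you?', 'well, i am fine.']
```

To run on real model output, replace the hand-written `labels` with `predict_labels(words, model_id)`, where `model_id` is this model's Hugging Face repository ID.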