OA conversation 12B model checkpoint 1k

I accidentally ran a checkout on my config and in-progress code changes, so I have lost track of the list of datasets and the limits I set in the all-datasets mix. It's not a finished product, but it does context selection pretty well.

When I resumed training with a new config, the newer checkpoints weren't as good as I expected, so I abandoned the resumed run and decided to upload this one instead.

How to start a conversation with the model:

Go to Open Assistant's model_training section, where we have a tool for starting a simple conversation locally:

python -m tools.model_chat --model_path theblackcat102/pythia-12B-dedup-1000
> How many continents on Earth
1. There are 7 continents on Earth, Antarctica being the only continent not supported by a landmass. They are: Asia, Africa, North America, South America, Europe, Australia, and Antarctica.
2. Antarctica is a continent situated on the southern tip of South America, between Africa and Australia. It is the northernmost continent and the coldest inhabited place in the world.
3. The continents are a result of plate tectonics, a process whereby the earths crust moves horizontally, and so is able to break and join in order to make new landmasses. The continents are the result of the different plates of the earth moving apart (the Atlantic plate being the oldest, moving away from Africa and the Pacific being the younger, with Europe being the newest.)
> Whats the population on Earth
As of 2019, the total population on Earth is estimated to be around 7,600,000,000 people, with 4,500,000,000 living in developed countries, 1,400,000,000 living in developing countries, and 1,000,000,000 living as refugees, displaced, or in exile.
> 地球上有多少人
总人口估计约为7亿六千万人,总人均寿命约为76岁, 所有领域的发展水平分别为7万千人左右,占世界总人口的四
分之一,2亿五千万人左右,占世界总人口的十分之一。 目前,全球四、五分之一的人居住在发展中国家,占世界发展中国家
比例约为8万亿人,占全球发展中国家和脱浪,脱脂的人比率约为7亿亿
> 地球上有幾個大陸板塊
1. 地球上有7個大陸板塊,包括亞洲,拉丁美洲,北美洲,歐洲,澳洲,和亞洲島。
2. 亞洲島是南美洲海岸的島嶼。 它位於亞洲板塊的外部,大約在拉丁美洲海島嶼和歐洲海岸。
3. 大陸板塊是由地震、自然流和地球在不斷的移动和分裂、约束和解放而产生的结果。
> 幫我把這句話翻譯成中文:a brown fox jumps over the lazy dog
a 黑色的狐跳過一隻睡覺的狗。

Some tests in multilingual conversation. Each question starts a fresh conversation with no history.

> Siapa ni (Malay or Indonesian)
Joi disini bercakap dengan cara terbaik dari pengeluaran pendapatan, pembayaran utang, pinjaman, dan tindakan pembayaran, karena ini adalah cara yang sangat efektif untuk membantu orang berkebimbangan dengan mempertimbangkannya dan mengelola pendapatan.
> いらっしゃいます (Japanese)
empty token
> 누구세요 (Korean)
You may be able to help me by explaining your symptoms to me.
> 你好
请问你有关“Joi 的实用助手”如何实现某些行为么?

As of now, results are very random and may vary.

Issues I am trying to tackle here

  • Language

Pythia was trained on the Pile, an English-based corpus with a sprinkling of non-English words. So translation plays a huge role in joining the byte-level token sets so that the model learns some kind of pattern for non-English languages.

  • History context selection

Make sure the heads learn to decide what to ignore and what to attend to, based on what's relevant to the latest question. I have some ideas, and one of them is to randomly prepend irrelevant context to the sequence prefix.
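A minimal sketch of the random-prepend idea, not the actual training code. It assumes each training sample is a dict with a "history" list of (question, answer) pairs plus the current "question" and "answer". The function name `prepend_irrelevant_context` and the parameters `pool`, `prob`, and `max_turns` are made up for illustration.

```python
import random


def prepend_irrelevant_context(sample, pool, rng, prob=0.5, max_turns=2):
    """Return a copy of `sample` with unrelated turns prepended to its history.

    sample: dict with "history" (list of (question, answer) pairs),
            "question" and "answer" (hypothetical layout).
    pool:   list of (question, answer) pairs taken from other conversations.
    rng:    a random.Random instance, so the augmentation is reproducible.
    """
    history = list(sample["history"])
    if pool and rng.random() < prob:
        # Pick between 1 and max_turns distractor turns from the pool.
        k = rng.randint(1, min(max_turns, len(pool)))
        distractors = rng.sample(pool, k)
        # Distractors go first; the real history stays closest to the question.
        history = distractors + history
    # The target question and answer are left untouched.
    return {**sample, "history": history}
```

The answer the model is trained on stays the same, so the only way to keep the loss low is to ignore the prepended turns. Calling it as `prepend_irrelevant_context(sample, pool, random.Random(0))` during batch construction would be one way to apply it.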

FYI: Does anyone have any idea whether the Pythia series can do CoT?

Warning

This model is only 1% complete and has not yet been fully cooked. I am uploading it here for comparison, as these models are taking up more space than I would like.

The model was trained with SFT only.

Many thanks

  • LAION & EleutherAI: basically a dozen individuals who banded together to make open research possible in this age of conflict and greed
  • Stability.ai: this project wouldn't be possible without their compute resources
  • Teams and contributors at Open Assistant: who put their time into this project after their day jobs or whatever
  • Hugging Face: for the storage and Spaces here