Model

The model is Pythia 160M (Biderman et al., 2023), an autoregressive, GPT-NeoX-based language model with 12 layers and 12 attention heads.
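As a sanity check on the architecture, the parameter count can be estimated from the layer count. The sketch below is illustrative only: the hidden size (768), the padded vocabulary size (50,304), the 4x MLP expansion, and the untied input/output embeddings are assumptions taken from the published Pythia 160M configuration, not from this card, and `neox_param_count` is a hypothetical helper name. The number of attention heads does not change the count, since the heads only partition the hidden dimension.

```python
# Rough parameter count for a GPT-NeoX-style decoder.
# ASSUMPTIONS (not stated in this card): hidden size 768, padded vocabulary
# of 50,304 entries, 4x MLP expansion, untied input/output embeddings.

def neox_param_count(n_layers=12, d_model=768, vocab_size=50_304):
    d_ff = 4 * d_model  # assumed MLP expansion factor
    per_layer = (
        2 * (2 * d_model)                          # two LayerNorms (weight + bias each)
        + (d_model * 3 * d_model + 3 * d_model)    # fused query/key/value projection
        + (d_model * d_model + d_model)            # attention output projection
        + (d_model * d_ff + d_ff)                  # MLP up-projection
        + (d_ff * d_model + d_model)               # MLP down-projection
    )
    embeddings = 2 * vocab_size * d_model          # input embedding + output head
    final_norm = 2 * d_model                       # final LayerNorm
    return n_layers * per_layer + embeddings + final_norm


total = neox_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to roughly 162M parameters, with close to half of them in the two embedding matrices.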

Training

The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
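The effective batch size is the product of the per-device batch size, the number of gradient accumulation steps, and the number of devices. The card reports only the product (4), so the particular split used below is hypothetical, as is the `TrainingSetup` name; the sketch simply records the stated hyperparameters and derives the upper bound on tokens seen per optimizer update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    # Values stated in this card.
    epochs: int = 2
    max_seq_len: int = 2048
    learning_rate: float = 1e-4
    # HYPOTHETICAL split: the card gives only the effective batch size of 4.
    per_device_batch_size: int = 1
    grad_accum_steps: int = 4
    num_devices: int = 1

    @property
    def effective_batch_size(self) -> int:
        # Sequences contributing to one optimizer update.
        return self.per_device_batch_size * self.grad_accum_steps * self.num_devices

    @property
    def max_tokens_per_update(self) -> int:
        # Upper bound: reached only if every sequence is full length.
        return self.effective_batch_size * self.max_seq_len


setup = TrainingSetup()
print(setup.effective_batch_size, setup.max_tokens_per_update)
```

With an effective batch size of 4 and 2048-token sequences, each update sees at most 8,192 tokens.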

Training loss: 3.22

Validation loss: 3.16

Test loss: 3.16
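If these losses are mean per-token cross-entropy values in nats (an assumption; the card does not state the unit or the reduction), they convert to perplexity by exponentiation. A minimal sketch, with `perplexity` as a hypothetical helper name:

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    # Perplexity is exp(loss) when the loss is the mean negative
    # log-likelihood per token, measured in nats (assumed here).
    return math.exp(mean_cross_entropy_nats)


losses = {"train": 3.22, "validation": 3.16, "test": 3.16}
for split, loss in losses.items():
    print(f"{split}: loss={loss:.2f}  perplexity={perplexity(loss):.1f}")
```

Under that assumption, a loss of 3.16 corresponds to a perplexity of about 23.6, and 3.22 to about 25.0.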

Training data

Text-only English dataset combining equal portions of spontaneous speech, staged dialogues, and non-conversational text, comprising the following corpora:

1- Spontaneous Speech (4.2M tokens):

CHILDES by MacWhinney (2000)

BNC Spoken

Switchboard by Godfrey et al. (1992)

CallHome by Canavan et al. (1997)

CallFriend by Canavan et al. (1996a and 1996b)

2- Staged Dialogues (4.2M tokens):

Topical Chat by Gopalakrishnan et al. (2023)

Persona Chat by Zhang et al. (2018)

Daily Dialogue by Li et al. (2017)

3- Non-conversational (4.2M tokens):

FineWeb-Edu by Penedo et al. (2024)

Simple Wikipedia from Wikipedia Dump

KidLM by Nayeem and Rafiei (2024)

Total 12.6M tokens (approximately 10M words)
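The composition figures above can be checked with a few lines of arithmetic. The sketch uses only the rounded numbers reported in this card; the variable names are illustrative, and the tokens-per-word ratio is derived from the approximate 10M-word figure, so it is itself approximate.

```python
# Token counts per portion, as reported in this card (rounded to 0.1M).
portions = {
    "spontaneous speech": 4_200_000,
    "staged dialogues": 4_200_000,
    "non-conversational": 4_200_000,
}

total_tokens = sum(portions.values())
shares = {name: count / total_tokens for name, count in portions.items()}

# The card gives "approximately 10M words"; the ratio is therefore approximate.
approx_words = 10_000_000
tokens_per_word = total_tokens / approx_words

print(f"total tokens: {total_tokens:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"tokens per word: ~{tokens_per_word:.2f}")
```

Each portion contributes one third of the 12.6M tokens, and the tokenizer produces roughly 1.26 tokens per word on this mixture.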

The CHILDES, BNC Spoken, and Switchboard portions were distributed by the BabyLM committee as part of the challenge's training data (Warstadt et al., 2023). All other corpora are publicly available for research purposes and can be accessed through their respective websites. The raw training data is not redistributed with this model; users wishing to reproduce the training setup should obtain each corpus from its official source.

References

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.

Canavan, A., Graff, D., & Zipperlen, G. (1997). CALLHOME American English Speech. Linguistic Data Consortium.

Canavan, A., & Zipperlen, G. (1996a). CALLFRIEND American English-Non-Southern Dialect LDC96S46. Philadelphia: Linguistic Data Consortium.

Canavan, A., & Zipperlen, G. (1996b). CALLFRIEND American English-Southern Dialect LDC96S47. Philadelphia: Linguistic Data Consortium.

Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992, March). SWITCHBOARD: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 517-520). IEEE.

Gopalakrishnan, K., Hedayatnia, B., Chen, Q., Gottardi, A., Kwatra, S., Venkatesh, A., ... & Hakkani-Tur, D. (2023). Topical-Chat: Towards knowledge-grounded open-domain conversations. arXiv preprint arXiv:2308.11995.

Li, Y., Su, H., Shen, X., Li, W., Cao, Z., & Niu, S. (2017, November). DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 986-995).

MacWhinney, B. (2000). The CHILDES project: The database (Vol. 2). Psychology Press.

Nayeem, M. T., & Rafiei, D. (2024, November). KidLM: Advancing language models for children–early insights and future directions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 4813-4836).

Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, 30811-30849.

Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., ... & Cotterell, R. (2023, December). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning (pp. 1-34).

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., & Weston, J. (2018, July). Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2204-2213).


