Sebastian Gabarain

Locutusque

AI & ML interests

Pushing performance in small language models

Organizations

Locutusque's activity

replied to their post 8 days ago

Being uncensored doesn't directly improve performance. The DPOP algorithm improved performance on, I believe, every benchmark. In other words, Neural Chat has higher benchmark scores than Orca.

replied to their post 9 days ago

Neural Chat is uncensored because the data it was trained on contains Toxic DPO.

replied to lorinma's post 15 days ago
posted an update 26 days ago
Introducing llama-3-neural-chat-v2.2-8b! This powerful conversational AI model builds on Meta's Llama 3, fine-tuned by Locutusque for enhanced performance in coding, math & writing.

Locutusque/llama-3-neural-chat-v2.2-8B
posted an update about 1 month ago
I created a Twitter account a while back, and I finally decided to make it public: SebastianG74019. For those of you following @Locutusque on Twitter, that is not me! 😂
replied to their post 2 months ago

You're right. I did mention in the dataset card that it doesn't match the size of the Cerebrum dataset; closing that gap is something I'm going to work toward in the future. For now, this dataset serves as a way to test how I would go about structuring such a dataset. First I'm trying to achieve the same performance, then I'll work on structuring it similarly to the Cerebrum dataset. Thank you for holding me accountable on this.

posted an update 2 months ago
Exciting news! 🎉 I've created the OpenCerebrum datasets, open-source alternatives to Aether Research's proprietary Cerebrum dataset.

The first, OpenCerebrum SFT, is a text-generation and question-answering dataset with ~1.2M examples, curated from sources like Open-Orca, glaiveai, camel-ai, and more! 📚

The second, OpenCerebrum DPO, is a smaller dataset with ~21k examples, focused on direct preference optimization (DPO). It's curated from sources like jondurbin, argilla, grimulkan, and others. 📊

Both datasets are licensed under Apache-2.0 and are available in English. They're ready for use in your projects, and I welcome any feedback for future improvements! 🚀

Locutusque/OpenCerebrum-dpo
Locutusque/OpenCerebrum-SFT
Locutusque/OpenCerebrum-1.0-7b-SFT
Locutusque/OpenCerebrum-1.0-7b-DPO
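For context, DPO-style records pair a prompt with a preferred and a rejected response, and a trainer compares the two completions. A minimal sketch of consuming one such record (the field names `prompt`/`chosen`/`rejected` and the example content are illustrative assumptions, not taken from the OpenCerebrum-dpo card):

```python
# Hypothetical DPO record; the real column names in OpenCerebrum-dpo may differ.
record = {
    "prompt": "Explain why the sky appears blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter most (Rayleigh scattering).",
    "rejected": "The sky reflects the color of the ocean.",
}

def to_preference_pair(rec: dict) -> tuple[str, str]:
    """Build the two full sequences a DPO-style trainer compares."""
    preferred = rec["prompt"] + "\n" + rec["chosen"]
    dispreferred = rec["prompt"] + "\n" + rec["rejected"]
    return preferred, dispreferred

preferred, dispreferred = to_preference_pair(record)
```

Both sequences share the prompt prefix; only the completion differs, which is what lets the preference objective isolate response quality.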
posted an update 3 months ago
🚀 Excited to unveil the Augmented ARC-Challenge Dataset with Chain-of-Thought Reasoning! 🧠✨

📚 Created by enhancing the ARC dataset with AI-generated reasoning from Google's Gemini Pro, this resource aims to improve question answering models' ability to tackle complex science queries.

🔍 Features:
- 1068 training examples
- Detailed reasoning steps for nuanced understanding
- Questions spanning physics, chemistry, biology, & more!

🌟 Ideal for benchmarking QA models, enhancing model interpretability, and studying in-context examples.

🔗 Dive in and help your models learn the art of reasoning!

🔎 Explore more: Locutusque/arc-cot
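One common way to use chain-of-thought QA examples like these is to fold the reasoning steps into the training prompt between the question and the final answer. A minimal sketch (the field names `question`/`reasoning`/`answer` and the sample row are assumptions for illustration; the actual columns in arc-cot may differ):

```python
# Hypothetical ARC-CoT row; check the dataset card for the real column names.
example = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "reasoning": "Color is directly visible; hardness, streak, and density all require physical tests.",
    "answer": "color",
}

def format_cot_prompt(ex: dict) -> str:
    """Render question, reasoning, and answer as one training string."""
    return (
        f"Question: {ex['question']}\n"
        f"Reasoning: {ex['reasoning']}\n"
        f"Answer: {ex['answer']}"
    )

prompt = format_cot_prompt(example)
```

Keeping the reasoning before the answer encourages the model to generate its rationale first instead of jumping straight to a label.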
posted an update 3 months ago
🚀 Introducing UltraTextbooks v2: The Ultimate Educational NLP Dataset! 📚

I've expanded the dataset to include an even wider range of high-quality textbooks, with a special focus on machine learning, mathematics, and coding. 💻🧮

With over 3 million examples and 6 GB of data, UltraTextbooks v2 is your go-to resource for training advanced language models and developing cutting-edge educational applications. 🎓

Explore the dataset on Hugging Face and unlock the power of AI in education! 🔓

Locutusque/UltraTextbooks-2.0
replied to their post 4 months ago

Yes, I’ll open a discussion in the repository where you can ask questions about the dataset.

posted an update 4 months ago
🚨📢🚀 Introducing Hercules-v2.0! A robust, multifaceted dataset for advanced models to excel in specialized domains. 🔬🌌📚🚀

📈 1.3M examples from sources derived from OpenHermes-2.5, covering Biology, Physics, Math, CS, Instruction Following, Function Calling, and Roleplay.

🔬 Enhance natural language understanding and processing in diverse domains.

🚀 Develop models for complex instructions, function calls, and roleplay scenarios.

📄 Licensed under Apache-2.0.

Thank you to all contributors and to the creator of OpenHermes-2.5! 🎉

Check it out here: Locutusque/hercules-v2.0

📣 Update: After fine-tuning Mistral 7B on 100,000 examples of Hercules-v2.0, it earns an average score of 62 on Open LLM Leaderboard, outperforming OpenHermes-2.5 and OpenChat-3.5. 🎉

Check out this model here: Locutusque/Hercules-2.0-Mistral-7B
posted an update 4 months ago
Introducing the "UltraTextbooks" dataset 🚀📚
Check it out here: Locutusque/UltraTextbooks
📘 A comprehensive collection of high-quality synthetic and human-written textbooks
👨‍🎓 Spanning various subjects and programming languages
🔧 Designed for advanced NLP tasks like language modeling, educational QA, text summarization, and content generation for educational purposes
🚀 Future expansions planned with additional data sources to enhance the corpus
👇 Data composition highlights 👇
- Blend of synthetic and human-written material
- Includes topics from general education to specialized areas
- Structured with field "text"
🧩 Data collection from various Hugging Face datasets, guided by a diverse and comprehensive curation rationale
🚧 Limitations may exist, so report any issues you encounter
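Since each record carries a single "text" field, preparing the corpus for language modeling mostly means splitting long documents into fixed-size training chunks. A minimal, illustrative sketch (character-based for simplicity; a real pipeline would chunk by tokens, and the chunk sizes here are arbitrary assumptions):

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 128) -> list[str]:
    """Split one long "text" field into overlapping fixed-size chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 5000  # stand-in for one textbook record's "text" field
chunks = chunk_text(doc)
```

The small overlap between consecutive chunks preserves context that would otherwise be cut at chunk boundaries.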
posted an update 4 months ago
Hello everyone,
This is my first post! I've also decided to release a dataset that I had been keeping private for a while, mainly because I wasn't sure whether it was actually good. I would greatly appreciate it if someone could fine-tune some larger models on it and evaluate the results. Named Hercules-v1.0, it is a turbo-charged version of teknium's openhermes, generated by augmenting its data sources. Learn more in the dataset card: Locutusque/hercules-v1.0