AI & ML interests
Accelerate the frontier of AI development with enterprise-grade, deeply curated datasets engineered to enhance pre-training, alignment, and real-world performance.
Sample coding datasets for benchmarking and domain-specific AI models.
Sample datasets of dual-channel call center audio, with separate agent and customer channels, for ASR, diarization, and conversational AI training.
- InfoBayAI/English_United_States_Call_Center_Audio_Dataset_Dual_Channel
- InfoBayAI/English_United_Kingdom_Call_Center_Audio_Dataset_Dual_Channel
- InfoBayAI/Hindi_Call_Center_Audio_Dataset_Dual_Channel
- InfoBayAI/Arabic_Call_Center_Audio_Dataset_Dual_Channel
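For readers working with dual-channel recordings like these, a minimal sketch of separating the two speakers into mono tracks may be useful before running ASR or diarization. It assumes 16-bit stereo PCM WAV input, which this page does not confirm, and the helper name `split_channels` is hypothetical. Which channel carries the agent and which the customer is not stated here either, so check the dataset card before labeling the outputs.

```python
import io
import struct
import wave


def split_channels(stereo_wav: bytes):
    """Split a 16-bit stereo WAV (as bytes) into two mono WAVs: (left, right).

    Assumption: the input is uncompressed 16-bit PCM with exactly two channels.
    """
    with wave.open(io.BytesIO(stereo_wav), "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Stereo PCM is interleaved: L0 R0 L1 R1 ...
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    left, right = samples[0::2], samples[1::2]

    def to_mono_wav(channel) -> bytes:
        # Write one channel back out as a mono WAV at the original sample rate.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(struct.pack(f"<{len(channel)}h", *channel))
        return buf.getvalue()

    return to_mono_wav(left), to_mono_wav(right)
```

Each returned value is a complete mono WAV file in memory, so it can be written to disk or passed to a speech recognizer independently of the other channel.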
Sample datasets from a 6.5M+ enterprise-grade Q&A corpus across STEM and Non-STEM domains, built for LLM training, instruction tuning, and evaluation.
Sample dataset from an enterprise-grade medical corpus built for clinical AI, diagnosis support, and healthcare LLM training.
- InfoBayAI/MRI-Radiology-Reports-Without-Findings-Dataset
- InfoBayAI/CT-Scan-Radiology-Reports-Without-Findings-Dataset
- InfoBayAI/CT-Scan-Radiology-Reports-With-Findings-Dataset
- InfoBayAI/X-Ray-Radiology-Reports-Without-Findings-Dataset
- InfoBayAI/Arabic-Call-Center-Audio-Dataset-Single-Channel
- InfoBayAI/Somali-Call-Center-Audio-Dataset-Single-Channel
- InfoBayAI/Mizo-Call-Center-Audio-Dataset-Single-Channel
- InfoBayAI/French-Call-Center-Audio-Dataset-Single-Channel
Sample from a podcast audio dataset of diverse, real-world spoken content, designed for ASR and conversational AI training.
Sample of a 2.6+ word textbook corpus across 39K+ books, 5K+ subjects, and 15 languages for LLM training and multilingual knowledge modeling.
Sample dataset from a multilingual image corpus covering medical, STEM, Non-STEM, automobile, and complex domains for computer vision and multimodal AI.