Open source project for data preparation for GenAI applications
Updated
May 15, 2026 - HTML
Python package for Customizable Data Preprocessing Pipelines
This repository contains code for preprocessing text data from PDF and DOCX files for use with GPT-3. It includes steps such as tokenization, removal of stop words and punctuation, and formatting for GPT-3 input.
Video quality assessment and filtering pipeline for ML training data. Automatically handles format conversion, scene segmentation, face detection, text detection, and audio-video sync checking. Supports 127 concurrent processes with checkpoint recovery.
A lightweight framework for collecting and processing data from HTTP POST requests
A data processing library for better understanding of industrial data.
Understand and implement a decision tree
This repository contains sample text data-preparation code using NeMo Curator for pre-training or synthetic data generation
Pymimic3 is a scalable experimentation platform for MIMIC-III, featuring ready-to-run models, fully tested utilities for concept drift research, and a parallelized, configurable data pipeline.
Project for Machine Learning Data Mining course
Comparative study of CNN and SVM models for facial emotion recognition on CK+ (CNN: 96%, SVM: 97%) and RAF-DB (CNN: 85%, SVM: 77%) datasets. Full data preprocessing pipeline in Python. Published in Springer 2024.
This work highlights my contribution as an "ML Engineer" at "adorsho praniSheb" (an ML-based agro-farming company in Bangladesh), where I was assigned the task of designing the preprocessing pipeline.
Machine learning models cannot be directly applied to raw data. This desktop application consists of a central server and two client servers. The central server sends raw data to the clients, where the data is preprocessed and prepared to be fed to the machine learning model.