# GitHub Commit Collector

A robust, production-ready Python backend system for collecting and structuring commit-level data from GitHub repositories. This tool fetches detailed commit information including file changes, author details, team mapping, and comprehensive statistics.
## Features

### GitHub API Integration
- Personal Access Token (PAT) authentication
- Automatic rate limit handling with intelligent backoff
- Retry logic for failed requests
- Support for REST API endpoints
### Comprehensive Data Collection
- Repository metadata
- Commit SHA, message, date, and URL
- Author name, GitHub username, and email
- Team mapping (configurable)
- File-level changes (additions, deletions, modifications)
- Line-by-line change tracking
- Branch information
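Conceptually, each collected commit can be pictured as a nested record like the sketch below. The field names here are illustrative stand-ins, not the tool's exact schema — `output/SCHEMA.md` documents the real one.

```python
# Illustrative record shapes for one collected commit; field names are
# assumptions for this sketch -- output/SCHEMA.md documents the real schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FileChange:
    path: str
    additions: int
    deletions: int

@dataclass
class CommitRecord:
    sha: str
    message: str
    date: str                       # ISO 8601 commit date
    author_name: str
    github_username: Optional[str]  # None when GitHub can't resolve the author
    team: str                       # from the team mapping, or the default team
    branch: str
    files: List[FileChange] = field(default_factory=list)

record = CommitRecord(
    sha="abc123", message="Fix login bug", date="2024-03-05T14:02:11Z",
    author_name="Alice", github_username="alice", team="backend",
    branch="main", files=[FileChange("src/auth.py", additions=12, deletions=3)],
)
```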
### Flexible Filtering
- Date range filtering
- Author filtering
- Team filtering
- Branch selection
- Repository-specific configurations
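Date filtering, for example, reduces to a comparison on the ISO 8601 date prefix once commits are in hand; `in_range` below is an assumed helper name for this sketch, not the tool's API.

```python
# Minimal sketch of client-side date-range filtering. ISO 8601 date strings
# compare correctly as plain strings, so no datetime parsing is needed here.
from typing import Optional

def in_range(commit_iso: str,
             date_from: Optional[str] = None,
             date_to: Optional[str] = None) -> bool:
    """True if the commit date falls inside the optional [date_from, date_to] window."""
    day = commit_iso[:10]  # e.g. "2024-06-15" from "2024-06-15T12:00:00Z"
    if date_from and day < date_from:
        return False
    if date_to and day > date_to:
        return False
    return True
```

For server-side filtering, the GitHub commits endpoint also accepts `since` and `until` query parameters, which avoids fetching commits only to discard them.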
### Multiple Output Formats
- JSON (structured, hierarchical)
- CSV (flat, spreadsheet-ready)
- Separate file for detailed file changes
- Collection metadata and statistics
### Team Management
- Map GitHub users to teams via YAML configuration
- Support for multiple teams
- Default team for unmapped users
- Repository-specific team configurations
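Once `teams.yaml` is parsed (for example with PyYAML), the lookup reduces to inverting the team-to-members mapping with a fallback default. A sketch with illustrative function names, not the tool's actual API:

```python
# Sketch of YAML-driven team mapping; `config` stands in for the parsed
# contents of config/teams.yaml (e.g. via yaml.safe_load).
from typing import Dict, List, Optional

def build_user_index(teams: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {team: [members]} into {username: team} for O(1) lookups."""
    return {user: team for team, members in teams.items() for user in members}

def resolve_team(index: Dict[str, str], default_team: str,
                 username: Optional[str]) -> str:
    """Return the user's team, falling back to the default for unmapped users."""
    if username is None:
        return default_team
    return index.get(username, default_team)

config = {
    "teams": {"backend": ["alice", "bob"], "frontend": ["charlie", "diana"]},
    "default_team": "unassigned",
}
index = build_user_index(config["teams"])
```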
### Production Features
- Structured logging with color output
- Configuration via environment variables and YAML
- Modular, maintainable architecture
- Comprehensive error handling
- Progress tracking and status updates
## Installation

```bash
# Clone or extract the project
cd github_commit_collector

# Install dependencies
pip install -r requirements.txt
```

Copy the example environment file and add your GitHub token:

```bash
cp .env.example .env
```

Edit `.env` and add your GitHub Personal Access Token:

```
GITHUB_TOKEN=ghp_your_token_here
```

How to get a GitHub token:

- Go to GitHub Settings → Developer Settings → Personal Access Tokens
- Generate new token (classic)
- Select scopes: `repo` (for private repos) or `public_repo` (for public only)
- Copy the token to your `.env` file
## Configuration

### Repositories

Edit `config/repositories.yaml`:

```yaml
repositories:
  - url: https://github.com/owner/repo1
    branch: main
    enabled: true
  - url: https://github.com/owner/repo2
    branch: develop
    enabled: true

filters:
  date_from: 2024-01-01  # Optional: ISO 8601 date
  date_to: 2024-12-31    # Optional: ISO 8601 date
```

### Teams

Edit `config/teams.yaml`:
```yaml
teams:
  backend:
    - alice
    - bob
  frontend:
    - charlie
    - diana
  devops:
    - eve

default_team: unassigned
```

## Usage

Test the connection first:

```bash
python src/main.py --test-connection
```

Collect from configured repositories:

```bash
python src/main.py
```

Collect from a specific repository:

```bash
python src/main.py --repo https://github.com/owner/repo --branch main
```

### Basic Collection

```bash
# Collect from all configured repositories
python src/main.py

# Collect from a specific repository
python src/main.py --repo https://github.com/torvalds/linux

# Collect from a specific branch
python src/main.py --repo https://github.com/owner/repo --branch develop
```

### Filtering

```bash
# Filter by date range
python src/main.py --date-from 2024-01-01 --date-to 2024-12-31

# Filter by author
python src/main.py --author octocat

# Filter by team (after collection)
python src/main.py --team backend

# Combine filters
python src/main.py --date-from 2024-06-01 --author alice --team backend
```

### Output Formats

```bash
# JSON output (default)
python src/main.py --format json

# CSV output
python src/main.py --format csv

# Both formats
python src/main.py --format both

# Include detailed file changes in CSV
python src/main.py --format csv --include-file-details

# Include patch/diff content in JSON
python src/main.py --format json --include-patch
```

### Advanced Options

```bash
# Custom output directory
python src/main.py --output-dir /path/to/output

# Debug logging
python src/main.py --log-level DEBUG

# Skip detailed commit data (faster, but no file-level changes)
python src/main.py --no-detailed-commits

# Use custom config directory
python src/main.py --config-dir /path/to/config
```

## Project Structure
```
github_commit_collector/
├── src/
│   ├── main.py               # CLI entry point
│   ├── config_manager.py     # Configuration handling
│   ├── github_client.py      # GitHub API client
│   ├── team_mapper.py        # Team mapping logic
│   ├── commit_processor.py   # Data processing
│   ├── data_collector.py     # Collection orchestrator
│   ├── data_exporter.py      # Export to JSON/CSV
│   ├── models.py             # Data models
│   └── logger.py             # Logging utilities
├── config/
│   ├── repositories.yaml     # Repository configuration
│   └── teams.yaml            # Team mappings
├── output/                   # Generated output files
│   └── SCHEMA.md             # Output schema documentation
├── logs/                     # Application logs
├── .env.example              # Environment variables template
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Architecture

### Components

- ConfigManager: Loads and validates configuration from `.env` and YAML files
- GitHubAPIClient: Handles GitHub API authentication and requests with rate limiting
- TeamMapper: Maps GitHub usernames to team names
- CommitProcessor: Transforms raw API data into structured models
- DataCollector: Orchestrates the collection process
- DataExporter: Exports data to JSON and CSV formats
### Data Flow

```
Configuration → GitHub API → Raw Commits → Processing → Structured Models → Export
```
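The middle stages of this flow can be sketched with stand-in data shaped like the GitHub commits API payload; the real models and classes are richer than shown here.

```python
# Illustrative sketch of the processing and export stages; the real
# CommitProcessor/DataExporter classes are not reproduced here.
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class Commit:  # stand-in for the structured model
    sha: str
    author: str
    additions: int
    deletions: int

def process(raw: List[dict]) -> List[Commit]:
    """Raw GitHub API payloads -> structured models."""
    return [
        Commit(
            sha=r["sha"],
            author=r["commit"]["author"]["name"],
            additions=r["stats"]["additions"],
            deletions=r["stats"]["deletions"],
        )
        for r in raw
    ]

def export(commits: List[Commit]) -> str:
    """Structured models -> JSON export."""
    return json.dumps([asdict(c) for c in commits], indent=2)

raw = [{"sha": "abc123",
        "commit": {"author": {"name": "alice"}},
        "stats": {"additions": 10, "deletions": 2}}]
structured = process(raw)
```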
## Output Files

The collector generates several output files:

JSON:

- `commits_TIMESTAMP.json`: Complete commit data with metadata
- `collection_summary_TIMESTAMP.json`: High-level statistics
- `team_summary_TIMESTAMP.json`: Team-level aggregations
- `repository_stats_TIMESTAMP.json`: Per-repository statistics

CSV:

- `commits_TIMESTAMP.csv`: Flat commit data
- `commits_TIMESTAMP_file_changes.csv`: Detailed file changes (with `--include-file-details`)

See `output/SCHEMA.md` for detailed schema documentation.
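As a sketch of how the flat commits CSV and the separate file-changes CSV relate, here is one commit flattened both ways with the standard `csv` module. Column names are illustrative, not the exact schema.

```python
# Flatten nested commit data two ways: one row per commit, and one row
# per changed file. Column names are assumptions for this sketch.
import csv
import io

commits = [
    {"sha": "abc123", "author": "alice", "team": "backend",
     "files": [{"path": "app.py", "additions": 5, "deletions": 1}]},
]

# One row per commit (commits_TIMESTAMP.csv) ...
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sha", "author", "team", "files_changed"])
writer.writeheader()
for c in commits:
    writer.writerow({"sha": c["sha"], "author": c["author"],
                     "team": c["team"], "files_changed": len(c["files"])})

# ... and one row per file (commits_TIMESTAMP_file_changes.csv).
fbuf = io.StringIO()
fwriter = csv.DictWriter(fbuf, fieldnames=["sha", "path", "additions", "deletions"])
fwriter.writeheader()
for c in commits:
    for f in c["files"]:
        fwriter.writerow({"sha": c["sha"], **f})
```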
## Configuration Reference

### Environment Variables

```
# Required
GITHUB_TOKEN=your_token_here

# Optional - API Configuration
GITHUB_API_URL=https://api.github.com
GITHUB_API_TIMEOUT=30
MAX_RETRIES=3
RATE_LIMIT_BUFFER=10

# Optional - Collection Settings
DEFAULT_BRANCH=main
MAX_COMMITS_PER_REQUEST=100

# Optional - Output Settings
OUTPUT_FORMAT=json
OUTPUT_DIR=output
LOG_LEVEL=INFO
LOG_DIR=logs
```

### repositories.yaml

```yaml
repositories:
  - url: https://github.com/owner/repo
    branch: main       # Optional, defaults to 'main'
    enabled: true      # Optional, defaults to true
    filters:           # Optional, repository-specific filters
      date_from: 2024-01-01
      author: alice

filters:               # Global filters (applied to all repositories)
  date_from: null
  date_to: null
  authors: []
  teams: []
```

### teams.yaml

```yaml
teams:
  team_name:
    - username1
    - username2

default_team: unassigned

# Optional: Repository-specific team assignments
repository_teams:
  owner/repo:
    - team1
    - team2
```

## Rate Limit Handling

The collector intelligently handles GitHub API rate limits:
- Automatic rate limit checking before requests
- Configurable buffer to pause before hitting limit
- Automatic waiting when rate limit is reached
- Detailed logging of rate limit status
Default GitHub API limits:
- 5,000 requests/hour for authenticated users
- 60 requests/hour for unauthenticated users
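The pause-before-the-limit behavior can be sketched from GitHub's documented `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The function below is a simplified stand-in for the client's actual logic, with `sleep` injectable for testing.

```python
# Sketch of pause-before-the-limit handling using GitHub's documented
# X-RateLimit-Remaining / X-RateLimit-Reset headers; the real client's
# policy (RATE_LIMIT_BUFFER etc.) may differ in detail.
import time
from typing import Optional

def wait_if_rate_limited(headers: dict, buffer: int = 10,
                         now: Optional[float] = None,
                         sleep=time.sleep) -> float:
    """Sleep until the window resets when <= `buffer` requests remain;
    return the number of seconds waited."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if remaining > buffer:
        return 0.0
    now = time.time() if now is None else now
    delay = max(reset_at - now, 0) + 1  # +1s safety margin past the reset
    sleep(delay)
    return delay
```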
## Error Handling

The system includes comprehensive error handling:
- API errors: Logged with context and retry logic
- Invalid configurations: Validated on startup
- Missing data: Graceful handling with warnings
- Network issues: Automatic retry with exponential backoff
- Keyboard interrupt: Graceful shutdown
All errors are logged to both console and log files.
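Retry with exponential backoff follows the usual pattern; this generic sketch is not the collector's exact policy, whose parameters come from settings such as `MAX_RETRIES`.

```python
# Generic retry-with-exponential-backoff sketch. `sleep` is injectable
# so the backoff schedule is testable without real waiting.
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(); on exception wait base_delay * 2**attempt, then retry.
    Re-raises the last error once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```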
## Performance

- Batch processing: Fetches commits in batches of 100
- Pagination: Automatically handles paginated responses
- Rate limit awareness: Prevents unnecessary API calls
- Conditional detailed fetching: Skip with `--no-detailed-commits` for faster collection

Tips for large collections:

- Use date filters to limit the number of commits
- Skip detailed commits if you don't need file-level changes
- Collect during off-peak hours to avoid rate limit contention
- Use branch filters to focus on specific development lines
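Pagination on the REST API follows `Link` headers with `rel="next"`; a simplified sketch, with `fetch` standing in for the authenticated HTTP GET:

```python
# Sketch of Link-header pagination for the commits endpoint. GitHub returns
# at most `per_page` (max 100) items per response plus a header such as
# Link: <...page=2>; rel="next". `fetch` is a stand-in that returns
# (parsed JSON list, Link header string).
from typing import Callable, List, Tuple

def collect_commits(fetch: Callable[[str], Tuple[List[dict], str]],
                    url: str, per_page: int = 100) -> List[dict]:
    """Follow rel="next" links, accumulating commit payloads."""
    commits: List[dict] = []
    next_url = f"{url}?per_page={per_page}"
    while next_url:
        body, link_header = fetch(next_url)
        commits.extend(body)
        next_url = None
        for part in link_header.split(","):
            if 'rel="next"' in part:
                next_url = part.split(";")[0].strip(" <>")
        # a rate-limit check would run here before requesting the next page
    return commits
```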
## Troubleshooting

**"GITHUB_TOKEN not found"**

- Ensure the `.env` file exists and contains `GITHUB_TOKEN=...`
- Check that `.env` is in the project root directory

**"Rate limit exceeded"**

- Wait for the rate limit to reset (shown in logs)
- Reduce the number of repositories or the date range
- Use `--no-detailed-commits` to reduce API calls

**"Repository not found"**

- Verify the repository URL is correct
- Ensure your token has access to private repositories (if applicable)
- Check that the repository exists and you have read permissions

**No commits collected**

- Check that date filters aren't too restrictive
- Verify the branch name is correct
- Check whether the repository actually has commits in the specified range
## Security & Privacy

- Tokens: Never commit the `.env` file to version control (it's in `.gitignore`)
- Commit messages: Stored as-is from GitHub
- Email addresses: Collected from commit metadata
- Patch content: Only stored if explicitly requested with `--include-patch`
## Customization

### Custom Team Mapping

Edit `src/team_mapper.py` to implement custom logic:

```python
def get_team(self, username: Optional[str]) -> str:
    # Add custom logic here
    if username and username.endswith("_admin"):
        return "admin"
    return super().get_team(username)
```

### Custom Filters

Edit `src/commit_processor.py` to add filtering logic:

```python
def filter_commits(self, commits: List[CommitData], **kwargs) -> List[CommitData]:
    # Add custom filters
    if kwargs.get("min_changes"):
        commits = [c for c in commits if c.total_changes >= kwargs["min_changes"]]
    return commits
```

### Custom Export Formats

Extend `src/data_exporter.py` to add new export formats:

```python
def export_to_xml(self, commits: List[CommitData]) -> str:
    # Implement XML export
    pass
```

## Requirements

- Python: 3.8 or higher
- GitHub Token: Personal Access Token with appropriate scopes
- Dependencies: Listed in `requirements.txt`
## License

This project is provided as-is for data collection purposes.

## Support

For issues or questions:

- Check the troubleshooting section
- Review the log files in the `logs/` directory
- Verify configuration files are properly formatted
- Test the GitHub API connection with `--test-connection`
## Changelog

- Initial release
- GitHub API integration
- JSON and CSV export
- Team mapping
- Comprehensive filtering
- Rate limit handling