# GitHub Commit Collector

A robust, production-ready Python backend system for collecting and structuring commit-level data from GitHub repositories. This tool fetches detailed commit information including file changes, author details, team mapping, and comprehensive statistics.
## Features

### GitHub API Integration
- Personal Access Token (PAT) authentication
- Automatic rate limit handling with intelligent backoff
- Retry logic for failed requests
- Support for REST API endpoints
### Comprehensive Data Collection
- Repository metadata
- Commit SHA, message, date, and URL
- Author name, GitHub username, and email
- Team mapping (configurable)
- File-level changes (additions, deletions, modifications)
- Line-by-line change tracking
- Branch information
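Conceptually, each collected commit can be pictured as a nested record like the sketch below. The field names here are illustrative stand-ins, not the tool's exact schema — `output/SCHEMA.md` documents the real one.

```python
# Illustrative record shapes for one collected commit; field names are
# assumptions for this sketch -- output/SCHEMA.md documents the real schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FileChange:
    path: str
    additions: int
    deletions: int

@dataclass
class CommitRecord:
    sha: str
    message: str
    date: str                       # ISO 8601 commit date
    author_name: str
    github_username: Optional[str]  # None when GitHub can't resolve the author
    team: str                       # from the team mapping, or the default team
    branch: str
    files: List[FileChange] = field(default_factory=list)

record = CommitRecord(
    sha="abc123", message="Fix login bug", date="2024-03-05T14:02:11Z",
    author_name="Alice", github_username="alice", team="backend",
    branch="main", files=[FileChange("src/auth.py", additions=12, deletions=3)],
)
```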
### Flexible Filtering
- Date range filtering
- Author filtering
- Team filtering
- Branch selection
- Repository-specific configurations
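Date filtering, for example, reduces to a comparison on the ISO 8601 date prefix once commits are in hand; `in_range` below is an assumed helper name for this sketch, not the tool's API.

```python
# Minimal sketch of client-side date-range filtering. ISO 8601 date strings
# compare correctly as plain strings, so no datetime parsing is needed here.
from typing import Optional

def in_range(commit_iso: str,
             date_from: Optional[str] = None,
             date_to: Optional[str] = None) -> bool:
    """True if the commit date falls inside the optional [date_from, date_to] window."""
    day = commit_iso[:10]  # e.g. "2024-06-15" from "2024-06-15T12:00:00Z"
    if date_from and day < date_from:
        return False
    if date_to and day > date_to:
        return False
    return True
```

For server-side filtering, the GitHub commits endpoint also accepts `since` and `until` query parameters, which avoids fetching commits only to discard them.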
### Multiple Output Formats
- JSON (structured, hierarchical)
- CSV (flat, spreadsheet-ready)
- Separate file for detailed file changes
- Collection metadata and statistics
### Team Management
- Map GitHub users to teams via YAML configuration
- Support for multiple teams
- Default team for unmapped users
- Repository-specific team configurations
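Once `teams.yaml` is parsed (for example with PyYAML), the lookup reduces to inverting the team-to-members mapping with a fallback default. A sketch with illustrative function names, not the tool's actual API:

```python
# Sketch of YAML-driven team mapping; `config` stands in for the parsed
# contents of config/teams.yaml (e.g. via yaml.safe_load).
from typing import Dict, List, Optional

def build_user_index(teams: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {team: [members]} into {username: team} for O(1) lookups."""
    return {user: team for team, members in teams.items() for user in members}

def resolve_team(index: Dict[str, str], default_team: str,
                 username: Optional[str]) -> str:
    """Return the user's team, falling back to the default for unmapped users."""
    if username is None:
        return default_team
    return index.get(username, default_team)

config = {
    "teams": {"backend": ["alice", "bob"], "frontend": ["charlie", "diana"]},
    "default_team": "unassigned",
}
index = build_user_index(config["teams"])
```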
### Production Features
- Structured logging with color output
- Configuration via environment variables and YAML
- Modular, maintainable architecture
- Comprehensive error handling
- Progress tracking and status updates
## Installation

```bash
# Clone or extract the project
cd github_commit_collector

# Install dependencies
pip install -r requirements.txt
```

Copy the example environment file and add your GitHub token:

```bash
cp .env.example .env
```

Edit `.env` and add your GitHub Personal Access Token:

```
GITHUB_TOKEN=ghp_your_token_here
```

How to get a GitHub token:

- Go to GitHub Settings → Developer Settings → Personal Access Tokens
- Generate new token (classic)
- Select scopes: `repo` (for private repos) or `public_repo` (for public only)
- Copy the token to your `.env` file
## Configuration

### Repositories

Edit `config/repositories.yaml`:

```yaml
repositories:
  - url: https://github.com/owner/repo1
    branch: main
    enabled: true
  - url: https://github.com/owner/repo2
    branch: develop
    enabled: true

filters:
  date_from: 2024-01-01  # Optional: ISO 8601 date
  date_to: 2024-12-31    # Optional: ISO 8601 date
```

### Teams

Edit `config/teams.yaml`:
```yaml
teams:
  backend:
    - alice
    - bob
  frontend:
    - charlie
    - diana
  devops:
    - eve

default_team: unassigned
```

## Usage

Test the connection first:

```bash
python src/main.py --test-connection
```

Collect from configured repositories:

```bash
python src/main.py
```

Collect from a specific repository:

```bash
python src/main.py --repo https://github.com/owner/repo --branch main
```

### Basic Collection

```bash
# Collect from all configured repositories
python src/main.py

# Collect from a specific repository
python src/main.py --repo https://github.com/torvalds/linux

# Collect from a specific branch
python src/main.py --repo https://github.com/owner/repo --branch develop
```

### Filtering

```bash
# Filter by date range
python src/main.py --date-from 2024-01-01 --date-to 2024-12-31

# Filter by author
python src/main.py --author octocat

# Filter by team (after collection)
python src/main.py --team backend

# Combine filters
python src/main.py --date-from 2024-06-01 --author alice --team backend
```

### Output Formats

```bash
# JSON output (default)
python src/main.py --format json

# CSV output
python src/main.py --format csv

# Both formats
python src/main.py --format both

# Include detailed file changes in CSV
python src/main.py --format csv --include-file-details

# Include patch/diff content in JSON
python src/main.py --format json --include-patch
```

### Advanced Options

```bash
# Custom output directory
python src/main.py --output-dir /path/to/output

# Debug logging
python src/main.py --log-level DEBUG

# Skip detailed commit data (faster, but no file-level changes)
python src/main.py --no-detailed-commits

# Use custom config directory
python src/main.py --config-dir /path/to/config
```

## Project Structure
```
github_commit_collector/
├── src/
│   ├── main.py               # CLI entry point
│   ├── config_manager.py     # Configuration handling
│   ├── github_client.py      # GitHub API client
│   ├── team_mapper.py        # Team mapping logic
│   ├── commit_processor.py   # Data processing
│   ├── data_collector.py     # Collection orchestrator
│   ├── data_exporter.py      # Export to JSON/CSV
│   ├── models.py             # Data models
│   └── logger.py             # Logging utilities
├── config/
│   ├── repositories.yaml     # Repository configuration
│   └── teams.yaml            # Team mappings
├── output/                   # Generated output files
│   └── SCHEMA.md             # Output schema documentation
├── logs/                     # Application logs
├── .env.example              # Environment variables template
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Architecture

### Components

- ConfigManager: Loads and validates configuration from `.env` and YAML files
- GitHubAPIClient: Handles GitHub API authentication and requests with rate limiting
- TeamMapper: Maps GitHub usernames to team names
- CommitProcessor: Transforms raw API data into structured models
- DataCollector: Orchestrates the collection process
- DataExporter: Exports data to JSON and CSV formats
### Data Flow

```
Configuration → GitHub API → Raw Commits → Processing → Structured Models → Export
```
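The middle stages of this flow can be sketched with stand-in data shaped like the GitHub commits API payload; the real models and classes are richer than shown here.

```python
# Illustrative sketch of the processing and export stages; the real
# CommitProcessor/DataExporter classes are not reproduced here.
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class Commit:  # stand-in for the structured model
    sha: str
    author: str
    additions: int
    deletions: int

def process(raw: List[dict]) -> List[Commit]:
    """Raw GitHub API payloads -> structured models."""
    return [
        Commit(
            sha=r["sha"],
            author=r["commit"]["author"]["name"],
            additions=r["stats"]["additions"],
            deletions=r["stats"]["deletions"],
        )
        for r in raw
    ]

def export(commits: List[Commit]) -> str:
    """Structured models -> JSON export."""
    return json.dumps([asdict(c) for c in commits], indent=2)

raw = [{"sha": "abc123",
        "commit": {"author": {"name": "alice"}},
        "stats": {"additions": 10, "deletions": 2}}]
structured = process(raw)
```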
## Output Files

The collector generates several output files:

JSON:

- `commits_TIMESTAMP.json`: Complete commit data with metadata
- `collection_summary_TIMESTAMP.json`: High-level statistics
- `team_summary_TIMESTAMP.json`: Team-level aggregations
- `repository_stats_TIMESTAMP.json`: Per-repository statistics

CSV:

- `commits_TIMESTAMP.csv`: Flat commit data
- `commits_TIMESTAMP_file_changes.csv`: Detailed file changes (with `--include-file-details`)

See `output/SCHEMA.md` for detailed schema documentation.
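As a sketch of how the flat commits CSV and the separate file-changes CSV relate, here is one commit flattened both ways with the standard `csv` module. Column names are illustrative, not the exact schema.

```python
# Flatten nested commit data two ways: one row per commit, and one row
# per changed file. Column names are assumptions for this sketch.
import csv
import io

commits = [
    {"sha": "abc123", "author": "alice", "team": "backend",
     "files": [{"path": "app.py", "additions": 5, "deletions": 1}]},
]

# One row per commit (commits_TIMESTAMP.csv) ...
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sha", "author", "team", "files_changed"])
writer.writeheader()
for c in commits:
    writer.writerow({"sha": c["sha"], "author": c["author"],
                     "team": c["team"], "files_changed": len(c["files"])})

# ... and one row per file (commits_TIMESTAMP_file_changes.csv).
fbuf = io.StringIO()
fwriter = csv.DictWriter(fbuf, fieldnames=["sha", "path", "additions", "deletions"])
fwriter.writeheader()
for c in commits:
    for f in c["files"]:
        fwriter.writerow({"sha": c["sha"], **f})
```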
## Configuration Reference

### Environment Variables

```
# Required
GITHUB_TOKEN=your_token_here

# Optional - API Configuration
GITHUB_API_URL=https://api.github.com
GITHUB_API_TIMEOUT=30
MAX_RETRIES=3
RATE_LIMIT_BUFFER=10

# Optional - Collection Settings
DEFAULT_BRANCH=main
MAX_COMMITS_PER_REQUEST=100

# Optional - Output Settings
OUTPUT_FORMAT=json
OUTPUT_DIR=output
LOG_LEVEL=INFO
LOG_DIR=logs
```

### repositories.yaml

```yaml
repositories:
  - url: https://github.com/owner/repo
    branch: main       # Optional, defaults to 'main'
    enabled: true      # Optional, defaults to true
    filters:           # Optional, repository-specific filters
      date_from: 2024-01-01
      author: alice

filters:               # Global filters (applied to all repositories)
  date_from: null
  date_to: null
  authors: []
  teams: []
```

### teams.yaml

```yaml
teams:
  team_name:
    - username1
    - username2

default_team: unassigned

# Optional: Repository-specific team assignments
repository_teams:
  owner/repo:
    - team1
    - team2
```

## Rate Limit Handling

The collector intelligently handles GitHub API rate limits:
- Automatic rate limit checking before requests
- Configurable buffer to pause before hitting limit
- Automatic waiting when rate limit is reached
- Detailed logging of rate limit status
Default GitHub API limits:
- 5,000 requests/hour for authenticated users
- 60 requests/hour for unauthenticated users
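The pause-before-the-limit behavior can be sketched from GitHub's documented `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The function below is a simplified stand-in for the client's actual logic, with `sleep` injectable for testing.

```python
# Sketch of pause-before-the-limit handling using GitHub's documented
# X-RateLimit-Remaining / X-RateLimit-Reset headers; the real client's
# policy (RATE_LIMIT_BUFFER etc.) may differ in detail.
import time
from typing import Optional

def wait_if_rate_limited(headers: dict, buffer: int = 10,
                         now: Optional[float] = None,
                         sleep=time.sleep) -> float:
    """Sleep until the window resets when <= `buffer` requests remain;
    return the number of seconds waited."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if remaining > buffer:
        return 0.0
    now = time.time() if now is None else now
    delay = max(reset_at - now, 0) + 1  # +1s safety margin past the reset
    sleep(delay)
    return delay
```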
## Error Handling

The system includes comprehensive error handling:
- API errors: Logged with context and retry logic
- Invalid configurations: Validated on startup
- Missing data: Graceful handling with warnings
- Network issues: Automatic retry with exponential backoff
- Keyboard interrupt: Graceful shutdown
All errors are logged to both console and log files.
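Retry with exponential backoff follows the usual pattern; this generic sketch is not the collector's exact policy, whose parameters come from settings such as `MAX_RETRIES`.

```python
# Generic retry-with-exponential-backoff sketch. `sleep` is injectable
# so the backoff schedule is testable without real waiting.
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(); on exception wait base_delay * 2**attempt, then retry.
    Re-raises the last error once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```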
## Performance

- Batch processing: Fetches commits in batches of 100
- Pagination: Automatically handles paginated responses
- Rate limit awareness: Prevents unnecessary API calls
- Conditional detailed fetching: Skip with `--no-detailed-commits` for faster collection

Tips for large collections:

- Use date filters to limit the number of commits
- Skip detailed commits if you don't need file-level changes
- Collect during off-peak hours to avoid rate limit contention
- Use branch filters to focus on specific development lines
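Pagination on the REST API follows `Link` headers with `rel="next"`; a simplified sketch, with `fetch` standing in for the authenticated HTTP GET:

```python
# Sketch of Link-header pagination for the commits endpoint. GitHub returns
# at most `per_page` (max 100) items per response plus a header such as
# Link: <...page=2>; rel="next". `fetch` is a stand-in that returns
# (parsed JSON list, Link header string).
from typing import Callable, List, Tuple

def collect_commits(fetch: Callable[[str], Tuple[List[dict], str]],
                    url: str, per_page: int = 100) -> List[dict]:
    """Follow rel="next" links, accumulating commit payloads."""
    commits: List[dict] = []
    next_url = f"{url}?per_page={per_page}"
    while next_url:
        body, link_header = fetch(next_url)
        commits.extend(body)
        next_url = None
        for part in link_header.split(","):
            if 'rel="next"' in part:
                next_url = part.split(";")[0].strip(" <>")
        # a rate-limit check would run here before requesting the next page
    return commits
```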
## Troubleshooting

**"GITHUB_TOKEN not found"**

- Ensure the `.env` file exists and contains `GITHUB_TOKEN=...`
- Check that `.env` is in the project root directory

**"Rate limit exceeded"**

- Wait for the rate limit to reset (shown in logs)
- Reduce the number of repositories or the date range
- Use `--no-detailed-commits` to reduce API calls

**"Repository not found"**

- Verify the repository URL is correct
- Ensure your token has access to private repositories (if applicable)
- Check that the repository exists and you have read permissions

**No commits collected**

- Check that date filters aren't too restrictive
- Verify the branch name is correct
- Check whether the repository actually has commits in the specified range
## Security & Privacy

- Tokens: Never commit the `.env` file to version control (it's in `.gitignore`)
- Commit messages: Stored as-is from GitHub
- Email addresses: Collected from commit metadata
- Patch content: Only stored if explicitly requested with `--include-patch`
## Customization

### Custom Team Mapping

Edit `src/team_mapper.py` to implement custom logic:

```python
def get_team(self, username: Optional[str]) -> str:
    # Add custom logic here
    if username and username.endswith("_admin"):
        return "admin"
    return super().get_team(username)
```

### Custom Filters

Edit `src/commit_processor.py` to add filtering logic:

```python
def filter_commits(self, commits: List[CommitData], **kwargs) -> List[CommitData]:
    # Add custom filters
    if kwargs.get("min_changes"):
        commits = [c for c in commits if c.total_changes >= kwargs["min_changes"]]
    return commits
```

### Custom Export Formats

Extend `src/data_exporter.py` to add new export formats:

```python
def export_to_xml(self, commits: List[CommitData]) -> str:
    # Implement XML export
    pass
```

## Requirements

- Python: 3.8 or higher
- GitHub Token: Personal Access Token with appropriate scopes
- Dependencies: Listed in `requirements.txt`
## License

This project is provided as-is for data collection purposes.

## Support

For issues or questions:

- Check the troubleshooting section
- Review the log files in the `logs/` directory
- Verify configuration files are properly formatted
- Test the GitHub API connection with `--test-connection`
## Changelog

- Initial release
- GitHub API integration
- JSON and CSV export
- Team mapping
- Comprehensive filtering
- Rate limit handling