tuberculosis_percentage #1939
base: master
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,47 @@ | ||||||
| # WHO Tuberculosis Percentage Dataset | ||||||
| ## Overview | ||||||
| This dataset provides the percentage of people diagnosed with a new episode of pulmonary TB whose disease was bacteriologically confirmed, sourced from the World Health Organization (WHO) Global Tuberculosis Programme. | ||||||
|
|
||||||
| ## Data Source | ||||||
|
|
||||||
| **Source URL:** | ||||||
| https://data.who.int/indicators/i/1891124/449F55C | ||||||
|
|
||||||
| The data is fetched from the WHO's official Global Tuberculosis Database via their public API. | ||||||
|
|
||||||
| ## How To Download Input Data | ||||||
| To download the latest data, use the provided download script `download_who_tuberculosis.py`. This script fetches the data from the WHO API and merges it with country ISO3 codes to generate `tuberculosisPercentage_input.csv`. | ||||||
|
|
||||||
| **Type of place:** Country. | ||||||
|
|
||||||
| **Statvars:** Tuberculosis - Bacteriologically Confirmed Percentage. | ||||||
|
|
||||||
| **Years:** 1999 to 2024. | ||||||
|
|
||||||
| ## Processing Instructions | ||||||
| To process the Tuberculosis data and generate statistical variables, use the following commands from the project's root `data` directory: | ||||||
|
|
||||||
| **Download input file** | ||||||
| ```bash | ||||||
| python3 statvar_imports/tuberculosis_percentage/tuberculosisPercentage_input.py | ||||||
| ``` | ||||||
|
|
||||||
| **For Test Data Run** | ||||||
| ```bash | ||||||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||||||
| --input_data=statvar_imports/tuberculosis_percentage/test_data/tuberculosisPercentage_input.csv \ | ||||||
| --pv_map=statvar_imports/tuberculosis_percentage/test_data/tuberculosisPercentage_pvmap.csv \ | ||||||
| --output_path=statvar_imports/tuberculosis_percentage/test_data/tuberculosisPercentage_output \ | ||||||
| --config_file=statvar_imports/tuberculosis_percentage/test_data/tuberculosisPercentage_metadata.csv \ | ||||||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||||||
| ``` | ||||||
|
|
||||||
| **For Main data run** | ||||||
| ```bash | ||||||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||||||
| --input_data=statvar_imports/tuberculosis_percentage/tuberculosisPercentage_input.csv \ | ||||||
|
Contributor
The
Suggested change
|
||||||
| --pv_map=statvar_imports/tuberculosis_percentage/tuberculosisPercentage_pvmap.csv \ | ||||||
| --output_path=statvar_imports/tuberculosis_percentage/tuberculosisPercentage_output \ | ||||||
| --config_file=statvar_imports/tuberculosis_percentage/tuberculosisPercentage_metadata.csv \ | ||||||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||||||
| ``` | ||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,58 @@ | ||||||||||||||
| import os | ||||||||||||||
| import requests | ||||||||||||||
| import io | ||||||||||||||
| import pandas as pd | ||||||||||||||
|
|
||||||||||||||
| def download_tb_percentage_data(): | ||||||||||||||
| # 1. Get the Clean Data from the API using the new Indicator ID | ||||||||||||||
| api_url = "https://xmart-api-public.who.int/DATA_/RELAY_TB_DATA" | ||||||||||||||
| params = { | ||||||||||||||
| "$filter": "IND_ID eq '1891124449F55C'", | ||||||||||||||
| "$select": "IND_ID,INDICATOR_NAME,YEAR,COUNTRY,VALUE", | ||||||||||||||
| "$format": "csv" | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| print("1. Fetching clean percentage data from WHO API...") | ||||||||||||||
| api_response = requests.get(api_url, params=params) | ||||||||||||||
|
|
||||||||||||||
| if api_response.status_code != 200: | ||||||||||||||
| print(f"Failed to fetch API data. HTTP {api_response.status_code}") | ||||||||||||||
| return | ||||||||||||||
|
Comment on lines +18 to +20
Contributor
The error checking for the API request can be improved. Using
Suggested change
|
||||||||||||||
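The reviewer's suggested change is cut off on this page, so the following is not the reviewer's text. As an illustration only, one common way to tighten the status check is to let the response raise on HTTP errors and to pass an explicit timeout. The helper name `fetch_csv_text`, the injected `session` argument, and the 30-second timeout are assumptions made for this sketch; `raise_for_status()` and the `timeout` argument are standard parts of the `requests` API.

```python
def fetch_csv_text(session, url, params=None, timeout=30):
    """Return the response body as text, raising on HTTP errors.

    `session` is any object with a requests-style `get` method, for
    example `requests.Session()`. The helper name and the default
    timeout are illustrative assumptions, not part of the PR.
    """
    response = session.get(url, params=params, timeout=timeout)
    # raise_for_status() raises an exception for 4xx/5xx responses, so a
    # failed download stops the script instead of returning silently.
    response.raise_for_status()
    return response.text


# Hypothetical usage inside download_tb_percentage_data():
#
#   import requests
#   csv_text = fetch_csv_text(requests.Session(), api_url, params=params)
#   api_df = pd.read_csv(io.StringIO(csv_text))
```

Note that this is slightly more permissive than the `status_code != 200` test in the diff: `raise_for_status()` only rejects 4xx and 5xx responses.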
|
|
||||||||||||||
| # Load the clean API data into a pandas table | ||||||||||||||
| api_df = pd.read_csv(io.StringIO(api_response.text)) | ||||||||||||||
|
|
||||||||||||||
| # 2. Get ONLY the iso3 code from the master database | ||||||||||||||
| print("2. Fetching country iso3 codes from WHO master database...") | ||||||||||||||
| master_url = "https://extranet.who.int/tme/generateCSV.asp?ds=notifications" | ||||||||||||||
|
|
||||||||||||||
| # We only pull the 'country' (for matching) and 'iso3' columns | ||||||||||||||
| geo_columns = ['country', 'iso3'] | ||||||||||||||
| master_df = pd.read_csv(master_url, usecols=geo_columns).drop_duplicates() | ||||||||||||||
|
Contributor
The call to
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| # 3. Merge the two datasets together based on the country name | ||||||||||||||
| print("3. Merging data and formatting...") | ||||||||||||||
| # The API uses uppercase 'COUNTRY', the master uses lowercase 'country' | ||||||||||||||
| merged_df = pd.merge(api_df, master_df, left_on='COUNTRY', right_on='country', how='left') | ||||||||||||||
|
|
||||||||||||||
| # Drop the duplicate lowercase 'country' column used for joining | ||||||||||||||
| merged_df = merged_df.drop(columns=['country']) | ||||||||||||||
|
|
||||||||||||||
| # Reorder columns so the iso3 code sits right next to the Country name | ||||||||||||||
| final_columns = [ | ||||||||||||||
| 'IND_ID', 'INDICATOR_NAME', 'YEAR', 'COUNTRY', 'iso3', 'VALUE' | ||||||||||||||
| ] | ||||||||||||||
| merged_df = merged_df[final_columns] | ||||||||||||||
|
|
||||||||||||||
| # 4. Save to CSV in a new folder | ||||||||||||||
| output_dir = "statvar_imports/tuberculosis_percentage/input_files" | ||||||||||||||
| filename = os.path.join(output_dir, "tuberculosisPercentage_input.csv") | ||||||||||||||
|
|
||||||||||||||
| os.makedirs(output_dir, exist_ok=True) | ||||||||||||||
|
|
||||||||||||||
| # Save without the pandas index column | ||||||||||||||
| merged_df.to_csv(filename, index=False) | ||||||||||||||
| print(f"Success! Data saved locally as '{filename}'") | ||||||||||||||
|
|
||||||||||||||
| if __name__ == "__main__": | ||||||||||||||
| download_tb_percentage_data() | ||||||||||||||
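Step 3 of the script above joins on the country name with `how='left'`, so any API row whose `COUNTRY` spelling has no counterpart in the master table is kept with an empty `iso3`. A minimal sketch of how such rows could be surfaced before the CSV is written, assuming the same column names as the script; the function name `attach_iso3` is hypothetical and not part of the PR:

```python
import pandas as pd


def attach_iso3(api_df, master_df):
    """Left-join iso3 codes onto the API rows and report unmatched names.

    Returns the merged frame and a sorted list of COUNTRY values that
    found no match in master_df (their iso3 will be missing).
    """
    merged = api_df.merge(
        master_df.drop_duplicates(),
        left_on="COUNTRY",
        right_on="country",
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    unmatched = sorted(
        merged.loc[merged["_merge"] == "left_only", "COUNTRY"].unique()
    )
    # Drop the join key from the right side and the helper column.
    merged = merged.drop(columns=["country", "_merge"])
    return merged, unmatched


# Tiny made-up frames, only to show the behaviour.
api_df = pd.DataFrame({"COUNTRY": ["India", "Atlantis"], "VALUE": [60.0, 50.0]})
master_df = pd.DataFrame({"country": ["India", "India"], "iso3": ["IND", "IND"]})

merged_df, unmatched = attach_iso3(api_df, master_df)
print(unmatched)
```

Whether unmatched names should abort the run or only be logged is a design choice for the import; the sketch just makes them visible.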
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| { | ||
| "import_specifications": [ | ||
| { | ||
| "import_name": "tuberculosis percentage", | ||
| "curator_emails": [ | ||
| "support@datacommons.org" | ||
| ], | ||
| "provenance_url": "https://data.who.int/indicators/i/1891124/449F55C", | ||
| "provenance_description": "Percentage of people diagnosed with a new episode of pulmonary TB whose disease was bacteriologically confirmed", | ||
| "scripts": [ | ||
| "download_who_tuberculosis.py", | ||
| "../../tools/statvar_importer/stat_var_processor.py --input_data=source_files/*.csv --pv_map=tuberculosisPercentage_pvmap.csv --config_file=tuberculosisPercentage_metadata.csv --output_path=tuberculosisPercentage_output" | ||
| ], | ||
| "source_files": [ | ||
| "source_files/*.csv" | ||
| ], | ||
| "import_inputs": [ | ||
| { | ||
| "template_mcf": "tuberculosisPercentage_output.tmcf", | ||
| "cleaned_csv": "tuberculosisPercentage_output.csv", | ||
| "stat_var_mcf": "tuberculosisPercentage_output_stat_vars.mcf" | ||
| } | ||
| ], | ||
| "cron_schedule": "0 0 1 1,4,7,10 *" | ||
| } | ||
| ] | ||
| } |
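The `cron_schedule` value in the manifest above uses the standard five-field cron layout (minute, hour, day of month, month, day of week). Read that way, `0 0 1 1,4,7,10 *` fires at 00:00 on the 1st of January, April, July, and October, i.e. once per quarter. A small sketch that splits the expression into named fields; the helper `describe_cron` is hypothetical and handles only this plain five-field form:

```python
def describe_cron(expr):
    """Split a five-field cron expression into named fields."""
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    fields = expr.split()
    if len(fields) != len(names):
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(names, fields))


schedule = describe_cron("0 0 1 1,4,7,10 *")
# The month field is a comma-separated list of month numbers.
months = [int(m) for m in schedule["month"].split(",")]
print(schedule)
print(months)
```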
The command to download the input file refers to a non-existent script, `tuberculosisPercentage_input.py`. It should point to the `download_who_tuberculosis.py` script.