dbpedia-GSoC2021
Google Summer of Code 2021 DBpedia Healthcare Platform project by Guang Zhang. Project link: https://github.com/dbpedia/healthcare-platform
Proposal
Proposal for Healthcare Platform
Goals
- Research COVID-19 and healthcare datasets, create mappings with SPARQL queries, and upload the results to the DBpedia Databus
- Contribute to the DBpedia mapping ontologies and healthcare field
- Deploy a (DBpedia interlinked) healthcare dataset on the DBpedia Databus
- Community Release Extension for COVID-19 HEALTHCARE
If time allows:
- Create a Dashboard
- Healthcare QA using this dataset
- demo QA using SPARQL
COVID Mapping for Data Sets
Resources
- SPARQL tool: Tarql
- RDF turtle
- Paper for Wikipedia and Wikidata
- CORD-19 RDF dataset release
- CORD-19 Dataset
- SPARQL Practice
- SPARQL query language
- Learn SPARQL: Stardog
- RDF literals
- RDF primer
- QA platform example
- Spiral Model
- DBpedia Web
- Github Repo
- Get familiar with DBpedia SPARQL Databus
- GSoC 2020 Dashboard
- CORD-19: The Covid-19 Open Research Dataset
- Analyse and compare recency/correctness of Wikidata to recency/correctness of Wikipedia/DBpedia
- SPARQL wikidata
- SPARQL w3
Potential Healthcare Datasets:
- COVID cases and deaths worldwide: Datahub Novel Coronavirus 2019 COVID-19
- Pharmaceutical Drug Spending: Datahub Pharmaceutical Drug Spending
- World Vaccination Progress: Kaggle COVID-19 World Vaccination Progress
- World Vaccine Adverse Reactions: Kaggle COVID-19 World Vaccine Adverse Reactions
- Diabetes: Datahub Diabetes
- Covid-19 Vaccine by Country: Kaggle Latest Worldwide Covid19 Vaccine Data
- Vaccine Preventable by Disease name: Kaggle Vaccine Preventable Diseases
- Kaggle World Vaccination Progress
- Kaggle World Health Statistics 2020
- COVID Mapping
- COVID Worldometer
- COVID 19 dataset by date
- DBpedia Live
Meetings
June 15th 2021 Meeting. Discussed SPARQL, RDF Turtle tools, and articles to read. Tasks to do:
- Explore Tarql mapping tools
- Read articles about wikipedia, wikidata, CORD-19 dataset
- Get familiar with SPARQL queries
June 23rd 2021 Meeting. Discussed DBpedia ontology and json2rdf tools
- dbpedia ontology Disease
- JSON2RDF Maven Repo
- CORD-19 RDF Covid-on-the-web
- Dublin Core
- DBLP
Tasks to do:
- Search for RDF and CSV datasets on COVID and Healthcare
- Datahubs: zenodo and datahub
- Update mappings for current COVID and Healthcare resources (i.e. COVID and COVID-Symptoms)
- Learn the different DBpedia types (e.g. dbo, dbp, dbr, dbt, dbc)
July 7th 2021 Meeting. Tasks to do:
- TARQL mapping from CSV to RDF
- Existing mappings updates (e.g. pendamicDeaths to pandemicDeaths)
- Questions/Info relating to Healthcare/COVID (e.g. which vaccines are administered in each country?)
- Search Wikipedia and DBpedia resources in the healthcare field; check mappings
- Build SPARQL queries for searching healthcare info
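The CSV to RDF task above can be illustrated with a small sketch. It is not Tarql itself (Tarql expresses the mapping as a SPARQL CONSTRUCT query); this is a plain Python stand-in that shows the same idea of turning each CSV row into triples. The namespaces, the column names, and the sample row are all made up for illustration and are not taken from the project's mappings.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespaces, chosen only for this sketch.
RESOURCE_NS = "http://example.org/resource/"
PROPERTY_NS = "http://example.org/property/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text, key_column):
    """Emit one N-Triples line per non-empty cell, keyed by key_column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Subject IRI: key value with spaces as underscores, percent-encoded.
        local_name = quote(row[key_column].strip().replace(" ", "_"))
        subject = f"<{RESOURCE_NS}{local_name}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue
            # Column names are assumed to be simple identifiers.
            predicate = f"<{PROPERTY_NS}{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample row, not real data.
sample = "country,vaccines,total_vaccinations\nExample Land,VaccineA,1000\n"
for line in rows_to_ntriples(sample, "country"):
    print(line)
```

A real mapping would use DBpedia ontology properties and typed literals instead of the placeholder namespace and plain strings used here.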
July 13th, July 16th 2021 Meetings. Tasks to do:
- Describe statistics for the fields of the CSV files (e.g. Confirmed Cases and Deaths by Country, Vaccines and Total Vaccine Doses by Country)
- Check existing mappings in healthcare and COVID, and create new mappings where none exist or are linked
- Continue mapping for the CSV files using Tarql
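As a rough illustration of the field statistics task above, the following sketch summarizes each column of a CSV file: how many cells are filled, how many distinct values occur, and the numeric range where the cells parse as numbers. The function name and the choice of statistics are assumptions made for this example, not the project's actual script.

```python
import csv
import io


def describe_columns(csv_text):
    """Return per-column non-empty count, distinct count and numeric min/max."""
    stats = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            entry = stats.setdefault(
                column, {"non_empty": 0, "distinct": set(), "numbers": []}
            )
            if value is None or value.strip() == "":
                continue  # empty cell: the column is known but nothing is counted
            entry["non_empty"] += 1
            entry["distinct"].add(value)
            try:
                entry["numbers"].append(float(value))
            except ValueError:
                pass  # non-numeric cell, counted but left out of min/max
    summary = {}
    for column, entry in stats.items():
        numbers = entry["numbers"]
        summary[column] = {
            "non_empty": entry["non_empty"],
            "distinct": len(entry["distinct"]),
            "min": min(numbers) if numbers else None,
            "max": max(numbers) if numbers else None,
        }
    return summary
```

Calling `describe_columns(open(path, encoding="utf-8").read())` on one of the downloaded CSV files would give a quick overview before writing a mapping for it.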
July 22nd 2021 Meeting. Discussed data set statistics, DBpedia mappings and ontology. Tasks to do:
- Create Mappings for the Kaggle COVID-19 World Vaccination Progress
- Check existing mappings in healthcare and COVID, and create new mappings where none exist or are linked
- Continue mapping for the CSV files using Tarql
July 26th 2021 Meeting. Discussed Tarql Mapping, Databus upload. Tasks to do:
- Convert Git repo on DBpedia (Healthcare Platform) to Git LFS
- Learn lbzip2 compression
- Upload the compressed data set with Git LFS to the DBpedia repo
- Learn DBpedia databus
- Upload the World Vaccination Progress with Databus
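The compression step can be sketched without lbzip2 installed. lbzip2 is a parallel bzip2 implementation; Python's standard `bz2` module writes the same `.bz2` format, only single-threaded, so it is used here as a stand-in. The function name is hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def compress_to_bz2(source):
    """Write '<source>.bz2' next to the source file and return the new path."""
    source = Path(source)
    target = source.with_name(source.name + ".bz2")
    # Stream the file through the compressor so large data sets fit in memory.
    with open(source, "rb") as raw, bz2.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target
```

The original file is left in place, which makes it easy to compare sizes before adding the compressed copy to Git LFS.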
August 2nd 2021 Meeting. Set up SSH access to the DBpedia server together with an SSH key; set up a WebID for the Databus upload. Tasks to do:
- Try again for the DBpedia databus upload
- Prepare data sets for healthcare
- Try again maven databus
August 5th 2021 Meeting. Discussed Tarql mappings. Tasks to do:
- Use Kaggle API for downloading data sets
- Continue mapping
August 9th 2021 Meeting. Discussed Kaggle API and DBpedia Databus upload. Tasks to do:
- Shell scripting and web scraping for checking Kaggle versions
- Download data set only if newer version is available
- Fix Tarql Mapping
- Check again DBpedia web upload for databus
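The "download only if a newer version is available" task can be sketched as a small bookkeeping step. The sketch assumes the remote version number has already been obtained elsewhere (for example by the scraping script) and is passed in as an integer; the state file name, its JSON format, and the function names are hypothetical.

```python
import json
from pathlib import Path


def load_state(state_file):
    """Read the recorded versions, or return an empty mapping on first run."""
    state_file = Path(state_file)
    if state_file.exists():
        return json.loads(state_file.read_text(encoding="utf-8"))
    return {}


def needs_download(state_file, dataset, remote_version):
    """True when nothing is recorded for the dataset or the remote is newer."""
    known = load_state(state_file).get(dataset)
    return known is None or remote_version > known


def record_download(state_file, dataset, remote_version):
    """Store the version that was just downloaded."""
    state = load_state(state_file)
    state[dataset] = remote_version
    Path(state_file).write_text(json.dumps(state, indent=2), encoding="utf-8")
```

A scheduled job could call `needs_download` first, fetch the data set only when it returns `True`, and then call `record_download`.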
August 12th 2021 Meeting. Discussed missing WebID files and fixed the issue by restoring a backup copy; discussed a cron job for auto-updating. Tasks to do:
- Continue version checking for Kaggle data sets
- Re-organize data sets and databus-upload (e.g. raw for CSV files only, naming of folders and data sets)
- Check DBpedia mappings contributions
August 16th 2021 Meeting. Discussed DBpedia Databus progress; schedules and plans for GSoC2021 submission. Tasks to do:
- Tarql mapping for data sets
- Fix mapping issue
- Continue version checking for kaggle data sets
August 18th 2021 Meeting. Discussed databus-upload folder structure, GitHub repo, GSoC2021 submission. Tasks to do:
- Re-organize folder structure
- Download each data set again, only csv files, put them into 2021.08.18 folders
- Run tarql mappings for all again, and put them into folder “input”
- Rename them based on the example (_tag=default.csv.bz2)
- Enter empty pom.xml files and Description .md files
- Github repo re-organize
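The re-organization steps above can be sketched as a helper that prepares one artifact folder. The layout used here (one folder per artifact, a dated version folder inside it, and `pom.xml` plus a `.md` description at the artifact level) is an assumption for illustration; only the `2021.08.18` folder name and the `_tag=default.csv.bz2` suffix come from the task list. The function names are hypothetical.

```python
from pathlib import Path

VERSION = "2021.08.18"  # version folder named in the task list


def databus_file_name(artifact, extension="csv.bz2"):
    """Build '<artifact>_tag=default.<extension>' following the example suffix."""
    return f"{artifact}_tag=default.{extension}"


def prepare_artifact(root, artifact, version=VERSION):
    """Create the folders and empty placeholder files; return the data file path."""
    artifact_dir = Path(root) / artifact
    version_dir = artifact_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    # Empty placeholders, to be filled in by hand later.
    (artifact_dir / "pom.xml").touch()
    (artifact_dir / f"{artifact}.md").touch()
    return version_dir / databus_file_name(artifact)
```

The returned path is where the renamed, compressed file for that artifact would be moved.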