PostgreSQL Data Analyses
Load data into PostgreSQL
1 2 3 4 5 6 7 8 |
cd /var/tmp/ wget http://avinton.com/wp-content/uploads/2016/07/japan_cities.tsv sudo su su postgres psql -d avinton create table japan_cities (prefecture text, city text, ward text, population integer); copy japan_cities from '/var/tmp/japan_cities.tsv' NULL 'NULL'; \q |
Data Cleaning
Connect to PostgreSQL DB using PGAdmin and run the following to clean the data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
create table japan_cities2 as ( select a.*, row_number() over() as id from japan_cities as a ); drop table japan_cities; create table japan_cities as ( select case when prefecture is null then (select c.prefecture from japan_cities2 as c where c.id = (select max(b.id) from japan_cities2 b where b.id < a.id and b.prefecture is not null)) end as prefecture, (select c.city from japan_cities2 as c where c.id = (select max(b.id) from japan_cities2 b where b.id <= a.id and b.city is not null)) as city, ward, population from japan_cities2 as a); drop table japan_cities2; delete from japan_cities where population is null; alter table japan_cities add column city_ward text; update japan_cities set city_ward = 'c' where ward is null; update japan_cities set city_ward = 'w' where ward is not null; select * from japan_cities; |
Exercises
Have a look at the data first and try to understand its structure. Then try these exercises below:
1. Which is the ward with the highest population? (hint: try using order by and limit clause Limit)
2. What is the standard deviation for the city populations?
3. How many cities are there in Hokkaido?
4. How many wards are there in total in Japan?
5. list the prefectures and their corresponding population
6. list the prefectures only in order of descending population
7. list the prefectures and the corresponding city within it with the highest population.
Challenge:
Try to use PostgreSQL’s windowing functions to generate the following:
For each city show the prefecture, city, smallest_ward, largest_ward, total_city_population.
The table should contain 1 line per city and be ordered by city population in descending order.