Friday, March 24, 2017

Air Quality In Poland #06 - Big Data Frame part 2

Since we know how to restructure our data frames and how to concatenate them properly, we can start building one big data frame. To select the interesting pollutants and years of measurements, we first build two lists:

 pollutants = importantPollutants  
 years = sorted(list(dataFiles["year"].unique()))  
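The loops below iterate over pollutantsYears, whose definition isn't shown here. A minimal sketch, assuming it is simply dataFiles restricted to the selected pollutants (the exact filtering in my setup may differ):

 # assumed definition, not shown above: keep only files for the chosen pollutants  
 pollutantsYears = dataFiles[dataFiles["pollutant"].isin(pollutants)]  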

With those in place, we run two nested loops that walk over the relevant files and concatenate or merge the generated data frames:

1:  bigDataFrame = pd.DataFrame()  
2:  for dataYear in years:  
3:    print(dataYear)  
4:    yearDataFrame = pd.DataFrame()  
5:    for index, dataRow in tqdm(pollutantsYears[pollutantsYears["year"] == dataYear].iterrows(), total=len(pollutantsYears[pollutantsYears["year"] == dataYear].index)):  
6:      data = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])  
7:      data = data.rename(columns={"Kod stacji":"Hour"})  # first column holds timestamps, not station codes  
8:    
9:      year = int(dataRow["year"])  
10:      rng = pd.date_range(start = str(year) + '-01-01 01:00:00', end = str(year+1) + '-01-01 00:00:00', freq='H')  
11:    
12:      # workaround for 2006_PM2.5_1g, 2012_PM10_1g, 2012_O3_1g  
13:      try:  
14:        data["Hour"] = rng  # fails if the file's row count doesn't match a full year  
15:      except ValueError:  
16:        print("File {} has some mess with timestamps".format(dataRow["filename"]))  
17:        continue  
18:    
19:      data = data.set_index("Hour")  
20:      data = data.stack()  # station columns become the second index level  
21:      data = pd.DataFrame(data, columns=[dataRow["pollutant"]])  
22:      data.index.set_names(['Hour', 'Station'], inplace=True)  
23:    
24:      yearDataFrame = pd.concat([yearDataFrame, data], axis=1)  # same year, new pollutant column  
25:      
26:    bigDataFrame = pd.concat([bigDataFrame, yearDataFrame])  # row-wise; DataFrame.append is deprecated in newer pandas  

This code is more or less the same as in the previous post, but there are some differences. Line (10) generates a fresh hourly time index for each year, so we don't have to worry about leap years. Lines (12-17), on the other hand, handle three data files that don't start at the first hour of the year. Perhaps I will take care of them later; for now they are skipped.
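As a quick sanity check (my own illustration, not part of the original pipeline), generating the range for a leap year yields 24 extra hourly timestamps:

 rng2011 = pd.date_range(start='2011-01-01 01:00:00', end='2012-01-01 00:00:00', freq='H')  
 rng2012 = pd.date_range(start='2012-01-01 01:00:00', end='2013-01-01 00:00:00', freq='H')  
 print(len(rng2011), len(rng2012))  # 8760 8784, because 2012 is a leap year  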

Why do we need nested loops? In the previous post I worked towards merging data frames containing different pollutants into one. The inner loop does exactly that for a single year: frames for different pollutants share the same (Hour, Station) index, so they are merged column-wise with axis=1. Once a whole year is merged, the outer loop appends it row-wise to the target big data frame. A toy example of both directions follows below.
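Here is a toy sketch (station name and values are made up) showing the two directions of growth:

 import pandas as pd  
 
 # one measurement hour at one hypothetical station  
 idx = pd.MultiIndex.from_tuples([("2015-01-01 01:00", "StationA")], names=["Hour", "Station"])  
 pm10 = pd.DataFrame({"PM10": [31.2]}, index=idx)  
 o3 = pd.DataFrame({"O3": [48.7]}, index=idx)  
 
 year2015 = pd.concat([pm10, o3], axis=1)               # pollutants side by side (columns)  
 bigDataFrame = pd.concat([pd.DataFrame(), year2015])   # years stacked on top of each other (rows)  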

After creating the data frame with all the interesting data points, we should save it to disk, so that later we only have to read it back in order to start the analysis.

  bigDataFrame.to_pickle("../output/bigDataFrame.pkl")  
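Reading it back later is equally simple:

 bigDataFrame = pd.read_pickle("../output/bigDataFrame.pkl")  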

That's all for today, thanks for reading!
