In [145]:
import pandas as pd
Data Load
In [146]:
seoul = pd.read_csv('./data/seoul.csv')
daegu = pd.read_csv('./data/daegu.csv')
print(seoul.columns)
original_col = seoul.columns
Index(['날짜', '지점', '평균기온(℃)', '최저기온(℃)', '최고기온(℃)'], dtype='object')
Column names containing symbols are awkward to work with, so we rename them.
In [147]:
col = ['date', 'location', 'average temp', 'lowest temp', 'highest temp']
seoul.columns = col
daegu.columns = col
Alternative: use `rename`
seoul.rename({'날짜' : 'date',
'지점' : 'location',
'평균기온(℃)' : 'average temp',
'최저기온(℃)' : 'lowest temp',
'최고기온(℃)' : 'highest temp'},
axis=1, inplace=True)
daegu.rename({'날짜' : 'date',
'지점' : 'location',
'평균기온(℃)' : 'average temp',
'최저기온(℃)' : 'lowest temp',
'최고기온(℃)' : 'highest temp'},
axis=1, inplace=True)
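Since both frames share the same original headers, the mapping can be defined once and reused for each `rename` call instead of repeating the dict. A minimal sketch on a toy frame (the toy values are illustrative):

```python
import pandas as pd

# One mapping, reused for every frame (keys are the original Korean headers).
col_map = {'날짜': 'date',
           '지점': 'location',
           '평균기온(℃)': 'average temp',
           '최저기온(℃)': 'lowest temp',
           '최고기온(℃)': 'highest temp'}

# Toy frame with the original headers, standing in for seoul/daegu.
df = pd.DataFrame({'날짜': ['2022-03-20'], '지점': [108], '평균기온(℃)': [5.4],
                   '최저기온(℃)': [2.2], '최고기온(℃)': [9.3]})
df = df.rename(columns=col_map)
print(df.columns.tolist())
# → ['date', 'location', 'average temp', 'lowest temp', 'highest temp']
```

This also removes the chance of the two frames drifting apart if one dict is later edited and the other is not.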
In [148]:
seoul
Out[148]:
date | location | average temp | lowest temp | highest temp | |
---|---|---|---|---|---|
0 | 1907-10-01 | 108 | 13.5 | 7.9 | 20.7 |
1 | 1907-10-02 | 108 | 16.2 | 7.9 | 22.0 |
2 | 1907-10-03 | 108 | 16.2 | 13.1 | 21.3 |
3 | 1907-10-04 | 108 | 16.5 | 11.2 | 22.0 |
4 | 1907-10-05 | 108 | 17.6 | 10.9 | 25.4 |
... | ... | ... | ... | ... | ... |
41374 | 2022-03-16 | 108 | 10.3 | 3.6 | 16.8 |
41375 | 2022-03-17 | 108 | 10.7 | 7.8 | 15.1 |
41376 | 2022-03-18 | 108 | 7.5 | 5.5 | 9.6 |
41377 | 2022-03-19 | 108 | 3.4 | 0.9 | 6.1 |
41378 | 2022-03-20 | 108 | 5.4 | 2.2 | 9.3 |
41379 rows × 5 columns
In [149]:
daegu
Out[149]:
date | location | average temp | lowest temp | highest temp | |
---|---|---|---|---|---|
0 | 1907-01-31 | 143 | NaN | -7.0 | 0.8 |
1 | 1907-02-01 | 143 | NaN | NaN | NaN |
2 | 1907-02-02 | 143 | NaN | NaN | NaN |
3 | 1907-02-03 | 143 | NaN | NaN | NaN |
4 | 1907-02-04 | 143 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
42048 | 2022-03-16 | 143 | 14.4 | 5.1 | 23.0 |
42049 | 2022-03-17 | 143 | 8.8 | 6.3 | 13.6 |
42050 | 2022-03-18 | 143 | 6.1 | 4.3 | 8.1 |
42051 | 2022-03-19 | 143 | 4.9 | 3.0 | 8.1 |
42052 | 2022-03-20 | 143 | 6.6 | 2.0 | 11.0 |
42053 rows × 5 columns
1. Since weather observations began, on what day did Seoul record its highest temperature, and how hot was it?
In [150]:
hottest_day = seoul[seoul['highest temp']==seoul['highest temp'].max()][['date','highest temp']]
hottest_day
Out[150]:
date | highest temp | |
---|---|---|
40051 | 2018-08-01 | 39.6 |
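The same lookup can be written with `Series.idxmax`, which returns the row label of the maximum directly, so a single `.loc` suffices. A sketch on toy data:

```python
import pandas as pd

# Toy frame standing in for the Seoul data.
df = pd.DataFrame({'date': ['2018-07-31', '2018-08-01', '2018-08-02'],
                   'highest temp': [38.3, 39.6, 37.9]})

# idxmax() gives the index label of the maximum value.
hottest = df.loc[df['highest temp'].idxmax(), ['date', 'highest temp']]
print(hottest['date'], hottest['highest temp'])  # → 2018-08-01 39.6
```

Unlike the boolean-mask version, `idxmax` returns only the first maximum if there are ties, which is usually what a "which day" question wants.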
2. Historically, on what date was the daily temperature range (max minus min) the largest?
In [151]:
seoul['Gap'] = seoul['highest temp'] - seoul['lowest temp']
highest_gap = seoul[seoul['Gap']==seoul['Gap'].max()][['date','highest temp']] # the date is the answer; 'highest temp' is shown for context
highest_gap
Out[151]:
date | highest temp | |
---|---|---|
12619 | 1942-04-19 | 24.3 |
3. Averaged over the year, which calendar date has the largest daily temperature range?
In [152]:
seoul.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41379 entries, 0 to 41378
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          41379 non-null  object 
 1   location      41379 non-null  int64  
 2   average temp  40623 non-null  float64
 3   lowest temp   40621 non-null  float64
 4   highest temp  40620 non-null  float64
 5   Gap           40620 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 1.9+ MB
Using `Series.str.slice`, we cut the month-day part out of `date` into a new `day` column.
In [153]:
seoul['day']=seoul['date'].str.slice(start=5)
display(seoul)
date | location | average temp | lowest temp | highest temp | Gap | day | |
---|---|---|---|---|---|---|---|
0 | 1907-10-01 | 108 | 13.5 | 7.9 | 20.7 | 12.8 | 10-01 |
1 | 1907-10-02 | 108 | 16.2 | 7.9 | 22.0 | 14.1 | 10-02 |
2 | 1907-10-03 | 108 | 16.2 | 13.1 | 21.3 | 8.2 | 10-03 |
3 | 1907-10-04 | 108 | 16.5 | 11.2 | 22.0 | 10.8 | 10-04 |
4 | 1907-10-05 | 108 | 17.6 | 10.9 | 25.4 | 14.5 | 10-05 |
... | ... | ... | ... | ... | ... | ... | ... |
41374 | 2022-03-16 | 108 | 10.3 | 3.6 | 16.8 | 13.2 | 03-16 |
41375 | 2022-03-17 | 108 | 10.7 | 7.8 | 15.1 | 7.3 | 03-17 |
41376 | 2022-03-18 | 108 | 7.5 | 5.5 | 9.6 | 4.1 | 03-18 |
41377 | 2022-03-19 | 108 | 3.4 | 0.9 | 6.1 | 5.2 | 03-19 |
41378 | 2022-03-20 | 108 | 5.4 | 2.2 | 9.3 | 7.1 | 03-20 |
41379 rows × 7 columns
In [154]:
ave_gap = seoul.groupby('day')['Gap'].mean()
ave_gap
Out[154]:
day
01-01    7.964286
01-02    7.771429
01-03    8.366071
01-04    8.597321
01-05    8.419643
           ...   
12-27    7.772321
12-28    7.711607
12-29    8.321429
12-30    8.402679
12-31    7.958036
Name: Gap, Length: 366, dtype: float64
In [155]:
max_ave_gap = ave_gap[ave_gap==ave_gap.max()]
max_ave_gap
Out[155]:
day
04-29    11.928829
Name: Gap, dtype: float64
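String slicing works here because the dates are zero-padded ISO strings. An equivalent approach parses the dates once and formats month-day with the `.dt` accessor, which does not depend on the string layout. A sketch on toy data:

```python
import pandas as pd

# Toy data: two years of the same calendar date plus one other date.
df = pd.DataFrame({'date': ['2020-04-29', '2021-04-29', '2021-05-01'],
                   'Gap': [11.0, 13.0, 8.0]})

# Parse once, then format "MM-DD" with the datetime accessor.
df['day'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')
ave_gap = df.groupby('day')['Gap'].mean()
print(ave_gap.idxmax(), ave_gap.max())  # → 04-29 12.0
```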
4. In which year did Seoul most often have a hotter day than Daegu, the city known as Korea's hottest?
In [156]:
daegu_droped = daegu[['date', 'highest temp']]
daegu_droped
Out[156]:
date | highest temp | |
---|---|---|
0 | 1907-01-31 | 0.8 |
1 | 1907-02-01 | NaN |
2 | 1907-02-02 | NaN |
3 | 1907-02-03 | NaN |
4 | 1907-02-04 | NaN |
... | ... | ... |
42048 | 2022-03-16 | 23.0 |
42049 | 2022-03-17 | 13.6 |
42050 | 2022-03-18 | 8.1 |
42051 | 2022-03-19 | 8.1 |
42052 | 2022-03-20 | 11.0 |
42053 rows × 2 columns
In [157]:
seoul_droped = seoul[['date', 'highest temp']]
seoul_droped
Out[157]:
date | highest temp | |
---|---|---|
0 | 1907-10-01 | 20.7 |
1 | 1907-10-02 | 22.0 |
2 | 1907-10-03 | 21.3 |
3 | 1907-10-04 | 22.0 |
4 | 1907-10-05 | 25.4 |
... | ... | ... |
41374 | 2022-03-16 | 16.8 |
41375 | 2022-03-17 | 15.1 |
41376 | 2022-03-18 | 9.6 |
41377 | 2022-03-19 | 6.1 |
41378 | 2022-03-20 | 9.3 |
41379 rows × 2 columns
In [158]:
daegu_droped.columns = ['date', 'daegu highest temp']
seoul_droped.columns = ['date', 'seoul highest temp']
display(daegu_droped)
display(seoul_droped)
date | daegu highest temp | |
---|---|---|
0 | 1907-01-31 | 0.8 |
1 | 1907-02-01 | NaN |
2 | 1907-02-02 | NaN |
3 | 1907-02-03 | NaN |
4 | 1907-02-04 | NaN |
... | ... | ... |
42048 | 2022-03-16 | 23.0 |
42049 | 2022-03-17 | 13.6 |
42050 | 2022-03-18 | 8.1 |
42051 | 2022-03-19 | 8.1 |
42052 | 2022-03-20 | 11.0 |
42053 rows × 2 columns
date | seoul highest temp | |
---|---|---|
0 | 1907-10-01 | 20.7 |
1 | 1907-10-02 | 22.0 |
2 | 1907-10-03 | 21.3 |
3 | 1907-10-04 | 22.0 |
4 | 1907-10-05 | 25.4 |
... | ... | ... |
41374 | 2022-03-16 | 16.8 |
41375 | 2022-03-17 | 15.1 |
41376 | 2022-03-18 | 9.6 |
41377 | 2022-03-19 | 6.1 |
41378 | 2022-03-20 | 9.3 |
41379 rows × 2 columns
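The select-then-relabel step can also be chained into one expression with `rename`, which avoids mutating the intermediate frame via `.columns =`. A sketch on toy data standing in for the Seoul frame:

```python
import pandas as pd

seoul = pd.DataFrame({'date': ['2022-03-20'], 'location': [108],
                      'highest temp': [9.3]})

# Select the columns and relabel in a single chain.
seoul_droped = (seoul[['date', 'highest temp']]
                .rename(columns={'highest temp': 'seoul highest temp'}))
print(seoul_droped.columns.tolist())  # → ['date', 'seoul highest temp']
```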
In [159]:
temp_df = pd.merge(daegu_droped, seoul_droped,
on='date',
how='inner')
temp_df
Out[159]:
date | daegu highest temp | seoul highest temp | |
---|---|---|---|
0 | 1907-10-01 | NaN | 20.7 |
1 | 1907-10-02 | NaN | 22.0 |
2 | 1907-10-03 | NaN | 21.3 |
3 | 1907-10-04 | NaN | 22.0 |
4 | 1907-10-05 | NaN | 25.4 |
... | ... | ... | ... |
41374 | 2022-03-16 | 23.0 | 16.8 |
41375 | 2022-03-17 | 13.6 | 15.1 |
41376 | 2022-03-18 | 8.1 | 9.6 |
41377 | 2022-03-19 | 8.1 | 6.1 |
41378 | 2022-03-20 | 11.0 | 9.3 |
41379 rows × 3 columns
In [160]:
print('Number of null values')
temp_df.isnull().sum() # NaN values cannot be compared, so count them first.
Number of null values
Out[160]:
date                    0
daegu highest temp    413
seoul highest temp    759
dtype: int64
In [161]:
temp_df.dropna(axis=0, inplace=True) # drop rows containing NaN
In [162]:
temp_df['comparison'] = temp_df['daegu highest temp']<temp_df['seoul highest temp']
In [163]:
temp_df['year'] = temp_df['date'].str.slice(stop=4)
In [164]:
temp_df
Out[164]:
date | daegu highest temp | seoul highest temp | comparison | year | |
---|---|---|---|---|---|
10 | 1907-10-11 | 27.0 | 20.4 | False | 1907 |
16 | 1907-10-17 | 22.4 | 21.6 | False | 1907 |
19 | 1907-10-20 | 20.0 | 22.7 | True | 1907 |
22 | 1907-10-23 | 21.8 | 16.3 | False | 1907 |
31 | 1907-11-01 | 23.6 | 21.3 | False | 1907 |
... | ... | ... | ... | ... | ... |
41374 | 2022-03-16 | 23.0 | 16.8 | False | 2022 |
41375 | 2022-03-17 | 13.6 | 15.1 | True | 2022 |
41376 | 2022-03-18 | 8.1 | 9.6 | True | 2022 |
41377 | 2022-03-19 | 8.1 | 6.1 | False | 2022 |
41378 | 2022-03-20 | 11.0 | 9.3 | False | 2022 |
40209 rows × 5 columns
In [165]:
df = temp_df.groupby('year')['comparison'].sum()
df
Out[165]:
year
1907     1
1908     7
1909    52
1910    81
1911    71
        ..
2018    98
2019    94
2020    86
2021    95
2022     3
Name: comparison, Length: 114, dtype: int64
In [171]:
df[df==df.max()].index[0]
Out[171]:
'1930'
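`df[df==df.max()].index[0]` works, but `Series.idxmax` expresses the same lookup directly. (The groupby sum above relies on booleans counting as 1/0, so each year's sum is the number of `True` days.) A toy sketch:

```python
import pandas as pd

# Toy per-year counts standing in for the groupby result.
counts = pd.Series({'1929': 90, '1930': 112, '1931': 88}, name='comparison')

# idxmax() returns the index label of the maximum value.
print(counts.idxmax())  # → 1930
```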
In [172]:
climate = pd.read_csv('./data/climate data.csv')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [172], in <cell line: 1>()
----> 1 climate = pd.read_csv('./data/climate data.csv')
    ...
File C:\ProgramData\Anaconda3\envs\machine\lib\site-packages\pandas\_libs\parsers.pyx:1952, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
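Byte `0xb1` at position 0 means the file is not UTF-8. Korean CSVs exported from Windows tools are commonly CP949/EUC-KR encoded, so passing `encoding='cp949'` (or `'euc-kr'`) to `read_csv` usually resolves this. A self-contained sketch that round-trips a CP949 file (the file path and contents are illustrative):

```python
import os
import tempfile

import pandas as pd

# Write a small CSV in CP949, the legacy Korean Windows encoding.
path = os.path.join(tempfile.mkdtemp(), 'climate.csv')
with open(path, 'w', encoding='cp949') as f:
    f.write('날짜,기온\n2022-03-21,12.3\n')

# Reading this as UTF-8 would raise UnicodeDecodeError; cp949 decodes it.
climate = pd.read_csv(path, encoding='cp949')
print(climate.columns.tolist())  # → ['날짜', '기온']
```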