Splitting dates in Python - Working with time series data 1 [python/pandas]

leon_choi 2022. 8. 11. 18:55

In the previous post I introduced Timestamp and Period arrays. Building on that, this post splits date data into its components, using stock trading data as the example.

import pandas as pd

df = pd.read_csv('stock-data.csv')
print(df.head(),'\n')
print(df.info(),'\n')

         Date  Close  Start   High    Low  Volume
0  2018-07-02  10100  10850  10900  10000  137977
1  2018-06-29  10700  10550  10900   9990  170253
2  2018-06-28  10400  10900  10950  10150  155769
3  2018-06-27  10900  10800  11050  10500  133548
4  2018-06-26  10800  10900  11000  10700   63039 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Date    20 non-null     object
 1   Close   20 non-null     int64
 2   Start   20 non-null     int64
 3   High    20 non-null     int64
 4   Low     20 non-null     int64
 5   Volume  20 non-null     int64
dtypes: int64(5), object(1)
memory usage: 1.1+ KB
None
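
The Dtype column above lists Date as object, meaning plain strings. As a minimal sketch of what that implies, the snippet below uses a small hand-typed Series (a hypothetical stand-in for the Date column, not the actual CSV) to show that the dt accessor is unavailable on strings and becomes usable once the values are converted with pd.to_datetime.

```python
import pandas as pd

# Hypothetical stand-in for the Date column of stock-data.csv
dates = pd.Series(['2018-07-02', '2018-06-29', '2018-06-28'])

# On a string column, the .dt accessor raises AttributeError
try:
    dates.dt.year
    dt_available = True
except AttributeError:
    dt_available = False

# After conversion, datetime-specific attributes become accessible
converted = pd.to_datetime(dates)
years = converted.dt.year.tolist()

print(dt_available)     # whether .dt worked on the raw strings
print(converted.dtype)  # dtype after conversion
print(years)
```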

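That gives us a first look at the stock trading data. Notice that the Date column, the one we want to work with today, is stored as strings (dtype object). A string column gets in the way of analysis because pandas cannot treat the values as dates. Converting the Date column to a datetime type makes period-based analysis (by year, month, or day), which is a basic part of data analysis, much easier. So let's convert the Date column and then use the dt accessor to split out the year, month, and day.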

# Convert the Date column to a datetime type
df['new_date'] = pd.to_datetime(df['Date'])
print(df.head(),'\n')
print(df.info(),'\n')

# Use the dt accessor to split new_date into year, month, and day
df['year'] = df['new_date'].dt.year
df['month'] = df['new_date'].dt.month
df['day'] = df['new_date'].dt.day
print(df.head(),'\n')

         Date  Close  Start   High    Low  Volume   new_date
0  2018-07-02  10100  10850  10900  10000  137977 2018-07-02
1  2018-06-29  10700  10550  10900   9990  170253 2018-06-29
2  2018-06-28  10400  10900  10950  10150  155769 2018-06-28
3  2018-06-27  10900  10800  11050  10500  133548 2018-06-27
4  2018-06-26  10800  10900  11000  10700   63039 2018-06-26 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      20 non-null     object
 1   Close     20 non-null     int64
 2   Start     20 non-null     int64
 3   High      20 non-null     int64
 4   Low       20 non-null     int64
 5   Volume    20 non-null     int64
 6   new_date  20 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 1.2+ KB
None

         Date  Close  Start   High    Low  Volume   new_date  year  month  day
0  2018-07-02  10100  10850  10900  10000  137977 2018-07-02  2018      7    2
1  2018-06-29  10700  10550  10900   9990  170253 2018-06-29  2018      6   29
2  2018-06-28  10400  10900  10950  10150  155769 2018-06-28  2018      6   28
3  2018-06-27  10900  10800  11050  10500  133548 2018-06-27  2018      6   27
4  2018-06-26  10800  10900  11000  10700   63039 2018-06-26  2018      6   26
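
Once year and month exist as separate columns, the period-based analysis mentioned earlier becomes a plain groupby. The following is a sketch only: it rebuilds the five rows shown above by hand so it runs without the CSV, and the result column names mean_close and total_volume are made up for illustration.

```python
import pandas as pd

# The five rows printed above, typed in by hand so the sketch runs without the CSV
df = pd.DataFrame({
    'Date': ['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27', '2018-06-26'],
    'Close': [10100, 10700, 10400, 10900, 10800],
    'Volume': [137977, 170253, 155769, 133548, 63039],
})
df['new_date'] = pd.to_datetime(df['Date'])
df['year'] = df['new_date'].dt.year
df['month'] = df['new_date'].dt.month

# Average closing price and total volume per (year, month)
# mean_close and total_volume are illustrative names, not from the original post
monthly = df.groupby(['year', 'month']).agg(
    mean_close=('Close', 'mean'),
    total_volume=('Volume', 'sum'),
)
print(monthly)
```

With only five rows the numbers are not meaningful as analysis; the point is that the grouping keys come straight from the dt accessor.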

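In the output, the new_date column, converted from Date, now has the datetime64[ns] dtype, and the dt accessor has split the year, month, and day into their own columns. Next, let's convert the timestamps to periods. This is done by applying the .to_period() method through the dt accessor. Let's look at the code.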

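# Convert the timestamps to periods (yearly, monthly, daily)
# Note: 'A' is the annual alias; pandas 2.2 and later deprecate it in favor of 'Y'

df['date_yr']= df['new_date'].dt.to_period(freq='A')
df['date_m']= df['new_date'].dt.to_period(freq='M')
df['date_d']= df['new_date'].dt.to_period(freq='D')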
print(df.head(),'\n')

         Date  Close  Start   High    Low  Volume   new_date  year  month  day date_yr   date_m      date_d
0  2018-07-02  10100  10850  10900  10000  137977 2018-07-02  2018      7    2    2018  2018-07  2018-07-02
1  2018-06-29  10700  10550  10900   9990  170253 2018-06-29  2018      6   29    2018  2018-06  2018-06-29
2  2018-06-28  10400  10900  10950  10150  155769 2018-06-28  2018      6   28    2018  2018-06  2018-06-28
3  2018-06-27  10900  10800  11050  10500  133548 2018-06-27  2018      6   27    2018  2018-06  2018-06-27
4  2018-06-26  10800  10900  11000  10700   63039 2018-06-26  2018      6   26    2018  2018-06  2018-06-26
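
Although date_yr, date_m, and date_d look like shortened date strings, each value is a Period that stands for a whole span of time. As a rough sketch on two hand-typed dates (not the CSV), a monthly period exposes the start and end of the month it covers:

```python
import pandas as pd

# Two hand-typed timestamps standing in for the new_date column
ts = pd.Series(pd.to_datetime(['2018-07-02', '2018-06-29']))
months = ts.dt.to_period(freq='M')

# Each element is a Period covering the entire month
first = months.iloc[0]
print(months.dtype)      # dtype of the converted column
print(first)             # the period that 2018-07-02 falls in
print(first.start_time)  # first instant of that month
print(first.end_time)    # last instant of that month
```

This is the practical difference from the integer year, month, and day columns: a Period keeps its frequency, so values from different years do not collapse into the same month number.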

So the dt.to_period() method is another way to split dates. In the next post, we'll look at how to use dates split this way as an index.