Pandas tutorial 1

데이터 전처리 프로세스

  • 중복값 제거 및 결측치 처리
    • 완전 무작위(MCAR) / 무작위(MAR) / 비무작위(NMAR)
    • 제거, 치환, 모델 기반 처리 등
  • 이상치 탐지 및 처리
    • 삭제, 대체, 변환(스케일링) 등
  • Feature Engineering
    • 정규화, 표준화, 로그변환, 벡터화 등
    • PCA, EFA 등을 통한 차원 축소

Pandas 라이브러리

Pandas의 기본 자료형

  • Series 객체, DataFrame 객체
    • Index: 숫자 또는 문자, 중복X
    • Series: Index & Column 1개
    • DataFrame: Index & Column 2개 이상
  • 각 객체에 따라 사용 가능한 method가 상이함

Pandas 라이브러리 불러오기

1
2
import pandas as pd
print(pd.__version__)
1.3.5

테스트

1
2
3
4
5
# DataFrame 객체
temp_dic = {'col1' : [1, 2, 3], 'col2' : [4, 5, 6]}
df = pd.DataFrame(temp_dic)
print(df)
print(type(df))
   col1  col2
0     1     4
1     2     5
2     3     6
<class 'pandas.core.frame.DataFrame'>
1
2
3
4
5
# Series 객체
temp_dic = {'a':1, 'b':2, 'c':3}
ser = pd.Series(temp_dic)
print(ser)
print(type(ser))
a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>

구글 드라이브 연동

1
2
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

데이터 불러오기

1
2
3
DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/Data/Lemonade2016.csv'
juice = pd.read_csv(DATA_PATH)
print(juice)
         Date Location  Lemon  Orange  Temperature  Leaflets  Price
0    7/1/2016     Park     97      67           70      90.0   0.25
1    7/2/2016     Park     98      67           72      90.0   0.25
2    7/3/2016     Park    110      77           71     104.0   0.25
3    7/4/2016    Beach    134      99           76      98.0   0.25
4    7/5/2016    Beach    159     118           78     135.0   0.25
5    7/6/2016    Beach    103      69           82      90.0   0.25
6    7/6/2016    Beach    103      69           82      90.0   0.25
7    7/7/2016    Beach    143     101           81     135.0   0.25
8         NaN    Beach    123      86           82     113.0   0.25
9    7/9/2016    Beach    134      95           80     126.0   0.25
10  7/10/2016    Beach    140      98           82     131.0   0.25
11  7/11/2016    Beach    162     120           83     135.0   0.25
12  7/12/2016    Beach    130      95           84      99.0   0.25
13  7/13/2016    Beach    109      75           77      99.0   0.25
14  7/14/2016    Beach    122      85           78     113.0   0.25
15  7/15/2016    Beach     98      62           75     108.0   0.50
16  7/16/2016    Beach     81      50           74      90.0   0.50
17  7/17/2016    Beach    115      76           77     126.0   0.50
18  7/18/2016     Park    131      92           81     122.0   0.50
19  7/19/2016     Park    122      85           78     113.0   0.50
20  7/20/2016     Park     71      42           70       NaN   0.50
21  7/21/2016     Park     83      50           77      90.0   0.50
22  7/22/2016     Park    112      75           80     108.0   0.50
23  7/23/2016     Park    120      82           81     117.0   0.50
24  7/24/2016     Park    121      82           82     117.0   0.50
25  7/25/2016     Park    156     113           84     135.0   0.50
26  7/26/2016     Park    176     129           83     158.0   0.35
27  7/27/2016     Park    104      68           80      99.0   0.35
28  7/28/2016     Park     96      63           82      90.0   0.35
29  7/29/2016     Park    100      66           81      95.0   0.35
30  7/30/2016    Beach     88      57           82      81.0   0.35
31  7/31/2016    Beach     76      47           82      68.0   0.35
1
2
# 전체적인 구조, 결측치 개수, 데이터 타입 파악
juice.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         31 non-null     object 
 1   Location     32 non-null     object 
 2   Lemon        32 non-null     int64  
 3   Orange       32 non-null     int64  
 4   Temperature  32 non-null     int64  
 5   Leaflets     31 non-null     float64
 6   Price        32 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.9+ KB
1
2
3
print(juice.head())
print("-------------------------------------------------------------------")
print(juice.tail())
       Date Location  Lemon  Orange  Temperature  Leaflets  Price
0  7/1/2016     Park     97      67           70      90.0   0.25
1  7/2/2016     Park     98      67           72      90.0   0.25
2  7/3/2016     Park    110      77           71     104.0   0.25
3  7/4/2016    Beach    134      99           76      98.0   0.25
4  7/5/2016    Beach    159     118           78     135.0   0.25
-------------------------------------------------------------------
         Date Location  Lemon  Orange  Temperature  Leaflets  Price
27  7/27/2016     Park    104      68           80      99.0   0.35
28  7/28/2016     Park     96      63           82      90.0   0.35
29  7/29/2016     Park    100      66           81      95.0   0.35
30  7/30/2016    Beach     88      57           82      81.0   0.35
31  7/31/2016    Beach     76      47           82      68.0   0.35
  • describe() : 기술통계량 확인 (int형, float형 변수)
  • value_counts() : 범주형 변수 빈도 수 확인
1
2
print(juice.describe())
print("** type(juice.describe()) :", type(juice.describe())) # DataFrame 객체로 반환
            Lemon      Orange  Temperature    Leaflets      Price
count   32.000000   32.000000    32.000000   31.000000  32.000000
mean   116.156250   80.000000    78.968750  108.548387   0.354687
std     25.823357   21.863211     4.067847   20.117718   0.113137
min     71.000000   42.000000    70.000000   68.000000   0.250000
25%     98.000000   66.750000    77.000000   90.000000   0.250000
50%    113.500000   76.500000    80.500000  108.000000   0.350000
75%    131.750000   95.000000    82.000000  124.000000   0.500000
max    176.000000  129.000000    84.000000  158.000000   0.500000
** type(juice.describe()) : <class 'pandas.core.frame.DataFrame'>
1
2
print(juice['Location'].value_counts())
print("** type(juice.describe()) :", type(juice['Location'].value_counts())) # Series 객체로 반환
Beach    17
Park     15
Name: Location, dtype: int64
** type(juice.describe()) : <class 'pandas.core.series.Series'>

데이터 다루기

1
2
juice['Sold'] = 0
juice.head()

Date Location Lemon Orange Temperature Leaflets Price Sold
0 7/1/2016 Park 97 67 70 90.0 0.25 0
1 7/2/2016 Park 98 67 72 90.0 0.25 0
2 7/3/2016 Park 110 77 71 104.0 0.25 0
3 7/4/2016 Beach 134 99 76 98.0 0.25 0
4 7/5/2016 Beach 159 118 78 135.0 0.25 0

  <script>
    const buttonEl =
      document.querySelector('#df-d7644e91-b826-4359-8387-1580b5266658 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-d7644e91-b826-4359-8387-1580b5266658');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
juice['Sold'] = juice['Lemon'] + juice['Orange']
juice.head()

Date Location Lemon Orange Temperature Leaflets Price Sold
0 7/1/2016 Park 97 67 70 90.0 0.25 164
1 7/2/2016 Park 98 67 72 90.0 0.25 165
2 7/3/2016 Park 110 77 71 104.0 0.25 187
3 7/4/2016 Beach 134 99 76 98.0 0.25 233
4 7/5/2016 Beach 159 118 78 135.0 0.25 277

  <script>
    const buttonEl =
      document.querySelector('#df-2d21cec2-5539-4100-b9f5-ac71828a5d25 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-2d21cec2-5539-4100-b9f5-ac71828a5d25');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 매출액(Revenue) = 가격(Price) * 판매량(Sold)
1
2
juice['Revenue'] = juice['Price'] * juice['Sold']
juice.head()

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
0 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
1 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
2 7/3/2016 Park 110 77 71 104.0 0.25 187 46.75
3 7/4/2016 Beach 134 99 76 98.0 0.25 233 58.25
4 7/5/2016 Beach 159 118 78 135.0 0.25 277 69.25

  <script>
    const buttonEl =
      document.querySelector('#df-506a14b7-9d00-4793-9a3f-5c54252c47b5 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-506a14b7-9d00-4793-9a3f-5c54252c47b5');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 행 또는 열 제거 : drop(axis=0|1)
    • axis=0 : 행 방향(index) 실행
    • axis=1 : 열 방향(column) 실행
1
2
juice_col_drop = juice.drop('Sold', axis=1)
juice_col_drop.head()

Date Location Lemon Orange Temperature Leaflets Price Revenue
0 7/1/2016 Park 97 67 70 90.0 0.25 41.00
1 7/2/2016 Park 98 67 72 90.0 0.25 41.25
2 7/3/2016 Park 110 77 71 104.0 0.25 46.75
3 7/4/2016 Beach 134 99 76 98.0 0.25 58.25
4 7/5/2016 Beach 159 118 78 135.0 0.25 69.25

  <script>
    const buttonEl =
      document.querySelector('#df-4ee85edb-efbb-47d3-84c6-8b0cab09eb0f button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-4ee85edb-efbb-47d3-84c6-8b0cab09eb0f');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
juice_ind_drop = juice.drop(2, axis=0)
juice_ind_drop.head()

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
0 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
1 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
3 7/4/2016 Beach 134 99 76 98.0 0.25 233 58.25
4 7/5/2016 Beach 159 118 78 135.0 0.25 277 69.25
5 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00

  <script>
    const buttonEl =
      document.querySelector('#df-8329c7e5-dc41-41e9-aa8f-d31cdc115b69 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-8329c7e5-dc41-41e9-aa8f-d31cdc115b69');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

데이터 인덱싱

1
juice[5:10]

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
5 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
6 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
7 7/7/2016 Beach 143 101 81 135.0 0.25 244 61.00
8 NaN Beach 123 86 82 113.0 0.25 209 52.25
9 7/9/2016 Beach 134 95 80 126.0 0.25 229 57.25

  <script>
    const buttonEl =
      document.querySelector('#df-e5a46c70-33e1-4ae6-868e-0fbcf1b0c121 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-e5a46c70-33e1-4ae6-868e-0fbcf1b0c121');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

boolean 값 활용 (조건식)

1
2
# Location 값이 Park인 경우
juice[juice['Location'] == 'Park']

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
0 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
1 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
2 7/3/2016 Park 110 77 71 104.0 0.25 187 46.75
18 7/18/2016 Park 131 92 81 122.0 0.50 223 111.50
19 7/19/2016 Park 122 85 78 113.0 0.50 207 103.50
20 7/20/2016 Park 71 42 70 NaN 0.50 113 56.50
21 7/21/2016 Park 83 50 77 90.0 0.50 133 66.50
22 7/22/2016 Park 112 75 80 108.0 0.50 187 93.50
23 7/23/2016 Park 120 82 81 117.0 0.50 202 101.00
24 7/24/2016 Park 121 82 82 117.0 0.50 203 101.50
25 7/25/2016 Park 156 113 84 135.0 0.50 269 134.50
26 7/26/2016 Park 176 129 83 158.0 0.35 305 106.75
27 7/27/2016 Park 104 68 80 99.0 0.35 172 60.20
28 7/28/2016 Park 96 63 82 90.0 0.35 159 55.65
29 7/29/2016 Park 100 66 81 95.0 0.35 166 58.10

  <script>
    const buttonEl =
      document.querySelector('#df-6440b49e-9b42-4a3b-9edd-37f9b7d5b0a6 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-6440b49e-9b42-4a3b-9edd-37f9b7d5b0a6');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
# Leaflets 값이 120 이상인 경우
juice[juice['Leaflets'] >= 120]

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
4 7/5/2016 Beach 159 118 78 135.0 0.25 277 69.25
7 7/7/2016 Beach 143 101 81 135.0 0.25 244 61.00
9 7/9/2016 Beach 134 95 80 126.0 0.25 229 57.25
10 7/10/2016 Beach 140 98 82 131.0 0.25 238 59.50
11 7/11/2016 Beach 162 120 83 135.0 0.25 282 70.50
17 7/17/2016 Beach 115 76 77 126.0 0.50 191 95.50
18 7/18/2016 Park 131 92 81 122.0 0.50 223 111.50
25 7/25/2016 Park 156 113 84 135.0 0.50 269 134.50
26 7/26/2016 Park 176 129 83 158.0 0.35 305 106.75

  <script>
    const buttonEl =
      document.querySelector('#df-92086f96-d263-4c70-85ec-af04ffb445c1 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-92086f96-d263-4c70-85ec-af04ffb445c1');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

iloc vs loc

  • iloc : index 기반, 속도↑, 대용량 데이터에 적합
    • syntax : df.iloc[row_index, column_index]
  • loc : label or boolean(조건식) 기반, 가독성↑
    • syntax : df.loc[row_label, column_label]
1
2
%%time
juice.iloc[0:3, 0:2] # 해당 인덱스 미포함
CPU times: user 514 µs, sys: 40 µs, total: 554 µs
Wall time: 523 µs

Date Location
0 7/1/2016 Park
1 7/2/2016 Park
2 7/3/2016 Park

  <script>
    const buttonEl =
      document.querySelector('#df-219659e3-19a4-418f-84d6-7d01a8076e00 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-219659e3-19a4-418f-84d6-7d01a8076e00');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
%%time
juice.loc[0:2, ["Date",'Location']] # 해당 라벨명 포함
CPU times: user 1.67 ms, sys: 0 ns, total: 1.67 ms
Wall time: 5.19 ms

Date Location
0 7/1/2016 Park
1 7/2/2016 Park
2 7/3/2016 Park

  <script>
    const buttonEl =
      document.querySelector('#df-5e3530fd-7c31-4a6f-8a4f-dea8d6d42e78 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-5e3530fd-7c31-4a6f-8a4f-dea8d6d42e78');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
4
# juice.iloc[juice['Leaflets'] >= 130, ['Date', 'Location', 'Leaflets']]
# Error: iLocation based boolean indexing on an integer type is not available

juice.loc[juice['Leaflets'] >= 130, ['Date', 'Location', 'Leaflets']]

Date Location Leaflets
4 7/5/2016 Beach 135.0
7 7/7/2016 Beach 135.0
10 7/10/2016 Beach 131.0
11 7/11/2016 Beach 135.0
25 7/25/2016 Park 135.0
26 7/26/2016 Park 158.0

  <script>
    const buttonEl =
      document.querySelector('#df-80998d16-2836-479c-9ea3-75efcac52299 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-80998d16-2836-479c-9ea3-75efcac52299');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

데이터 정렬

  • sort_values() 함수
1
2
# Revenue 기준 오름차순
juice.sort_values(by=['Revenue']).head()

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
0 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
1 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
6 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
5 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
31 7/31/2016 Beach 76 47 82 68.0 0.35 123 43.05

  <script>
    const buttonEl =
      document.querySelector('#df-4d6ed323-d5df-47ed-893d-db57c2b39110 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-4d6ed323-d5df-47ed-893d-db57c2b39110');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
# Revenue 기준 내림차순
juice.sort_values(by=['Revenue'], ascending = False).head()

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
25 7/25/2016 Park 156 113 84 135.0 0.50 269 134.50
18 7/18/2016 Park 131 92 81 122.0 0.50 223 111.50
26 7/26/2016 Park 176 129 83 158.0 0.35 305 106.75
19 7/19/2016 Park 122 85 78 113.0 0.50 207 103.50
24 7/24/2016 Park 121 82 82 117.0 0.50 203 101.50

  <script>
    const buttonEl =
      document.querySelector('#df-ecf0c228-c2c7-4d62-aa08-e194d9fd1e4e button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-ecf0c228-c2c7-4d62-aa08-e194d9fd1e4e');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
# Price 기준 내림차순, Temperature 기준 오름차순
juice.sort_values(by=['Price', 'Temperature'], ascending = [False, True])

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
20 7/20/2016 Park 71 42 70 NaN 0.50 113 56.50
16 7/16/2016 Beach 81 50 74 90.0 0.50 131 65.50
15 7/15/2016 Beach 98 62 75 108.0 0.50 160 80.00
17 7/17/2016 Beach 115 76 77 126.0 0.50 191 95.50
21 7/21/2016 Park 83 50 77 90.0 0.50 133 66.50
19 7/19/2016 Park 122 85 78 113.0 0.50 207 103.50
22 7/22/2016 Park 112 75 80 108.0 0.50 187 93.50
18 7/18/2016 Park 131 92 81 122.0 0.50 223 111.50
23 7/23/2016 Park 120 82 81 117.0 0.50 202 101.00
24 7/24/2016 Park 121 82 82 117.0 0.50 203 101.50
25 7/25/2016 Park 156 113 84 135.0 0.50 269 134.50
27 7/27/2016 Park 104 68 80 99.0 0.35 172 60.20
29 7/29/2016 Park 100 66 81 95.0 0.35 166 58.10
28 7/28/2016 Park 96 63 82 90.0 0.35 159 55.65
30 7/30/2016 Beach 88 57 82 81.0 0.35 145 50.75
31 7/31/2016 Beach 76 47 82 68.0 0.35 123 43.05
26 7/26/2016 Park 176 129 83 158.0 0.35 305 106.75
0 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
2 7/3/2016 Park 110 77 71 104.0 0.25 187 46.75
1 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
3 7/4/2016 Beach 134 99 76 98.0 0.25 233 58.25
13 7/13/2016 Beach 109 75 77 99.0 0.25 184 46.00
4 7/5/2016 Beach 159 118 78 135.0 0.25 277 69.25
14 7/14/2016 Beach 122 85 78 113.0 0.25 207 51.75
9 7/9/2016 Beach 134 95 80 126.0 0.25 229 57.25
7 7/7/2016 Beach 143 101 81 135.0 0.25 244 61.00
5 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
6 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
8 NaN Beach 123 86 82 113.0 0.25 209 52.25
10 7/10/2016 Beach 140 98 82 131.0 0.25 238 59.50
11 7/11/2016 Beach 162 120 83 135.0 0.25 282 70.50
12 7/12/2016 Beach 130 95 84 99.0 0.25 225 56.25

  <script>
    const buttonEl =
      document.querySelector('#df-56042f49-547e-4908-99b7-0e79a9445a3d button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-56042f49-547e-4908-99b7-0e79a9445a3d');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
# index를 새로 지정해서 새로운 객체로 저장
juice2 = juice.sort_values(by=['Price', 'Temperature'], ascending = [False, True]).reset_index(drop=True)
juice2

Date Location Lemon Orange Temperature Leaflets Price Sold Revenue
0 7/20/2016 Park 71 42 70 NaN 0.50 113 56.50
1 7/16/2016 Beach 81 50 74 90.0 0.50 131 65.50
2 7/15/2016 Beach 98 62 75 108.0 0.50 160 80.00
3 7/17/2016 Beach 115 76 77 126.0 0.50 191 95.50
4 7/21/2016 Park 83 50 77 90.0 0.50 133 66.50
5 7/19/2016 Park 122 85 78 113.0 0.50 207 103.50
6 7/22/2016 Park 112 75 80 108.0 0.50 187 93.50
7 7/18/2016 Park 131 92 81 122.0 0.50 223 111.50
8 7/23/2016 Park 120 82 81 117.0 0.50 202 101.00
9 7/24/2016 Park 121 82 82 117.0 0.50 203 101.50
10 7/25/2016 Park 156 113 84 135.0 0.50 269 134.50
11 7/27/2016 Park 104 68 80 99.0 0.35 172 60.20
12 7/29/2016 Park 100 66 81 95.0 0.35 166 58.10
13 7/28/2016 Park 96 63 82 90.0 0.35 159 55.65
14 7/30/2016 Beach 88 57 82 81.0 0.35 145 50.75
15 7/31/2016 Beach 76 47 82 68.0 0.35 123 43.05
16 7/26/2016 Park 176 129 83 158.0 0.35 305 106.75
17 7/1/2016 Park 97 67 70 90.0 0.25 164 41.00
18 7/3/2016 Park 110 77 71 104.0 0.25 187 46.75
19 7/2/2016 Park 98 67 72 90.0 0.25 165 41.25
20 7/4/2016 Beach 134 99 76 98.0 0.25 233 58.25
21 7/13/2016 Beach 109 75 77 99.0 0.25 184 46.00
22 7/5/2016 Beach 159 118 78 135.0 0.25 277 69.25
23 7/14/2016 Beach 122 85 78 113.0 0.25 207 51.75
24 7/9/2016 Beach 134 95 80 126.0 0.25 229 57.25
25 7/7/2016 Beach 143 101 81 135.0 0.25 244 61.00
26 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
27 7/6/2016 Beach 103 69 82 90.0 0.25 172 43.00
28 NaN Beach 123 86 82 113.0 0.25 209 52.25
29 7/10/2016 Beach 140 98 82 131.0 0.25 238 59.50
30 7/11/2016 Beach 162 120 83 135.0 0.25 282 70.50
31 7/12/2016 Beach 130 95 84 99.0 0.25 225 56.25

  <script>
    const buttonEl =
      document.querySelector('#df-95b0ec19-306b-4312-a2fb-6c6bf238bb16 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-95b0ec19-306b-4312-a2fb-6c6bf238bb16');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

데이터 그룹화

  • groupby() 함수
  • 그룹별 집계함수를 통해 피벗테이블 생성
1
juice.groupby(by='Location').count()

Date Lemon Orange Temperature Leaflets Price Sold Revenue
Location
Beach 16 17 17 17 17 17 17 17
Park 15 15 15 15 14 15 15 15

  <script>
    const buttonEl =
      document.querySelector('#df-464f987e-371e-4805-8b50-ee52bdc321e8 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-464f987e-371e-4805-8b50-ee52bdc321e8');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
4
import numpy as np

# Location 그룹별 Lemon, Orange 변수에 대한 집계함수
juice.groupby(by='Location')[['Lemon','Orange']].agg([max, min, sum, np.mean])

Lemon Orange
max min sum mean max min sum mean
Location
Beach 162 76 2020 118.823529 120 47 1402 82.470588
Park 176 71 1697 113.133333 129 42 1158 77.200000

  <script>
    const buttonEl =
      document.querySelector('#df-1d771570-530b-4471-945e-e1db8f8b3457 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-1d771570-530b-4471-945e-e1db8f8b3457');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
Share