데이터 전처리 프로세스
- 중복값 제거 및 결측치 처리
- 완전 무작위(MCAR) / 무작위(MAR) / 비무작위(NMAR)
- 제거, 치환, 모델 기반 처리 등
- 이상치 탐지 및 처리
- Feature Engineering
- 정규화, 표준화, 로그변환, 벡터화 등
- PCA, EFA 등을 통한 차원 축소
Pandas 라이브러리
Pandas의 기본 자료형
- Series 객체, DataFrame 객체
- Index: 숫자 또는 문자, 중복X
- Series: Index & Column 1개
- DataFrame: Index & Column 2개 이상
- 각 객체에 따라 사용 가능한 method가 상이함
Pandas 라이브러리 불러오기
1 2
| import pandas as pd print(pd.__version__)
|
1.3.5
테스트
1 2 3 4 5
| temp_dic = {'col1' : [1, 2, 3], 'col2' : [4, 5, 6]} df = pd.DataFrame(temp_dic) print(df) print(type(df))
|
col1 col2
0 1 4
1 2 5
2 3 6
<class 'pandas.core.frame.DataFrame'>
1 2 3 4 5
| temp_dic = {'a':1, 'b':2, 'c':3} ser = pd.Series(temp_dic) print(ser) print(type(ser))
|
a 1
b 2
c 3
dtype: int64
<class 'pandas.core.series.Series'>
구글 드라이브 연동
1 2
| from google.colab import drive drive.mount('/content/drive')
|
Mounted at /content/drive
데이터 불러오기
1 2 3
| DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/Data/Lemonade2016.csv' juice = pd.read_csv(DATA_PATH) print(juice)
|
Date Location Lemon Orange Temperature Leaflets Price
0 7/1/2016 Park 97 67 70 90.0 0.25
1 7/2/2016 Park 98 67 72 90.0 0.25
2 7/3/2016 Park 110 77 71 104.0 0.25
3 7/4/2016 Beach 134 99 76 98.0 0.25
4 7/5/2016 Beach 159 118 78 135.0 0.25
5 7/6/2016 Beach 103 69 82 90.0 0.25
6 7/6/2016 Beach 103 69 82 90.0 0.25
7 7/7/2016 Beach 143 101 81 135.0 0.25
8 NaN Beach 123 86 82 113.0 0.25
9 7/9/2016 Beach 134 95 80 126.0 0.25
10 7/10/2016 Beach 140 98 82 131.0 0.25
11 7/11/2016 Beach 162 120 83 135.0 0.25
12 7/12/2016 Beach 130 95 84 99.0 0.25
13 7/13/2016 Beach 109 75 77 99.0 0.25
14 7/14/2016 Beach 122 85 78 113.0 0.25
15 7/15/2016 Beach 98 62 75 108.0 0.50
16 7/16/2016 Beach 81 50 74 90.0 0.50
17 7/17/2016 Beach 115 76 77 126.0 0.50
18 7/18/2016 Park 131 92 81 122.0 0.50
19 7/19/2016 Park 122 85 78 113.0 0.50
20 7/20/2016 Park 71 42 70 NaN 0.50
21 7/21/2016 Park 83 50 77 90.0 0.50
22 7/22/2016 Park 112 75 80 108.0 0.50
23 7/23/2016 Park 120 82 81 117.0 0.50
24 7/24/2016 Park 121 82 82 117.0 0.50
25 7/25/2016 Park 156 113 84 135.0 0.50
26 7/26/2016 Park 176 129 83 158.0 0.35
27 7/27/2016 Park 104 68 80 99.0 0.35
28 7/28/2016 Park 96 63 82 90.0 0.35
29 7/29/2016 Park 100 66 81 95.0 0.35
30 7/30/2016 Beach 88 57 82 81.0 0.35
31 7/31/2016 Beach 76 47 82 68.0 0.35
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 31 non-null object
1 Location 32 non-null object
2 Lemon 32 non-null int64
3 Orange 32 non-null int64
4 Temperature 32 non-null int64
5 Leaflets 31 non-null float64
6 Price 32 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.9+ KB
1 2 3
| print(juice.head()) print("-------------------------------------------------------------------") print(juice.tail())
|
Date Location Lemon Orange Temperature Leaflets Price
0 7/1/2016 Park 97 67 70 90.0 0.25
1 7/2/2016 Park 98 67 72 90.0 0.25
2 7/3/2016 Park 110 77 71 104.0 0.25
3 7/4/2016 Beach 134 99 76 98.0 0.25
4 7/5/2016 Beach 159 118 78 135.0 0.25
-------------------------------------------------------------------
Date Location Lemon Orange Temperature Leaflets Price
27 7/27/2016 Park 104 68 80 99.0 0.35
28 7/28/2016 Park 96 63 82 90.0 0.35
29 7/29/2016 Park 100 66 81 95.0 0.35
30 7/30/2016 Beach 88 57 82 81.0 0.35
31 7/31/2016 Beach 76 47 82 68.0 0.35
- describe() : 기술통계량 확인 (int형, float형 변수)
- value_counts() : 범주형 변수 빈도 수 확인
1 2
| print(juice.describe()) print("** type(juice.describe()) :", type(juice.describe()))
|
Lemon Orange Temperature Leaflets Price
count 32.000000 32.000000 32.000000 31.000000 32.000000
mean 116.156250 80.000000 78.968750 108.548387 0.354687
std 25.823357 21.863211 4.067847 20.117718 0.113137
min 71.000000 42.000000 70.000000 68.000000 0.250000
25% 98.000000 66.750000 77.000000 90.000000 0.250000
50% 113.500000 76.500000 80.500000 108.000000 0.350000
75% 131.750000 95.000000 82.000000 124.000000 0.500000
max 176.000000 129.000000 84.000000 158.000000 0.500000
** type(juice.describe()) : <class 'pandas.core.frame.DataFrame'>
1 2
| print(juice['Location'].value_counts()) print("** type(juice.describe()) :", type(juice['Location'].value_counts()))
|
Beach 17
Park 15
Name: Location, dtype: int64
** type(juice.describe()) : <class 'pandas.core.series.Series'>
데이터 다루기
1 2
| juice['Sold'] = 0 juice.head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
0 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
0 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
0 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
0 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
0 |
<script>
const buttonEl =
document.querySelector('#df-d7644e91-b826-4359-8387-1580b5266658 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-d7644e91-b826-4359-8387-1580b5266658');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| juice['Sold'] = juice['Lemon'] + juice['Orange'] juice.head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
187 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
233 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
<script>
const buttonEl =
document.querySelector('#df-2d21cec2-5539-4100-b9f5-ac71828a5d25 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-2d21cec2-5539-4100-b9f5-ac71828a5d25');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
- 매출액(Revenue) = 가격(Price) * 판매량(Sold)
1 2
| juice['Revenue'] = juice['Price'] * juice['Sold'] juice.head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
187 |
46.75 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
233 |
58.25 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
69.25 |
<script>
const buttonEl =
document.querySelector('#df-506a14b7-9d00-4793-9a3f-5c54252c47b5 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-506a14b7-9d00-4793-9a3f-5c54252c47b5');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
- 행 또는 열 제거 : drop(axis=0|1)
- axis=0 : 행 방향(index) 실행
- axis=1 : 열 방향(column) 실행
1 2
| juice_col_drop = juice.drop('Sold', axis=1) juice_col_drop.head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Revenue |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
41.00 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
41.25 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
46.75 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
58.25 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
69.25 |
<script>
const buttonEl =
document.querySelector('#df-4ee85edb-efbb-47d3-84c6-8b0cab09eb0f button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-4ee85edb-efbb-47d3-84c6-8b0cab09eb0f');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| juice_ind_drop = juice.drop(2, axis=0) juice_ind_drop.head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
233 |
58.25 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
69.25 |
5 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
<script>
const buttonEl =
document.querySelector('#df-8329c7e5-dc41-41e9-aa8f-d31cdc115b69 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-8329c7e5-dc41-41e9-aa8f-d31cdc115b69');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
데이터 인덱싱
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
5 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
6 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
7 |
7/7/2016 |
Beach |
143 |
101 |
81 |
135.0 |
0.25 |
244 |
61.00 |
8 |
NaN |
Beach |
123 |
86 |
82 |
113.0 |
0.25 |
209 |
52.25 |
9 |
7/9/2016 |
Beach |
134 |
95 |
80 |
126.0 |
0.25 |
229 |
57.25 |
<script>
const buttonEl =
document.querySelector('#df-e5a46c70-33e1-4ae6-868e-0fbcf1b0c121 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-e5a46c70-33e1-4ae6-868e-0fbcf1b0c121');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
boolean 값 활용 (조건식)
1 2
| juice[juice['Location'] == 'Park']
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
187 |
46.75 |
18 |
7/18/2016 |
Park |
131 |
92 |
81 |
122.0 |
0.50 |
223 |
111.50 |
19 |
7/19/2016 |
Park |
122 |
85 |
78 |
113.0 |
0.50 |
207 |
103.50 |
20 |
7/20/2016 |
Park |
71 |
42 |
70 |
NaN |
0.50 |
113 |
56.50 |
21 |
7/21/2016 |
Park |
83 |
50 |
77 |
90.0 |
0.50 |
133 |
66.50 |
22 |
7/22/2016 |
Park |
112 |
75 |
80 |
108.0 |
0.50 |
187 |
93.50 |
23 |
7/23/2016 |
Park |
120 |
82 |
81 |
117.0 |
0.50 |
202 |
101.00 |
24 |
7/24/2016 |
Park |
121 |
82 |
82 |
117.0 |
0.50 |
203 |
101.50 |
25 |
7/25/2016 |
Park |
156 |
113 |
84 |
135.0 |
0.50 |
269 |
134.50 |
26 |
7/26/2016 |
Park |
176 |
129 |
83 |
158.0 |
0.35 |
305 |
106.75 |
27 |
7/27/2016 |
Park |
104 |
68 |
80 |
99.0 |
0.35 |
172 |
60.20 |
28 |
7/28/2016 |
Park |
96 |
63 |
82 |
90.0 |
0.35 |
159 |
55.65 |
29 |
7/29/2016 |
Park |
100 |
66 |
81 |
95.0 |
0.35 |
166 |
58.10 |
<script>
const buttonEl =
document.querySelector('#df-6440b49e-9b42-4a3b-9edd-37f9b7d5b0a6 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-6440b49e-9b42-4a3b-9edd-37f9b7d5b0a6');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| juice[juice['Leaflets'] >= 120]
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
69.25 |
7 |
7/7/2016 |
Beach |
143 |
101 |
81 |
135.0 |
0.25 |
244 |
61.00 |
9 |
7/9/2016 |
Beach |
134 |
95 |
80 |
126.0 |
0.25 |
229 |
57.25 |
10 |
7/10/2016 |
Beach |
140 |
98 |
82 |
131.0 |
0.25 |
238 |
59.50 |
11 |
7/11/2016 |
Beach |
162 |
120 |
83 |
135.0 |
0.25 |
282 |
70.50 |
17 |
7/17/2016 |
Beach |
115 |
76 |
77 |
126.0 |
0.50 |
191 |
95.50 |
18 |
7/18/2016 |
Park |
131 |
92 |
81 |
122.0 |
0.50 |
223 |
111.50 |
25 |
7/25/2016 |
Park |
156 |
113 |
84 |
135.0 |
0.50 |
269 |
134.50 |
26 |
7/26/2016 |
Park |
176 |
129 |
83 |
158.0 |
0.35 |
305 |
106.75 |
<script>
const buttonEl =
document.querySelector('#df-92086f96-d263-4c70-85ec-af04ffb445c1 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-92086f96-d263-4c70-85ec-af04ffb445c1');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
iloc vs loc
- iloc : index 기반, 속도↑, 대용량 데이터에 적합
- syntax : df.iloc[row_index, column_index]
- loc : label or boolean(조건식) 기반, 가독성↑
- syntax : df.loc[row_label, column_label]
1 2
| %%time juice.iloc[0:3, 0:2]
|
CPU times: user 514 µs, sys: 40 µs, total: 554 µs
Wall time: 523 µs
|
Date |
Location |
0 |
7/1/2016 |
Park |
1 |
7/2/2016 |
Park |
2 |
7/3/2016 |
Park |
<script>
const buttonEl =
document.querySelector('#df-219659e3-19a4-418f-84d6-7d01a8076e00 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-219659e3-19a4-418f-84d6-7d01a8076e00');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| %%time juice.loc[0:2, ["Date",'Location']]
|
CPU times: user 1.67 ms, sys: 0 ns, total: 1.67 ms
Wall time: 5.19 ms
|
Date |
Location |
0 |
7/1/2016 |
Park |
1 |
7/2/2016 |
Park |
2 |
7/3/2016 |
Park |
<script>
const buttonEl =
document.querySelector('#df-5e3530fd-7c31-4a6f-8a4f-dea8d6d42e78 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-5e3530fd-7c31-4a6f-8a4f-dea8d6d42e78');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2 3 4
|
juice.loc[juice['Leaflets'] >= 130, ['Date', 'Location', 'Leaflets']]
|
|
Date |
Location |
Leaflets |
4 |
7/5/2016 |
Beach |
135.0 |
7 |
7/7/2016 |
Beach |
135.0 |
10 |
7/10/2016 |
Beach |
131.0 |
11 |
7/11/2016 |
Beach |
135.0 |
25 |
7/25/2016 |
Park |
135.0 |
26 |
7/26/2016 |
Park |
158.0 |
<script>
const buttonEl =
document.querySelector('#df-80998d16-2836-479c-9ea3-75efcac52299 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-80998d16-2836-479c-9ea3-75efcac52299');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
데이터 정렬
1 2
| juice.sort_values(by=['Revenue']).head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
6 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
5 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
31 |
7/31/2016 |
Beach |
76 |
47 |
82 |
68.0 |
0.35 |
123 |
43.05 |
<script>
const buttonEl =
document.querySelector('#df-4d6ed323-d5df-47ed-893d-db57c2b39110 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-4d6ed323-d5df-47ed-893d-db57c2b39110');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| juice.sort_values(by=['Revenue'], ascending = False).head()
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
25 |
7/25/2016 |
Park |
156 |
113 |
84 |
135.0 |
0.50 |
269 |
134.50 |
18 |
7/18/2016 |
Park |
131 |
92 |
81 |
122.0 |
0.50 |
223 |
111.50 |
26 |
7/26/2016 |
Park |
176 |
129 |
83 |
158.0 |
0.35 |
305 |
106.75 |
19 |
7/19/2016 |
Park |
122 |
85 |
78 |
113.0 |
0.50 |
207 |
103.50 |
24 |
7/24/2016 |
Park |
121 |
82 |
82 |
117.0 |
0.50 |
203 |
101.50 |
<script>
const buttonEl =
document.querySelector('#df-ecf0c228-c2c7-4d62-aa08-e194d9fd1e4e button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-ecf0c228-c2c7-4d62-aa08-e194d9fd1e4e');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2
| juice.sort_values(by=['Price', 'Temperature'], ascending = [False, True])
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
20 |
7/20/2016 |
Park |
71 |
42 |
70 |
NaN |
0.50 |
113 |
56.50 |
16 |
7/16/2016 |
Beach |
81 |
50 |
74 |
90.0 |
0.50 |
131 |
65.50 |
15 |
7/15/2016 |
Beach |
98 |
62 |
75 |
108.0 |
0.50 |
160 |
80.00 |
17 |
7/17/2016 |
Beach |
115 |
76 |
77 |
126.0 |
0.50 |
191 |
95.50 |
21 |
7/21/2016 |
Park |
83 |
50 |
77 |
90.0 |
0.50 |
133 |
66.50 |
19 |
7/19/2016 |
Park |
122 |
85 |
78 |
113.0 |
0.50 |
207 |
103.50 |
22 |
7/22/2016 |
Park |
112 |
75 |
80 |
108.0 |
0.50 |
187 |
93.50 |
18 |
7/18/2016 |
Park |
131 |
92 |
81 |
122.0 |
0.50 |
223 |
111.50 |
23 |
7/23/2016 |
Park |
120 |
82 |
81 |
117.0 |
0.50 |
202 |
101.00 |
24 |
7/24/2016 |
Park |
121 |
82 |
82 |
117.0 |
0.50 |
203 |
101.50 |
25 |
7/25/2016 |
Park |
156 |
113 |
84 |
135.0 |
0.50 |
269 |
134.50 |
27 |
7/27/2016 |
Park |
104 |
68 |
80 |
99.0 |
0.35 |
172 |
60.20 |
29 |
7/29/2016 |
Park |
100 |
66 |
81 |
95.0 |
0.35 |
166 |
58.10 |
28 |
7/28/2016 |
Park |
96 |
63 |
82 |
90.0 |
0.35 |
159 |
55.65 |
30 |
7/30/2016 |
Beach |
88 |
57 |
82 |
81.0 |
0.35 |
145 |
50.75 |
31 |
7/31/2016 |
Beach |
76 |
47 |
82 |
68.0 |
0.35 |
123 |
43.05 |
26 |
7/26/2016 |
Park |
176 |
129 |
83 |
158.0 |
0.35 |
305 |
106.75 |
0 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
2 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
187 |
46.75 |
1 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
3 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
233 |
58.25 |
13 |
7/13/2016 |
Beach |
109 |
75 |
77 |
99.0 |
0.25 |
184 |
46.00 |
4 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
69.25 |
14 |
7/14/2016 |
Beach |
122 |
85 |
78 |
113.0 |
0.25 |
207 |
51.75 |
9 |
7/9/2016 |
Beach |
134 |
95 |
80 |
126.0 |
0.25 |
229 |
57.25 |
7 |
7/7/2016 |
Beach |
143 |
101 |
81 |
135.0 |
0.25 |
244 |
61.00 |
5 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
6 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
8 |
NaN |
Beach |
123 |
86 |
82 |
113.0 |
0.25 |
209 |
52.25 |
10 |
7/10/2016 |
Beach |
140 |
98 |
82 |
131.0 |
0.25 |
238 |
59.50 |
11 |
7/11/2016 |
Beach |
162 |
120 |
83 |
135.0 |
0.25 |
282 |
70.50 |
12 |
7/12/2016 |
Beach |
130 |
95 |
84 |
99.0 |
0.25 |
225 |
56.25 |
<script>
const buttonEl =
document.querySelector('#df-56042f49-547e-4908-99b7-0e79a9445a3d button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-56042f49-547e-4908-99b7-0e79a9445a3d');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2 3
| juice2 = juice.sort_values(by=['Price', 'Temperature'], ascending = [False, True]).reset_index(drop=True) juice2
|
|
Date |
Location |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
0 |
7/20/2016 |
Park |
71 |
42 |
70 |
NaN |
0.50 |
113 |
56.50 |
1 |
7/16/2016 |
Beach |
81 |
50 |
74 |
90.0 |
0.50 |
131 |
65.50 |
2 |
7/15/2016 |
Beach |
98 |
62 |
75 |
108.0 |
0.50 |
160 |
80.00 |
3 |
7/17/2016 |
Beach |
115 |
76 |
77 |
126.0 |
0.50 |
191 |
95.50 |
4 |
7/21/2016 |
Park |
83 |
50 |
77 |
90.0 |
0.50 |
133 |
66.50 |
5 |
7/19/2016 |
Park |
122 |
85 |
78 |
113.0 |
0.50 |
207 |
103.50 |
6 |
7/22/2016 |
Park |
112 |
75 |
80 |
108.0 |
0.50 |
187 |
93.50 |
7 |
7/18/2016 |
Park |
131 |
92 |
81 |
122.0 |
0.50 |
223 |
111.50 |
8 |
7/23/2016 |
Park |
120 |
82 |
81 |
117.0 |
0.50 |
202 |
101.00 |
9 |
7/24/2016 |
Park |
121 |
82 |
82 |
117.0 |
0.50 |
203 |
101.50 |
10 |
7/25/2016 |
Park |
156 |
113 |
84 |
135.0 |
0.50 |
269 |
134.50 |
11 |
7/27/2016 |
Park |
104 |
68 |
80 |
99.0 |
0.35 |
172 |
60.20 |
12 |
7/29/2016 |
Park |
100 |
66 |
81 |
95.0 |
0.35 |
166 |
58.10 |
13 |
7/28/2016 |
Park |
96 |
63 |
82 |
90.0 |
0.35 |
159 |
55.65 |
14 |
7/30/2016 |
Beach |
88 |
57 |
82 |
81.0 |
0.35 |
145 |
50.75 |
15 |
7/31/2016 |
Beach |
76 |
47 |
82 |
68.0 |
0.35 |
123 |
43.05 |
16 |
7/26/2016 |
Park |
176 |
129 |
83 |
158.0 |
0.35 |
305 |
106.75 |
17 |
7/1/2016 |
Park |
97 |
67 |
70 |
90.0 |
0.25 |
164 |
41.00 |
18 |
7/3/2016 |
Park |
110 |
77 |
71 |
104.0 |
0.25 |
187 |
46.75 |
19 |
7/2/2016 |
Park |
98 |
67 |
72 |
90.0 |
0.25 |
165 |
41.25 |
20 |
7/4/2016 |
Beach |
134 |
99 |
76 |
98.0 |
0.25 |
233 |
58.25 |
21 |
7/13/2016 |
Beach |
109 |
75 |
77 |
99.0 |
0.25 |
184 |
46.00 |
22 |
7/5/2016 |
Beach |
159 |
118 |
78 |
135.0 |
0.25 |
277 |
69.25 |
23 |
7/14/2016 |
Beach |
122 |
85 |
78 |
113.0 |
0.25 |
207 |
51.75 |
24 |
7/9/2016 |
Beach |
134 |
95 |
80 |
126.0 |
0.25 |
229 |
57.25 |
25 |
7/7/2016 |
Beach |
143 |
101 |
81 |
135.0 |
0.25 |
244 |
61.00 |
26 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
27 |
7/6/2016 |
Beach |
103 |
69 |
82 |
90.0 |
0.25 |
172 |
43.00 |
28 |
NaN |
Beach |
123 |
86 |
82 |
113.0 |
0.25 |
209 |
52.25 |
29 |
7/10/2016 |
Beach |
140 |
98 |
82 |
131.0 |
0.25 |
238 |
59.50 |
30 |
7/11/2016 |
Beach |
162 |
120 |
83 |
135.0 |
0.25 |
282 |
70.50 |
31 |
7/12/2016 |
Beach |
130 |
95 |
84 |
99.0 |
0.25 |
225 |
56.25 |
<script>
const buttonEl =
document.querySelector('#df-95b0ec19-306b-4312-a2fb-6c6bf238bb16 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-95b0ec19-306b-4312-a2fb-6c6bf238bb16');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
데이터 그룹화
- groupby() 함수
- 그룹별 집계함수를 통해 피벗테이블 생성
1
| juice.groupby(by='Location').count()
|
|
Date |
Lemon |
Orange |
Temperature |
Leaflets |
Price |
Sold |
Revenue |
Location |
|
|
|
|
|
|
|
|
Beach |
16 |
17 |
17 |
17 |
17 |
17 |
17 |
17 |
Park |
15 |
15 |
15 |
15 |
14 |
15 |
15 |
15 |
<script>
const buttonEl =
document.querySelector('#df-464f987e-371e-4805-8b50-ee52bdc321e8 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-464f987e-371e-4805-8b50-ee52bdc321e8');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>
1 2 3 4
| import numpy as np
juice.groupby(by='Location')[['Lemon','Orange']].agg([max, min, sum, np.mean])
|
|
Lemon |
Orange |
|
max |
min |
sum |
mean |
max |
min |
sum |
mean |
Location |
|
|
|
|
|
|
|
|
Beach |
162 |
76 |
2020 |
118.823529 |
120 |
47 |
1402 |
82.470588 |
Park |
176 |
71 |
1697 |
113.133333 |
129 |
42 |
1158 |
77.200000 |
<script>
const buttonEl =
document.querySelector('#df-1d771570-530b-4471-945e-e1db8f8b3457 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-1d771570-530b-4471-945e-e1db8f8b3457');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
</script>
</div>