[python] 6. pandas_2 (인공지능/부스트캠프 Ai Tech, 2022. 1. 21. 18:43)
In [1]:from IPython.core.display import display, HTML display(HTML("<style>.container { width:90% !important; }</style>")) # widen the notebook display area
6. Pandas_2¶
Groupby 1¶
Works like the SQL GROUP BY statement
- Computation proceeds in split - apply - combine order
df.groupby("column(s) to group by")["target column(s)"].operation()
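The pattern above mirrors SQL directly; a minimal toy sketch (hypothetical data) before the fuller example below:

```python
import pandas as pd

# toy frame; equivalent SQL: SELECT Team, SUM(Points) FROM t GROUP BY Team
t = pd.DataFrame({"Team": ["A", "A", "B"], "Points": [10, 20, 5]})

# split rows by Team, apply sum to each group, combine into one Series
result = t.groupby("Team")["Points"].sum()
print(result)
```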
In [2]:import pandas as pd import numpy as np from pandas import Series, DataFrame
In [4]:ipl_data = { "Team": [ "Riders", "Riders", "Devils", "Devils", "Kings", "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders", ], "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2], "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017], "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690], } df = pd.DataFrame(ipl_data) df
Out[4]:Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 2 Devils 2 2014 863 3 Devils 3 2015 673 4 Kings 3 2014 741 5 kings 4 2015 812 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 9 Royals 4 2014 701 10 Royals 1 2015 804 11 Riders 2 2017 690 In [5]:df.groupby("Team")["Points"].sum()
Out[5]:Team Devils 1536 Kings 2285 Riders 3049 Royals 1505 kings 812 Name: Points, dtype: int64
In [7]:h_index = df.groupby(['Team','Year'])["Points"].sum() h_index
Out[7]:Team Year Devils 2014 863 2015 673 Kings 2014 741 2016 756 2017 788 Riders 2014 876 2015 789 2016 694 2017 690 Royals 2014 701 2015 804 kings 2015 812 Name: Points, dtype: int64
- unstack() : converts the grouped data into matrix (wide) form
In [8]:h_index.unstack()
Out[8]:Year 2014 2015 2016 2017 Team Devils 863.0 673.0 NaN NaN Kings 741.0 NaN 756.0 788.0 Riders 876.0 789.0 694.0 690.0 Royals 701.0 804.0 NaN NaN kings NaN 812.0 NaN NaN In [10]:h_index.index
Out[10]:MultiIndex([('Devils', 2014), ('Devils', 2015), ( 'Kings', 2014), ( 'Kings', 2016), ( 'Kings', 2017), ('Riders', 2014), ('Riders', 2015), ('Riders', 2016), ('Riders', 2017), ('Royals', 2014), ('Royals', 2015), ( 'kings', 2015)], names=['Team', 'Year'])
In [11]:h_index["Devils":"Kings"]
Out[11]:Team Year Devils 2014 863 2015 673 Kings 2014 741 2016 756 2017 788 Name: Points, dtype: int64
In [13]:h_index.reset_index() # rebuilds the index, turning index levels back into columns
Out[13]:Team Year Points 0 Devils 2014 863 1 Devils 2015 673 2 Kings 2014 741 3 Kings 2016 756 4 Kings 2017 788 5 Riders 2014 876 6 Riders 2015 789 7 Riders 2016 694 8 Riders 2017 690 9 Royals 2014 701 10 Royals 2015 804 11 kings 2015 812 - swaplevel : swaps index levels
In [14]:h_index
Out[14]:Team Year Devils 2014 863 2015 673 Kings 2014 741 2016 756 2017 788 Riders 2014 876 2015 789 2016 694 2017 690 Royals 2014 701 2015 804 kings 2015 812 Name: Points, dtype: int64
- sort_index
- sort_values
In [19]:h_index.sort_index(level=0) # sort by level 0, the Team index
Out[19]:Team Year Devils 2014 863 2015 673 Kings 2014 741 2016 756 2017 788 Riders 2014 876 2015 789 2016 694 2017 690 Royals 2014 701 2015 804 kings 2015 812 Name: Points, dtype: int64
In [20]:h_index.sort_index(level=1)
Out[20]:Team Year Devils 2014 863 Kings 2014 741 Riders 2014 876 Royals 2014 701 Devils 2015 673 Riders 2015 789 Royals 2015 804 kings 2015 812 Kings 2016 756 Riders 2016 694 Kings 2017 788 Riders 2017 690 Name: Points, dtype: int64
In [21]:h_index.sort_values()
Out[21]:Team Year Devils 2015 673 Riders 2017 690 2016 694 Royals 2014 701 Kings 2014 741 2016 756 2017 788 Riders 2015 789 Royals 2015 804 kings 2015 812 Devils 2014 863 Riders 2014 876 Name: Points, dtype: int64
In [15]:h_index.swaplevel() # Year and Team swap positions
Out[15]:Year Team 2014 Devils 863 2015 Devils 673 2014 Kings 741 2016 Kings 756 2017 Kings 788 2014 Riders 876 2015 Riders 789 2016 Riders 694 2017 Riders 690 2014 Royals 701 2015 Royals 804 kings 812 Name: Points, dtype: int64
A hierarchically indexed result is still a Series
- With the right level specified, std, sum, etc. can be applied per index level.
In [22]:h_index.std(level=0)
Out[22]:Team Devils 134.350288 Kings 24.006943 Riders 88.567771 Royals 72.831998 kings NaN Name: Points, dtype: float64
In [24]:h_index.sum(level=1)
Out[24]:Year 2014 3181 2015 3078 2016 1450 2017 1478 Name: Points, dtype: int64
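Note that the level= keyword on Series reductions (std(level=0), sum(level=1)) is deprecated in pandas 1.3+ and removed in 2.0; the same results come from grouping on the index level. A sketch with a small MultiIndex series (a subset of the data above):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Devils", 2014), ("Devils", 2015), ("Kings", 2014)],
    names=["Team", "Year"],
)
h = pd.Series([863, 673, 741], index=idx, name="Points")

# groupby(level=...) is the non-deprecated spelling of sum(level=0) / sum(level=1)
per_team = h.groupby(level=0).sum()
per_year = h.groupby(level="Year").sum()
```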
Groupby - grouped¶
The split state produced by groupby can be extracted; three types of apply are possible on the extracted group information, each applied per column
- Aggregation : extracts summarized statistics
- Transformation : transforms the group's values
- Filtration : filters, dropping groups that fail a condition
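The three apply types can be contrasted on a toy frame (hypothetical data; each one is demonstrated on the real data below):

```python
import pandas as pd

df_toy = pd.DataFrame({"Team": ["A", "A", "B", "B", "B"],
                       "Points": [10, 20, 5, 7, 9]})
g = df_toy.groupby("Team")

agg_out = g["Points"].sum()                                # aggregation: one value per group
trans_out = g["Points"].transform(lambda s: s - s.mean())  # transformation: same length as df_toy
filt_out = g.filter(lambda x: len(x) >= 3)                 # filtration: drops whole groups
```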
In [25]:grouped = df.groupby("Team") grouped # a lazy, generator-like object
Out[25]:<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000015CFD19BF48>
In [26]:for name, group in grouped: # name : Key , group : Value print(name) print(group)
Devils Team Rank Year Points 2 Devils 2 2014 863 3 Devils 3 2015 673 Kings Team Rank Year Points 4 Kings 3 2014 741 6 Kings 1 2016 756 7 Kings 1 2017 788 Riders Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 8 Riders 2 2016 694 11 Riders 2 2017 690 Royals Team Rank Year Points 9 Royals 4 2014 701 10 Royals 1 2015 804 kings Team Rank Year Points 5 kings 4 2015 812
In [28]:type(group)
Out[28]:pandas.core.frame.DataFrame
- get_group : extracts only the group with a given key
In [29]:grouped.get_group("Devils")
Out[29]:Team Rank Year Points 2 Devils 2 2014 863 3 Devils 3 2015 673 - aggregation
In [33]:grouped.agg(max)
Out[33]:Rank Year Points Team Devils 3 2015 863 Kings 3 2017 788 Riders 2 2017 876 Royals 4 2015 804 kings 4 2015 812 In [30]:grouped.agg(sum)
Out[30]:Rank Year Points Team Devils 5 4029 1536 Kings 5 6047 2285 Riders 7 8062 3049 Royals 5 4029 1505 kings 4 2015 812 In [31]:import numpy as np grouped.agg(np.mean)
Out[31]:Rank Year Points Team Devils 2.500000 2014.500000 768.000000 Kings 1.666667 2015.666667 761.666667 Riders 1.750000 2015.500000 762.250000 Royals 2.500000 2014.500000 752.500000 kings 4.000000 2015.000000 812.000000 In [32]:grouped['Points'].agg([np.sum, np.mean, np.std])
Out[32]:sum mean std Team Devils 1536 768.000000 134.350288 Kings 2285 761.666667 24.006943 Riders 3049 762.250000 88.567771 Royals 1505 752.500000 72.831998 kings 812 812.000000 NaN In [35]:grouped.describe().T
Out[35]:Team Devils Kings Riders Royals kings Rank count 2.000000 3.000000 4.000000 2.000000 1.0 mean 2.500000 1.666667 1.750000 2.500000 4.0 std 0.707107 1.154701 0.500000 2.121320 NaN min 2.000000 1.000000 1.000000 1.000000 4.0 25% 2.250000 1.000000 1.750000 1.750000 4.0 50% 2.500000 1.000000 2.000000 2.500000 4.0 75% 2.750000 2.000000 2.000000 3.250000 4.0 max 3.000000 3.000000 2.000000 4.000000 4.0 Year count 2.000000 3.000000 4.000000 2.000000 1.0 mean 2014.500000 2015.666667 2015.500000 2014.500000 2015.0 std 0.707107 1.527525 1.290994 0.707107 NaN min 2014.000000 2014.000000 2014.000000 2014.000000 2015.0 25% 2014.250000 2015.000000 2014.750000 2014.250000 2015.0 50% 2014.500000 2016.000000 2015.500000 2014.500000 2015.0 75% 2014.750000 2016.500000 2016.250000 2014.750000 2015.0 max 2015.000000 2017.000000 2017.000000 2015.000000 2015.0 Points count 2.000000 3.000000 4.000000 2.000000 1.0 mean 768.000000 761.666667 762.250000 752.500000 812.0 std 134.350288 24.006943 88.567771 72.831998 NaN min 673.000000 741.000000 690.000000 701.000000 812.0 25% 720.500000 748.500000 693.000000 726.750000 812.0 50% 768.000000 756.000000 741.500000 752.500000 812.0 75% 815.500000 772.000000 810.750000 778.250000 812.0 max 863.000000 788.000000 876.000000 804.000000 812.0 - Transformation : transforms individual values, applied identically per column within each group. Unlike aggregation, the result is not summarized per key
In [36]:df
Out[36]:Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 2 Devils 2 2014 863 3 Devils 3 2015 673 4 Kings 3 2014 741 5 kings 4 2015 812 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 9 Royals 4 2014 701 10 Royals 1 2015 804 11 Riders 2 2017 690 In [37]:score = lambda x: (x) grouped.transform(score)
Out[37]:Rank Year Points 0 1 2014 876 1 2 2015 789 2 2 2014 863 3 3 2015 673 4 3 2014 741 5 4 2015 812 6 1 2016 756 7 1 2017 788 8 2 2016 694 9 4 2014 701 10 1 2015 804 11 2 2017 690 In [38]:score = lambda x: (x.max()) grouped.transform(score)
Out[38]:Rank Year Points 0 2 2017 876 1 2 2017 876 2 3 2015 863 3 3 2015 863 4 3 2017 788 5 4 2015 812 6 3 2017 788 7 3 2017 788 8 2 2017 876 9 4 2015 804 10 4 2015 804 11 2 2017 876 In [39]:score = lambda x: (x - x.mean()) / x.std() # Normalization grouped.transform(score)
Out[39]:Rank Year Points 0 -1.500000 -1.161895 1.284327 1 0.500000 -0.387298 0.302029 2 -0.707107 -0.707107 0.707107 3 0.707107 0.707107 -0.707107 4 1.154701 -1.091089 -0.860862 5 NaN NaN NaN 6 -0.577350 0.218218 -0.236043 7 -0.577350 0.872872 1.096905 8 0.500000 0.387298 -0.770596 9 0.707107 -0.707107 -0.707107 10 -0.707107 0.707107 0.707107 11 0.500000 1.161895 -0.815759 In [40]:grouped.transform(lambda x : x)
Out[40]:Rank Year Points 0 1 2014 876 1 2 2015 789 2 2 2014 863 3 3 2015 673 4 3 2014 741 5 4 2015 812 6 1 2016 756 7 1 2017 788 8 2 2016 694 9 4 2014 701 10 1 2015 804 11 2 2017 690 - filter : used to select data that meets a condition
In [43]:df['Team'].value_counts()
Out[43]:Riders 4 Kings 3 Royals 2 Devils 2 kings 1 Name: Team, dtype: int64
In [41]:df.groupby("Team").filter(lambda x: len(x) >=3) # keep only teams with at least 3 rows
Out[41]:Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 4 Kings 3 2014 741 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 11 Riders 2 2017 690 In [46]:df.groupby("Team").filter(lambda x: x['Points'].mean() > 1000)
Out[46]:Team Rank Year Points In [49]:df.groupby("Team").filter(lambda x: x['Points'].mean() > 760)
Out[49]:Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 2 Devils 2 2014 863 3 Devils 3 2015 673 4 Kings 3 2014 741 5 kings 4 2015 812 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 11 Riders 2 2017 690 Data - case study¶
In [52]:df_phone = pd.read_csv('phone_data.csv') df_phone
Out[52]:index date duration item month network network_type 0 0 15/10/14 06:58 34.429 data 2014-11 data data 1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile 2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile 3 3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile 4 4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile ... ... ... ... ... ... ... ... 825 825 13/03/15 00:38 1.000 sms 2015-03 world world 826 826 13/03/15 00:39 1.000 sms 2015-03 Vodafone mobile 827 827 13/03/15 06:58 34.429 data 2015-03 data data 828 828 14/03/15 00:13 1.000 sms 2015-03 world world 829 829 14/03/15 00:16 1.000 sms 2015-03 world world 830 rows × 7 columns
In [53]:df_phone.dtypes # date is object dtype, i.e. strings
Out[53]:index int64 date object duration float64 item object month object network object network_type object dtype: object
In [54]:import dateutil df_phone['date'] = df_phone['date'].apply(dateutil.parser.parse, dayfirst=True) df_phone.head()
Out[54]:index date duration item month network network_type 0 0 2014-10-15 06:58:00 34.429 data 2014-11 data data 1 1 2014-10-15 06:58:00 13.000 call 2014-11 Vodafone mobile 2 2 2014-10-15 14:46:00 23.000 call 2014-11 Meteor mobile 3 3 2014-10-15 14:48:00 4.000 call 2014-11 Tesco mobile 4 4 2014-10-15 17:27:00 4.000 call 2014-11 Tesco mobile In [55]:df_phone.dtypes
Out[55]:index int64 date datetime64[ns] duration float64 item object month object network object network_type object dtype: object
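As an alternative to dateutil, pandas' own pd.to_datetime handles the same conversion; a sketch with an explicit format string (assuming the file's 'DD/MM/YY HH:MM' layout):

```python
import pandas as pd

s = pd.Series(["15/10/14 06:58", "13/03/15 00:38"])

# an explicit format avoids day/month ambiguity; equivalent to dayfirst=True parsing here
parsed = pd.to_datetime(s, format="%d/%m/%y %H:%M")
print(parsed)
```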
In [57]:df_phone.groupby("month")['duration'].sum()
Out[57]:month 2014-11 26639.441 2014-12 14641.870 2015-01 18223.299 2015-02 15522.299 2015-03 22750.441 Name: duration, dtype: float64
In [58]:df_phone.groupby("month")['duration'].mean()
Out[58]:month 2014-11 115.823657 2014-12 93.260318 2015-01 88.894141 2015-02 113.301453 2015-03 225.251891 Name: duration, dtype: float64
In [59]:df_phone.groupby("month")['duration'].sum().plot()
Out[59]:<matplotlib.axes._subplots.AxesSubplot at 0x15cffc8c548>
In [60]:df_phone[df_phone['item']=='call'].groupby("month")['duration'].sum().plot()
Out[60]:<matplotlib.axes._subplots.AxesSubplot at 0x15cffd86ac8>
In [61]:df_phone[df_phone['item']=='data'].groupby("month")['duration'].sum().plot()
Out[61]:<matplotlib.axes._subplots.AxesSubplot at 0x15cffdde508>
In [63]:df_phone.groupby(["month","item"])['duration'].count()
Out[63]:month item 2014-11 call 107 data 29 sms 94 2014-12 call 79 data 30 sms 48 2015-01 call 88 data 31 sms 86 2015-02 call 67 data 31 sms 39 2015-03 call 47 data 29 sms 25 Name: duration, dtype: int64
In [64]:df_phone.groupby(["month","item"])['duration'].count().unstack()
Out[64]:item call data sms month 2014-11 107 29 94 2014-12 79 30 48 2015-01 88 31 86 2015-02 67 31 39 2015-03 47 29 25 In [65]:df_phone.groupby(["month","item"])['duration'].count().unstack().plot()
Out[65]:<matplotlib.axes._subplots.AxesSubplot at 0x15cffe49ac8>
In [73]:df_phone.groupby(["month","item"])['duration'].sum()
Out[73]:month item 2014-11 call 25547.000 data 998.441 sms 94.000 2014-12 call 13561.000 data 1032.870 sms 48.000 2015-01 call 17070.000 data 1067.299 sms 86.000 2015-02 call 14416.000 data 1067.299 sms 39.000 2015-03 call 21727.000 data 998.441 sms 25.000 Name: duration, dtype: float64
- agg examples
In [69]:df_phone.groupby("month", as_index=False).agg({"duration" : "sum"}) # as_index=False: do not use month as the index
Out[69]:month duration 0 2014-11 26639.441 1 2014-12 14641.870 2 2015-01 18223.299 3 2015-02 15522.299 4 2015-03 22750.441 In [70]:df_phone.groupby("month").agg({"duration" : "sum"}).reset_index()
Out[70]:month duration 0 2014-11 26639.441 1 2014-12 14641.870 2 2015-01 18223.299 3 2015-02 15522.299 4 2015-03 22750.441 In [72]:df_phone.groupby(['month','item']).agg( { "duration" : "sum", "network_type" : "count", "date" : "first" } )
Out[72]:duration network_type date month item 2014-11 call 25547.000 107 2014-10-15 06:58:00 data 998.441 29 2014-10-15 06:58:00 sms 94.000 94 2014-10-16 22:18:00 2014-12 call 13561.000 79 2014-11-14 17:24:00 data 1032.870 30 2014-11-13 06:58:00 sms 48.000 48 2014-11-14 17:28:00 2015-01 call 17070.000 88 2014-12-15 20:03:00 data 1067.299 31 2014-12-13 06:58:00 sms 86.000 86 2014-12-15 19:56:00 2015-02 call 14416.000 67 2015-01-15 10:36:00 data 1067.299 31 2015-01-13 06:58:00 sms 39.000 39 2015-01-15 12:23:00 2015-03 call 21727.000 47 2015-02-12 20:15:00 data 998.441 29 2015-02-13 06:58:00 sms 25.000 25 2015-02-19 18:46:00 In [77]:df_phone.groupby(['month','item']).agg({'duration' : [min, max, sum], 'network_type' : "count", 'date' : [min, 'first', 'nunique']})
Out[77]:duration network_type date min max sum count min first nunique month item 2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 2014-10-15 06:58:00 104 data 34.429 34.429 998.441 29 2014-10-15 06:58:00 2014-10-15 06:58:00 29 sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 2014-10-16 22:18:00 79 2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 2014-11-14 17:24:00 76 data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 2014-11-13 06:58:00 30 sms 1.000 1.000 48.000 48 2014-11-14 17:28:00 2014-11-14 17:28:00 41 2015-01 call 2.000 1859.000 17070.000 88 2014-12-15 20:03:00 2014-12-15 20:03:00 84 data 34.429 34.429 1067.299 31 2014-12-13 06:58:00 2014-12-13 06:58:00 31 sms 1.000 1.000 86.000 86 2014-12-15 19:56:00 2014-12-15 19:56:00 58 2015-02 call 1.000 1863.000 14416.000 67 2015-01-15 10:36:00 2015-01-15 10:36:00 67 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 2015-01-13 06:58:00 31 sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 2015-01-15 12:23:00 27 2015-03 call 2.000 10528.000 21727.000 47 2015-02-12 20:15:00 2015-02-12 20:15:00 47 data 34.429 34.429 998.441 29 2015-02-13 06:58:00 2015-02-13 06:58:00 29 sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 2015-02-19 18:46:00 17 In [87]:grouped = df_phone.groupby('month').agg( {"duration" : [min, max, np.mean]})
In [88]:grouped
Out[88]:duration min max mean month 2014-11 1.0 1940.0 115.823657 2014-12 1.0 2120.0 93.260318 2015-01 1.0 1859.0 88.894141 2015-02 1.0 1863.0 113.301453 2015-03 1.0 10528.0 225.251891 In [89]:grouped.columns = grouped.columns.droplevel(level=0) grouped.columns
Out[89]:Index(['min', 'max', 'mean'], dtype='object')
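The droplevel step above flattens the MultiIndex columns but leaves generic names. As a sketch of an alternative, named aggregation (pandas 0.25+) produces the final flat names in one step, with no droplevel or rename needed (toy data):

```python
import pandas as pd

toy = pd.DataFrame({"month": ["2014-11", "2014-11", "2014-12"],
                    "duration": [13.0, 23.0, 4.0]})

# keyword = (column, function): flat column names straight away
out = toy.groupby("month").agg(
    min_duration=("duration", "min"),
    max_duration=("duration", "max"),
    mean_duration=("duration", "mean"),
)
```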
In [90]:grouped.rename(columns={'min' : 'min_duration', 'max' : 'max_duration', 'mean' : 'mean_duration'})
Out[90]:min_duration max_duration mean_duration month 2014-11 1.0 1940.0 115.823657 2014-12 1.0 2120.0 93.260318 2015-01 1.0 1859.0 88.894141 2015-02 1.0 1863.0 113.301453 2015-03 1.0 10528.0 225.251891 add_prefix¶
Adds a prefix to the column names
In [91]:grouped = df_phone.groupby('month').agg( {"duration" : [min, max, np.mean]}) grouped
Out[91]:duration min max mean month 2014-11 1.0 1940.0 115.823657 2014-12 1.0 2120.0 93.260318 2015-01 1.0 1859.0 88.894141 2015-02 1.0 1863.0 113.301453 2015-03 1.0 10528.0 225.251891 In [94]:grouped.columns = grouped.columns.droplevel(level=0) grouped.columns
Out[94]:Index(['min', 'max', 'mean'], dtype='object')
In [95]:grouped.add_prefix("duration_")
Out[95]:duration_min duration_max duration_mean month 2014-11 1.0 1940.0 115.823657 2014-12 1.0 2120.0 93.260318 2015-01 1.0 1859.0 88.894141 2015-02 1.0 1863.0 113.301453 2015-03 1.0 10528.0 225.251891 In [96]:df_phone
Out[96]:index date duration item month network network_type 0 0 2014-10-15 06:58:00 34.429 data 2014-11 data data 1 1 2014-10-15 06:58:00 13.000 call 2014-11 Vodafone mobile 2 2 2014-10-15 14:46:00 23.000 call 2014-11 Meteor mobile 3 3 2014-10-15 14:48:00 4.000 call 2014-11 Tesco mobile 4 4 2014-10-15 17:27:00 4.000 call 2014-11 Tesco mobile ... ... ... ... ... ... ... ... 825 825 2015-03-13 00:38:00 1.000 sms 2015-03 world world 826 826 2015-03-13 00:39:00 1.000 sms 2015-03 Vodafone mobile 827 827 2015-03-13 06:58:00 34.429 data 2015-03 data data 828 828 2015-03-14 00:13:00 1.000 sms 2015-03 world world 829 829 2015-03-14 00:16:00 1.000 sms 2015-03 world world 830 rows × 7 columns
In [97]:df_phone.pivot_table( values=["duration"], index=[df_phone.month, df_phone.item], columns=df_phone.network, aggfunc="sum", fill_value=0 # replace NaN with 0 )
Out[97]:duration network Meteor Tesco Three Vodafone data landline special voicemail world month item 2014-11 call 1521 4045 12458 4316 0.000 2906 0 301 0 data 0 0 0 0 998.441 0 0 0 0 sms 10 3 25 55 0.000 0 1 0 0 2014-12 call 2010 1819 6316 1302 0.000 1424 0 690 0 data 0 0 0 0 1032.870 0 0 0 0 sms 12 1 13 18 0.000 0 0 0 4 2015-01 call 2207 2904 6445 3626 0.000 1603 0 285 0 data 0 0 0 0 1067.299 0 0 0 0 sms 10 3 33 40 0.000 0 0 0 0 2015-02 call 1188 4087 6279 1864 0.000 730 0 268 0 data 0 0 0 0 1067.299 0 0 0 0 sms 1 2 11 23 0.000 0 2 0 0 2015-03 call 274 973 4966 3513 0.000 11770 0 231 0 data 0 0 0 0 998.441 0 0 0 0 sms 0 4 5 13 0.000 0 0 0 3 In [99]:# similar to pivot_table df_phone.groupby(["month", "item",'network'])["duration"].sum().unstack()
Out[99]:network Meteor Tesco Three Vodafone data landline special voicemail world month item 2014-11 call 1521.0 4045.0 12458.0 4316.0 NaN 2906.0 NaN 301.0 NaN data NaN NaN NaN NaN 998.441 NaN NaN NaN NaN sms 10.0 3.0 25.0 55.0 NaN NaN 1.0 NaN NaN 2014-12 call 2010.0 1819.0 6316.0 1302.0 NaN 1424.0 NaN 690.0 NaN data NaN NaN NaN NaN 1032.870 NaN NaN NaN NaN sms 12.0 1.0 13.0 18.0 NaN NaN NaN NaN 4.0 2015-01 call 2207.0 2904.0 6445.0 3626.0 NaN 1603.0 NaN 285.0 NaN data NaN NaN NaN NaN 1067.299 NaN NaN NaN NaN sms 10.0 3.0 33.0 40.0 NaN NaN NaN NaN NaN 2015-02 call 1188.0 4087.0 6279.0 1864.0 NaN 730.0 NaN 268.0 NaN data NaN NaN NaN NaN 1067.299 NaN NaN NaN NaN sms 1.0 2.0 11.0 23.0 NaN NaN 2.0 NaN NaN 2015-03 call 274.0 973.0 4966.0 3513.0 NaN 11770.0 NaN 231.0 NaN data NaN NaN NaN NaN 998.441 NaN NaN NaN NaN sms NaN 4.0 5.0 13.0 NaN NaN NaN NaN 3.0 In [100]:df_movie = pd.read_csv("https://raw.githubusercontent.com/TeamLab/machine_learning_from_scratch_with_python/master/code/ch5/data/movie_rating.csv") df_movie.head()
Out[100]:critic title rating 0 Jack Matthews Lady in the Water 3.0 1 Jack Matthews Snakes on a Plane 4.0 2 Jack Matthews You Me and Dupree 3.5 3 Jack Matthews Superman Returns 5.0 4 Jack Matthews The Night Listener 3.0 In [103]:df_movie.pivot_table( values=['rating'], index=df_movie.critic, columns=df_movie.title, aggfunc="sum", fill_value=0)
Out[103]:rating title Just My Luck Lady in the Water Snakes on a Plane Superman Returns The Night Listener You Me and Dupree critic Claudia Puig 3.0 0.0 3.5 4.0 4.5 2.5 Gene Seymour 1.5 3.0 3.5 5.0 3.0 3.5 Jack Matthews 0.0 3.0 4.0 5.0 3.0 3.5 Lisa Rose 3.0 2.5 3.5 3.5 3.0 2.5 Mick LaSalle 2.0 3.0 4.0 3.0 3.0 2.0 Toby 0.0 0.0 4.5 4.0 0.0 1.0 In [105]:pd.crosstab( index=df_movie.critic, columns=df_movie.title, values=df_movie.rating, aggfunc="first", ).fillna(0)
Out[105]:title Just My Luck Lady in the Water Snakes on a Plane Superman Returns The Night Listener You Me and Dupree critic Claudia Puig 3.0 0.0 3.5 4.0 4.5 2.5 Gene Seymour 1.5 3.0 3.5 5.0 3.0 3.5 Jack Matthews 0.0 3.0 4.0 5.0 3.0 3.5 Lisa Rose 3.0 2.5 3.5 3.5 3.0 2.5 Mick LaSalle 2.0 3.0 4.0 3.0 3.0 2.0 Toby 0.0 0.0 4.5 4.0 0.0 1.0 In [109]:df_movie.groupby(["critic","title"]).agg({"rating" : "first"}).unstack()
Out[109]:rating title Just My Luck Lady in the Water Snakes on a Plane Superman Returns The Night Listener You Me and Dupree critic Claudia Puig 3.0 NaN 3.5 4.0 4.5 2.5 Gene Seymour 1.5 3.0 3.5 5.0 3.0 3.5 Jack Matthews NaN 3.0 4.0 5.0 3.0 3.5 Lisa Rose 3.0 2.5 3.5 3.5 3.0 2.5 Mick LaSalle 2.0 3.0 4.0 3.0 3.0 2.0 Toby NaN NaN 4.5 4.0 NaN 1.0 In [110]:raw_data = { "subject_id": ["2", "2", "3", "4", "5", "7", "8", "9", "10", "11"], "test_score": [51, 15, 15, 61, 16, 14, 15, 1, 61, 16], } df_a = pd.DataFrame(raw_data, columns=["subject_id", "test_score"]) df_a
Out[110]:subject_id test_score 0 2 51 1 2 15 2 3 15 3 4 61 4 5 16 5 7 14 6 8 15 7 9 1 8 10 61 9 11 16 In [111]:raw_data = { "subject_id": ["4", "5", "6", "7", "8"], "first_name": ["Billy", "Brian", "Bran", "Bryce", "Betty"], "last_name": ["Bonder", "Black", "Balwner", "Brice", "Btisan"], } df_b = pd.DataFrame(raw_data, columns=["subject_id", "first_name", "last_name"]) df_b
Out[111]:subject_id first_name last_name 0 4 Billy Bonder 1 5 Brian Black 2 6 Bran Balwner 3 7 Bryce Brice 4 8 Betty Btisan In [112]:pd.merge(df_a, df_b, on ="subject_id") # on : the common key column
Out[112]:subject_id test_score first_name last_name 0 4 61 Billy Bonder 1 5 16 Brian Black 2 7 14 Bryce Brice 3 8 15 Betty Btisan In [ ]:pd.merge(df_a, df_b, left_on ="subject_id", right_on="subject_id") # for joining on key columns named differently on each side
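The default merge is an inner join: only keys present on both sides survive, as Out[112] shows. The how parameter controls this; a sketch with toy frames (hypothetical data) of an outer join plus indicator=True to record each row's origin:

```python
import pandas as pd

a = pd.DataFrame({"subject_id": ["4", "5", "9"], "test_score": [61, 16, 1]})
b = pd.DataFrame({"subject_id": ["4", "5", "6"], "first_name": ["Billy", "Brian", "Bran"]})

# how="outer" keeps keys from both sides; the _merge column marks the origin
merged = pd.merge(a, b, on="subject_id", how="outer", indicator=True)
print(merged)
```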
In [113]:pd.merge(df_a, df_b, on ="subject_id", how="right")
Out[113]:subject_id test_score first_name last_name 0 4 61.0 Billy Bonder 1 5 16.0 Brian Black 2 7 14.0 Bryce Brice 3 8 15.0 Betty Btisan 4 6 NaN Bran Balwner - index based join
In [116]:df = pd.merge(df_a, df_b, right_index=True, left_index=True) df
Out[116]:subject_id_x test_score subject_id_y first_name last_name 0 2 51 4 Billy Bonder 1 2 15 5 Brian Black 2 3 15 6 Bran Balwner 3 4 61 7 Bryce Brice 4 5 16 8 Betty Btisan concat¶
Appends data of the same shape
In [118]:raw_data = { "subject_id": ["1", "2", "3", "4", "5"], "first_name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"], "last_name": ["Anderson", "Ackerman", "Ali", "Aoni", "Atiches"], } df_a = pd.DataFrame(raw_data, columns=["subject_id", "first_name", "last_name"]) df_a
Out[118]:subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches In [119]:raw_data = { "subject_id": ["4", "5", "6", "7", "8"], "first_name": ["Billy", "Brian", "Bran", "Bryce", "Betty"], "last_name": ["Bonder", "Black", "Balwner", "Brice", "Btisan"], } df_b = pd.DataFrame(raw_data, columns=["subject_id", "first_name", "last_name"]) df_b
Out[119]:subject_id first_name last_name 0 4 Billy Bonder 1 5 Brian Black 2 6 Bran Balwner 3 7 Bryce Brice 4 8 Betty Btisan In [121]:df_new = pd.concat([df_a , df_b]) df_new.reset_index(drop=True)
Out[121]:subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches 5 4 Billy Bonder 6 5 Brian Black 7 6 Bran Balwner 8 7 Bryce Brice 9 8 Betty Btisan In [122]:df_new = pd.concat([df_a , df_b], axis=1) df_new.reset_index(drop=True)
Out[122]:subject_id first_name last_name subject_id first_name last_name 0 1 Alex Anderson 4 Billy Bonder 1 2 Amy Ackerman 5 Brian Black 2 3 Allen Ali 6 Bran Balwner 3 4 Alice Aoni 7 Bryce Brice 4 5 Ayoung Atiches 8 Betty Btisan case¶