[Scribbles 3] Kaggle: Movies 1910-2024 (Metacritic), a Metacritic movie dataset

2024. 9. 28. 17:04

Intro.

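I'm renting on a jeonse (lump-sum deposit) lease for the first time and have been shopping around for a loan,

and it is honestly exhausting.

I was less worried about jeonse fraud than about the loan not being approved.

~~(Just try scamming me. I will drop everything and chase you to the ends of the earth for revenge....)~~

What I learned this time, once again: don't trust one person's word. Ask several more people and confirm things with your own eyes.

And a bit of luck.

(Whether you meet an angel of a bank clerk really feels like luck of the draw... To the clerk at the Guri branch: once the move is done I'll send you a gift voucher at the very least, I mean it.)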

[SIMPLE EDA AND VIEW: Movies 1910-2024 (Metacritic)]

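1. About the dataset

Dataset:
Metacritic. If you like movies, you have probably heard one of the movie YouTubers mention it at some point. It is a US site that collects film critics' reviews and scores, and this dataset gathers that Metacritic data. It covers more than 16,000 movies released from 1910 through 2024.

Source: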
https://www.kaggle.com/datasets/kashifsahil/16000-movies-1910-2024-metacritic/data

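My goal:
I just wanted to scribble around and see which movies had the best ratings by year or by period, and which of them I had actually watched.
~~I wasn't trying to do some grand analysis.. I just wanted to look...~~

2. Exploratory data analysis (EDA)

The dataset is fairly large, so I did some light preprocessing first.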

import re
import pandas as pd

# Assumes `df` already holds the Kaggle CSV (for example, loaded with pd.read_csv).

# 1. Remove the leftover index column 'Unnamed: 0'
df_cleaned = df.drop(columns=['Unnamed: 0'])

# 2. Drop rows that contain any NaN
df_cleaned = df_cleaned.dropna()

# 3. Convert data types
# 'Release Date' -> datetime (unparseable values become NaT)
df_cleaned['Release Date'] = pd.to_datetime(df_cleaned['Release Date'], errors='coerce')

# 'Rating' -> numeric
df_cleaned['Rating'] = pd.to_numeric(df_cleaned['Rating'], errors='coerce')

# 'No of Persons Voted' -> numeric
df_cleaned['No of Persons Voted'] = pd.to_numeric(df_cleaned['No of Persons Voted'], errors='coerce')

# 'Duration' ("1 h 30 m") -> minutes
def convert_duration_to_minutes(duration_str):
    match = re.match(r'(?:(\d+) h)?\s*(?:(\d+) m)?', duration_str)
    # Both groups are optional, so the pattern also matches strings with no
    # duration in them; treat that case as missing instead of 0 minutes.
    if not match or (match.group(1) is None and match.group(2) is None):
        return None
    hours = int(match.group(1)) if match.group(1) else 0
    minutes = int(match.group(2)) if match.group(2) else 0
    return hours * 60 + minutes

df_cleaned['Duration'] = df_cleaned['Duration'].apply(lambda x: convert_duration_to_minutes(x) if isinstance(x, str) else None)

# 4. Drop rows where any converted column failed to parse
df_cleaned = df_cleaned.dropna(subset=['Release Date', 'Rating', 'No of Persons Voted', 'Duration'])

# Compare the shape before and after cleaning
original_shape = df.shape
cleaned_shape = df_cleaned.shape

comparison = {
    "origin row": original_shape[0],
    "origin column": original_shape[1],
    "cleaned row": cleaned_shape[0],
    "cleaned column": cleaned_shape[1],
    "removed row": original_shape[0] - cleaned_shape[0]
}

comparison
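The duration regex has two optional groups, so it can match a string that contains no duration at all. Here is a minimal sketch of a stricter parser, assuming the column holds strings like "2 h 15 m", "1 h", or "45 m". The name `duration_to_minutes` and the sample strings are mine, not from the dataset.

```python
import re

# Assumed input format: "2 h 15 m", "1 h", or "45 m" (not verified against the CSV).
DURATION_RE = re.compile(r'^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$')

def duration_to_minutes(text):
    """Return total minutes, or None when no hour or minute part is present."""
    if not isinstance(text, str):
        return None
    match = DURATION_RE.match(text)
    # An empty string matches the pattern with both groups unset; reject it.
    if match is None or (match.group(1) is None and match.group(2) is None):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

samples = ["2 h 15 m", "1 h", "45 m", "", "unknown", None]
results = [duration_to_minutes(s) for s in samples]
print(results)
```

Returning None instead of 0 lets the later dropna step remove the unparseable rows rather than keeping them as zero-minute movies.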

After that, I reshaped the data a little into the form I wanted to look at.

The data is per year, which is too much to take in, so I needed to group it into decades.

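# Decade bins. Movies released before 1970 fall outside the bins and get NaN.
bins = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
labels = ['1970-1979', '1980-1989', '1990-1999', '2000-2009', '2010-2019', '2020-2029']

df_cleaned['Year'] = df_cleaned['Release Date'].dt.year
# right=False makes each bin [start, end), so 1980 lands in '1980-1989'
df_cleaned['Decade'] = pd.cut(df_cleaned['Year'], bins=bins, labels=labels, right=False)

# One row per (movie, genre): split the comma-separated genres and explode
df_exploded = df_cleaned.copy()
df_exploded['Genres'] = df_exploded['Genres'].str.split(',')
df_exploded = df_exploded.explode('Genres')
# Strip stray spaces in case the genres are separated by ", "
df_exploded['Genres'] = df_exploded['Genres'].str.strip()

# Per decade and genre: number of movies, mean rating, mean duration, total votes
decade_genre_stats = df_exploded.groupby(['Decade', 'Genres']).agg(
    Movies_Released=('Title', 'count'),
    Average_Rating=('Rating', 'mean'),
    Average_Duration=('Duration', 'mean'),
    Total_Voters=('No of Persons Voted', 'sum')
).reset_index()

decade_genre_stats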
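The bin edges here are decade start years, so it matters which side of each bin is closed. `pd.cut` closes bins on the right by default, which puts 1980 into the '1970-1979' label and drops 1970 altogether. A small sketch on made-up years shows the difference:

```python
import pandas as pd

# Made-up years chosen to sit on the bin edges.
years = pd.Series([1969, 1970, 1979, 1980, 2019, 2020, 2024])
bins = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
labels = ['1970-1979', '1980-1989', '1990-1999', '2000-2009', '2010-2019', '2020-2029']

right_closed = pd.cut(years, bins=bins, labels=labels)              # default: (1970, 1980]
left_closed = pd.cut(years, bins=bins, labels=labels, right=False)  # [1970, 1980)

demo = pd.DataFrame({'Year': years, 'right=True': right_closed, 'right=False': left_closed})
print(demo)
```

With `right=False` the labels match the years they name. In both variants 1969 stays NaN because it is below the first edge.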

3. Analysis and insights

Now for a quick look at what I wanted to see.

And for visualization, what else would it be?

plotly is the cleanest.

import plotly.express as px


# Releases per genre, animated by decade
fig = px.bar(
    decade_genre_stats, 
    x='Genres', 
    y='Movies_Released', 
    color='Genres', 
    animation_frame='Decade', 
    title="Movies Released by Genre Across Decades",
    labels={'Movies_Released': 'Number of Movies Released'},
    height=600
)

fig.show()

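Hmm... I wanted to embed the interactive plotly chart here, but it breaks, maybe because the JSON is too long or because I copy-pasted it from Kaggle....

~~If you want to put a plotly chart on Tistory, see my other blog post~~

~~https://simbbo-blog.tistory.com/158~~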

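Anyway, the visualization comes out like the one below.

It shows how many movies of each genre were released in each decade.

Action, comedy, crime, drama, horror, mystery, romance, and thriller were plentiful, as expected,

while westerns, musicals, and sports movies are really hard to find these days.

Next, I did some extra work to look at the highest-rated movies.

That is, for each decade and genre, I pick the top-rated movie among those with at least the decade's average number of votes.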

# Drop rows outside the decade bins; otherwise astype(str) turns NaN into a 'nan' decade
df_exploded = df_exploded.dropna(subset=['Decade'])
df_exploded['Decade'] = df_exploded['Decade'].astype(str)

# explode() repeats index labels, so make them unique before using idxmax() with .loc
df_exploded = df_exploded.reset_index(drop=True)

# Average vote count per decade
average_votes_per_decade = df_exploded.groupby('Decade')['No of Persons Voted'].mean()

# Keep movies with at least the average vote count of their decade
df_top_movies_filtered = df_exploded[df_exploded['No of Persons Voted'] >= df_exploded['Decade'].map(average_votes_per_decade)]

# Top-rated movie per decade and genre
top_movies_per_decade_filtered = df_top_movies_filtered.loc[df_top_movies_filtered.groupby(['Decade', 'Genres'])['Rating'].idxmax()]

top_movies_info_filtered = top_movies_per_decade_filtered[['Decade', 'Genres', 'Title', 'Rating', 'Description', 'No of Persons Voted', 'Directed by', 'Written by']]

# If the same movie tops several genres, merge those rows into one and join the genres
top_movies_info_filtered_grouped = top_movies_info_filtered.groupby(
    ['Decade', 'Title', 'Rating', 'Description', 'No of Persons Voted', 'Directed by', 'Written by']
).agg({'Genres': lambda x: ', '.join(x)}).reset_index()

top_movies_info_filtered_grouped
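One thing to watch in this step: `explode()` repeats the original index label on every genre row of a movie, so `.loc` with the labels from `idxmax()` can return more rows than intended. Below is a minimal sketch of the same filter-then-pick idea on made-up data, assuming a `Votes` column as a stand-in for 'No of Persons Voted'. It uses `transform('mean')` in place of the `map` lookup, which is my substitution.

```python
import pandas as pd

# Hypothetical toy data; titles and numbers are made up for illustration.
movies = pd.DataFrame({
    'Decade': ['1990-1999', '1990-1999', '1990-1999', '2000-2009', '2000-2009'],
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Genres': ['Drama,Crime', 'Drama', 'Crime', 'Drama', 'Drama,Crime'],
    'Rating': [9.0, 9.5, 8.0, 7.0, 8.5],
    'Votes': [150, 10, 200, 300, 400],
})

exploded = movies.assign(Genres=movies['Genres'].str.split(',')).explode('Genres')
# Give every (movie, genre) row its own index label.
exploded = exploded.reset_index(drop=True)

# Keep rows whose vote count is at least the mean of their decade.
decade_mean = exploded.groupby('Decade')['Votes'].transform('mean')
popular = exploded[exploded['Votes'] >= decade_mean]

# Highest-rated row per (Decade, Genres) among the popular ones.
best = popular.loc[popular.groupby(['Decade', 'Genres'])['Rating'].idxmax()]
print(best[['Decade', 'Genres', 'Title', 'Rating']])
```

Movie B has the highest rating in the toy data but only 10 votes, so the vote filter removes it before the best movie per group is picked.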

It comes out like this. I tried to visualize it,

though I wasn't really happy with the result..

import plotly.express as px
import plotly.graph_objects as go

# Top 3 movies by rating in each decade
top_3_movies_per_decade = (
    top_movies_info_filtered_grouped
    .sort_values(['Decade', 'Rating'], ascending=[True, False])
    .groupby('Decade')
    .head(3)
    .reset_index(drop=True)
)

# Combine title, director, rating, and votes into one hover string
top_3_movies_per_decade['Movie Info'] = top_3_movies_per_decade.apply(
    lambda row: f"Title: {row['Title']}<br>Director: {row['Directed by']}<br>Rating: {row['Rating']}<br>Votes: {row['No of Persons Voted']}", axis=1
)

fig = px.bar(
    top_3_movies_per_decade,
    x='Decade',
    y='Rating',
    color='Title',
    hover_data=['Movie Info'],  # show the movie details on hover
    title="Top 3 Movies by Rating for Each Decade",
    barmode='group',  # place the top 3 movies of a decade side by side
    height=600,
    width=1000
)

fig.update_layout(
    xaxis_title="Decade",
    yaxis_title="Rating",
    hoverlabel=dict(
        bgcolor="white",
        font_size=12,
        font_family="Rockwell"
    )
)

# Summary table of the same movies
table = go.Figure(data=[go.Table(
    header=dict(values=["Decade", "Title", "Director", "Rating", "Votes"],
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[top_3_movies_per_decade.Decade,
                       top_3_movies_per_decade.Title,
                       top_3_movies_per_decade['Directed by'],
                       top_3_movies_per_decade.Rating,
                       top_3_movies_per_decade['No of Persons Voted']],
               fill_color='lavender',
               align='left'))
])

# Show the chart and the table
fig.show()
table.show()

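This shows which movies got the highest ratings in each decade (three per decade).

The chart wasn't pretty, so I ended up showing a table as well.

I like movies and pride myself on having watched a lot of them, but I haven't seen a single one of these..... I didn't recognize them even from the posters..

4. Conclusion and next steps

Conclusion:
It is hardly a conclusion, to be honest.
"Oh, there's a movie dataset! Is there a trend in genres? Which movies got the best ratings?"
That was all I wanted to look at, so I kept it short.
(I had no intention of doing a grand analysis.)
Next steps:
I'd like to pull out the movies I have actually watched and see how they were rated,
and, hmm... looking at which movies were popular by genre and by year also seems worth adding.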
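As a rough sketch of that second idea, here is one way to pick the most-voted title per year and genre. It assumes "popular" means the highest vote count and runs on made-up rows shaped like the exploded frame from this post. The helper `most_voted` is hypothetical.

```python
import pandas as pd

# Hypothetical rows, one per (movie, genre) pair; values are made up.
rows = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020],
    'Genres': ['Drama', 'Drama', 'Horror', 'Drama', 'Drama'],
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'No of Persons Voted': [50, 500, 20, 70, 30],
})

def most_voted(frame, n=1):
    """Return the n most-voted titles for every (Year, Genres) pair."""
    ranked = frame.sort_values('No of Persons Voted', ascending=False)
    top = ranked.groupby(['Year', 'Genres']).head(n)
    return top.sort_values(['Year', 'Genres']).reset_index(drop=True)

popular_by_year_genre = most_voted(rows, n=1)
print(popular_by_year_genre)
```

Swapping 'Year' for the 'Decade' column would give the same ranking per decade.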

5. Kaggle link

The full code and visualizations are easier to read directly in the Kaggle notebook below.

https://www.kaggle.com/code/sungbos/simple-eda-and-view-movies-1910-2024-metacritic
