Artificial Intelligence / Boostcamp AI Tech 2022. 1. 21. 18:38


In [1]:

from IPython.display import display, HTML

display(HTML("<style>.container { width:90% !important; }</style>"))

# Widen the notebook container so it fits the browser window

4. Python Data Handling¶

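CSV, Web (HTML), XML, JSON

CSV (Comma-Separated Values)¶

A data format for using spreadsheet-style (Excel) data regardless of the program
Fields can also be separated by tabs (TSV), spaces (SSV), and so on; collectively these are called character-separated values (CSV)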

In [1]:

line_counter = 0 # counts the total number of lines in the file
data_header = [] # list holding the field names of the data
customer_list = [] # list holding one list per customer

In [18]:

with open("customers.csv") as customer_data:
    while True:
        data = customer_data.readline() # read customers.csv one line at a time into data
        if not data: break
        if line_counter==0: # the first line holds the field names
            data_header = data.split(",")
        else: # every other line is a data row (note: split also cuts commas inside quoted fields)
            customer_list.append(data.split(","))
        line_counter +=1

In [19]:

print("Header: \t", data_header)
for i in range(0, 10):
    print("Data",i,":\t", customer_list[i])
print(len(customer_list))

Header: 	 ['customerNumber', 'customerName', 'contactLastName', 'contactFirstName', 'phone', 'addressLine1', 'addressLine2', 'city', 'state', 'postalCode', 'country', 'salesRepEmployeeNumber', 'creditLimit\n']
Data 0 :	 ['103', '"Atelier graphique"', 'Schmitt', '"Carine "', '40.32.2555', '"54', ' rue Royale"', 'NULL', 'Nantes', 'NULL', '44000', 'France', '1370', '21000\n']
Data 1 :	 ['112', '"Signal Gift Stores"', 'King', 'Jean', '7025551838', '"8489 Strong St."', 'NULL', '"Las Vegas"', 'NV', '83030', 'USA', '1166', '71800\n']
Data 2 :	 ['114', '"Australian Collectors', ' Co."', 'Ferguson', 'Peter', '"03 9520 4555"', '"636 St Kilda Road"', '"Level 3"', 'Melbourne', 'Victoria', '3004', 'Australia', '1611', '117300\n']
Data 3 :	 ['119', '"La Rochelle Gifts"', 'Labrune', '"Janine "', '40.67.8555', '"67', ' rue des Cinquante Otages"', 'NULL', 'Nantes', 'NULL', '44000', 'France', '1370', '118200\n']
Data 4 :	 ['121', '"Baane Mini Imports"', 'Bergulfsen', '"Jonas "', '"07-98 9555"', '"Erling Skakkes gate 78"', 'NULL', 'Stavern', 'NULL', '4110', 'Norway', '1504', '81700\n']
Data 5 :	 ['124', '"Mini Gifts Distributors Ltd."', 'Nelson', 'Susan', '4155551450', '"5677 Strong St."', 'NULL', '"San Rafael"', 'CA', '97562', 'USA', '1165', '210500\n']
Data 6 :	 ['125', '"Havel & Zbyszek Co"', 'Piestrzeniewicz', '"Zbyszek "', '"(26) 642-7555"', '"ul. Filtrowa 68"', 'NULL', 'Warszawa', 'NULL', '01-012', 'Poland', 'NULL', '0\n']
Data 7 :	 ['128', '"Blauer See Auto', ' Co."', 'Keitel', 'Roland', '"+49 69 66 90 2555"', '"Lyonerstr. 34"', 'NULL', 'Frankfurt', 'NULL', '60528', 'Germany', '1504', '59700\n']
Data 8 :	 ['129', '"Mini Wheels Co."', 'Murphy', 'Julie', '6505555787', '"5557 North Pendale Street"', 'NULL', '"San Francisco"', 'CA', '94217', 'USA', '1165', '64600\n']
Data 9 :	 ['131', '"Land of Toys Inc."', 'Lee', 'Kwai', '2125557818', '"897 Long Airport Avenue"', 'NULL', 'NYC', 'NY', '10022', 'USA', '1323', '114900\n']
860

Extracting only USA customers¶

In [7]:

line_counter = 0 # counts the total number of lines in the file
data_header = [] # list holding the field names of the data
employee = []
customer_USA_only_list = []
customer = None

In [8]:

with open("customers.csv") as customer_data:
    while True:
        data = customer_data.readline() # read customers.csv one line at a time into data
        if not data: break
        if line_counter==0: # the first line holds the field names
            data_header = data.split(",")
        else:
            customer = data.split(",")
            if customer[10].upper() == "USA": # keep only customers whose country field is USA
                customer_USA_only_list.append(customer)
        line_counter +=1

In [10]:

print("Header: \t", data_header)
for i in range(0, 10):
    print("Data",i,":\t\t", customer_USA_only_list[i])
print(len(customer_USA_only_list))

Header: 	 ['customerNumber', 'customerName', 'contactLastName', 'contactFirstName', 'phone', 'addressLine1', 'addressLine2', 'city', 'state', 'postalCode', 'country', 'salesRepEmployeeNumber', 'creditLimit\n']
Data 0 :		 ['112', '"Signal Gift Stores"', 'King', 'Jean', '7025551838', '"8489 Strong St."', 'NULL', '"Las Vegas"', 'NV', '83030', 'USA', '1166', '71800\n']
Data 1 :		 ['124', '"Mini Gifts Distributors Ltd."', 'Nelson', 'Susan', '4155551450', '"5677 Strong St."', 'NULL', '"San Rafael"', 'CA', '97562', 'USA', '1165', '210500\n']
Data 2 :		 ['129', '"Mini Wheels Co."', 'Murphy', 'Julie', '6505555787', '"5557 North Pendale Street"', 'NULL', '"San Francisco"', 'CA', '94217', 'USA', '1165', '64600\n']
Data 3 :		 ['131', '"Land of Toys Inc."', 'Lee', 'Kwai', '2125557818', '"897 Long Airport Avenue"', 'NULL', 'NYC', 'NY', '10022', 'USA', '1323', '114900\n']
Data 4 :		 ['151', '"Muscle Machine Inc"', 'Young', 'Jeff', '2125557413', '"4092 Furth Circle"', '"Suite 400"', 'NYC', 'NY', '10022', 'USA', '1286', '138500\n']
Data 5 :		 ['157', '"Diecast Classics Inc."', 'Leong', 'Kelvin', '2155551555', '"7586 Pompton St."', 'NULL', 'Allentown', 'PA', '70267', 'USA', '1216', '100600\n']
Data 6 :		 ['161', '"Technics Stores Inc."', 'Hashimoto', 'Juri', '6505556809', '"9408 Furth Circle"', 'NULL', 'Burlingame', 'CA', '94217', 'USA', '1165', '84600\n']
Data 7 :		 ['168', '"American Souvenirs Inc"', 'Franco', 'Keith', '2035557845', '"149 Spinnaker Dr."', '"Suite 101"', '"New Haven"', 'CT', '97823', 'USA', '1286', '0\n']
Data 8 :		 ['173', '"Cambridge Collectables Co."', 'Tseng', 'Jerry', '6175555555', '"4658 Baden Av."', 'NULL', 'Cambridge', 'MA', '51247', 'USA', '1188', '43400\n']
Data 9 :		 ['175', '"Gift Depot Inc."', 'King', 'Julie', '2035552570', '"25593 South Bay Ln."', 'NULL', 'Bridgewater', 'CT', '97562', 'USA', '1323', '84300\n']
34

In [11]:

with open("customers_USA_only.csv", "w") as customer_USA_only_csv:
    for customer in customer_USA_only_list:
        customer_USA_only_csv.write(",".join(customer).strip('\n') + '\n')

In [12]:

customer_USA_only_list[0]

Out[12]:

['112',
 '"Signal Gift Stores"',
 'King',
 'Jean',
 '7025551838',
 '"8489 Strong St."',
 'NULL',
 '"Las Vegas"',
 'NV',
 '83030',
 'USA',
 '1166',
 '71800\n']

csv objects¶

When data is processed as a plain text file, characters such as "," that appear inside a field need preprocessing; Python provides csv objects to handle this
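To make the difference concrete, here is a minimal sketch that parses one record both ways. The record is shaped like the `customers.csv` rows printed earlier, shortened for illustration; `io.StringIO` only stands in for an open file.

```python
import csv
import io

# One record shaped like the customers.csv rows above (shortened);
# the address field contains a comma inside double quotes.
line = '103,"Atelier graphique",Schmitt,"Carine ",40.32.2555,"54, rue Royale",NULL,Nantes\n'

# Plain split cuts at every comma, including the one inside the quotes.
naive_fields = line.split(",")

# csv.reader accepts any iterable of lines and respects the quotes.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields[5:7])
print(len(parsed_fields), repr(parsed_fields[5]))
```

With `split` the address is broken into two pieces and every later column shifts by one, which is why an index such as `customer[10]` is only reliable for rows without embedded commas.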

lineterminator : the line-break string written at the end of each row, \r\n (default)
quotechar : the character that wraps a field, " (default)
quoting : which fields get wrapped in quotechar, QUOTE_MINIMAL (default)
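The effect of these three options is easiest to see on a writer. The sketch below writes the same row with different settings into an in-memory buffer; the row values and the helper name `render` are made up for illustration.

```python
import csv
import io

# Illustrative row: one field contains a comma, one is a number.
row = ["112", "Signal Gift Stores", "8489 Strong St., Suite 1", 71800]

def render(**fmt):
    """Write the row with the given csv formatting options and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, **fmt).writerow(row)
    return buffer.getvalue()

minimal = render()  # defaults: QUOTE_MINIMAL, quotechar '"', lineterminator "\r\n"
quote_all = render(quoting=csv.QUOTE_ALL, lineterminator="\n")
single_quotes = render(quotechar="'", quoting=csv.QUOTE_ALL, lineterminator="\n")

print(repr(minimal))
print(repr(quote_all))
print(repr(single_quotes))
```

QUOTE_MINIMAL wraps only the field that needs it (the one containing the delimiter), while QUOTE_ALL wraps every field, numbers included.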

In [22]:

import csv

seoung_nam_data = []
header = []
rownum = 0

In [13]:

reader = csv.reader(f,
                   delimiter=',' , quotechar='"',
                   quoting=csv.QUOTE_ALL) # every field is wrapped in quotes

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-8f71f5cd3d17> in <module>
      1 import csv
      2 
----> 3 reader = csv.reader(f,
      4                    delimiter=',' , quotechar='"',
      5                    quoting=csv.QUOTE_ALL) # every field is wrapped in quotes

NameError: name 'f' is not defined

The NameError is raised because f was never defined: csv.reader expects an already opened file object (or any other iterable of lines) as its first argument.

In [27]:

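with open("foot_korean.csv", "r", encoding='utf8', newline='') as p_file:
    csv_data = csv.reader(p_file)
    for row in csv_data:
        if rownum == 0:
            header = row # the first row holds the field names
        # Extract the administrative-district field (column 7).
        # The file is opened as UTF-8, so no cp949 conversion is needed in Python 3.
        location = row[7]
        if location.find(u"성남시") != -1: # u marks a unicode literal (optional in Python 3); "성남시" is Seongnam-si
            seoung_nam_data.append(row)
        rownum += 1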

In [28]:

with open("seoung_nam_floating_population_data.csv", "w", encoding='utf8', newline='') as s_p_file:
    writer = csv.writer(s_p_file, delimiter='\t', quotechar="'", quoting=csv.QUOTE_ALL)
    # Create the CSV file with csv.writer: delimiter is the field separator,
    # quotechar is the character that wraps each field, quoting sets which fields get wrapped
    writer.writerow(header)
    for row in seoung_nam_data:
        writer.writerow(row)
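As a possible follow-up, the USA-only filter from the previous section can be redone with the csv module so that quoted commas no longer shift the columns. This is a sketch under assumptions: the three sample rows and the reduced column set are made up, and `csv.DictReader` / `csv.DictWriter` are swapped in for the index-based access used above.

```python
import csv
import io

# Hypothetical sample with a subset of the customers.csv columns.
sample = io.StringIO(
    'customerNumber,customerName,city,country\n'
    '103,"Atelier graphique",Nantes,France\n'
    '112,"Signal Gift Stores","Las Vegas",USA\n'
    '114,"Australian Collectors, Co.",Melbourne,Australia\n'
)

# DictReader uses the first row as field names, so columns are looked up by name.
reader = csv.DictReader(sample)
usa_customers = [row for row in reader if row["country"].strip().upper() == "USA"]

# Write the filtered rows back out with the same header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(usa_customers)
print(out.getvalue())
```

Looking fields up by name avoids depending on a fixed position such as index 10, which breaks as soon as an earlier field contains a comma.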
        

Web¶

HTML¶

A language for representing information on the web in a structured way

Every HTML document has a tree-shaped containment (nesting) structure
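The tree structure can be made visible with the standard library alone. The sketch below records each opening tag with its nesting depth; the class name `TreePrinter` and the sample page are made up, and it assumes every tag is explicitly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back to the parent's level

page = "<html><head><title>Demo</title></head><body><h1>Hi</h1><p>Text</p></body></html>"
parser = TreePrinter()
parser.feed(page)

for depth, tag in parser.outline:
    print("  " * depth + tag)
```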

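Regular Expressions¶

Also called regexp or regex

A notation for defining complex string patterns
Used to extract the set of strings that follow a particular rule

Using a regex playground¶

1) Go to a regex playground (http://www.regexr.com/)

2) Paste the document you want to test into the Text field

3) Search it with a regular expression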

Basic regex syntax 1¶

Character class [ ] : matches any one of the characters between [ and ]

ex) [abc] <- the character is one of a, b, c

"-" can be used to specify a range.

ex) [a-zA-Z] - all letters of the alphabet, [0-9] - all digits
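A short sketch of character classes with Python's `re` module follows; the sample strings are made up. It also shows why the upper bound of the range matters: `A-z` spans the ASCII punctuation that sits between `Z` and `a`.

```python
import re

text = "Order A7 shipped to cab 42 on day 3"

# [abc] matches a single character that is a, b, or c (case-sensitive).
abc_hits = re.findall(r"[abc]", text)

# [0-9] matches a single digit.
digits = re.findall(r"[0-9]", text)

# [a-zA-Z] matches letters only; [A-z] also lets through characters
# such as "_" and "[" because they lie between "Z" and "a" in ASCII.
letters = re.findall(r"[a-zA-Z]", "a_Z[9")
too_wide = re.findall(r"[A-z]", "a_Z[9")

print(abc_hits, digits, letters, too_wide)
```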

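Basic regex syntax - metacharacters¶

. - matches any single character except the newline \n, e.g. a.b (inside a character class, as in a[.]b, the dot is a literal ".")
'*' - the preceding character may repeat zero or more times

ex) tomor*ow : tomorrow, tomoow, tomorrrrow

'+' - the preceding character repeats at least once
{m,n} - specifies the repetition count (from m to n times)

ex) {1,}, {0,}, {1,3}, [0-9]{1,3}, \d{1,3}

? - the preceding character appears zero or one time

ex) 01[01]?-[0-9]{4}-[0-9]{4}

| - or

ex) (0|1){3}

^ - not, when it is the first character inside a character class (e.g. [^0-9]); outside a class it anchors the match to the start of the string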
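The metacharacters above can be checked one by one with `re`. This is a minimal sketch; the helper `matches` and all test strings are made up for illustration.

```python
import re

def matches(pattern, words):
    """Return the words that match the pattern from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# . is any character except newline; [.] is a literal dot
any_char = re.findall(r"a.b", "a-b a.b a\nb")
literal_dot = re.findall(r"a[.]b", "a-b a.b a\nb")

# * allows zero r's, + requires at least one
star = matches(r"tomor*ow", ["tomoow", "tomorrow", "tomorrrrow"])
plus = matches(r"tomor+ow", ["tomoow", "tomorrow", "tomorrrrow"])

# {m,n} is greedy: it takes up to three digits at a time
runs = re.findall(r"\d{1,3}", "7 42 12345")

# ? makes the third digit optional
phones = matches(r"01[01]?-[0-9]{4}-[0-9]{4}",
                 ["010-1234-5678", "01-1234-5678", "0100-1234-5678"])

# | is alternation; ^ inside a class negates it
bits = matches(r"(0|1){3}", ["101", "102", "1010"])
non_digits = re.findall(r"[^0-9]", "a1b2")

print(any_char, literal_dot, star, plus, runs, phones, bits, non_digits)
```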

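Regex in Python¶

Import the re module to use it

Functions: search finds only the first match, findall finds every match
findall returns the matches as a list; each item is a tuple when the pattern contains more than one group
ID pattern : one or more letters (upper or lower case) or digits, ending with asterisks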
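The difference between `search` and `findall` can be tried offline. In this sketch the HTML snippet is made up (the masked IDs are borrowed from the output printed in this post), so no network access is needed.

```python
import re

# Made-up snippet standing in for a downloaded page.
html_contents = "<td>codo***</td><td>outb7***</td><td>dubba4***</td>"
pattern = r"([A-Za-z0-9]+\*\*\*)"

# search returns a match object for the first hit (or None if nothing matches).
first = re.search(pattern, html_contents)

# findall returns a list with every hit.
all_ids = re.findall(pattern, html_contents)

# With two groups in the pattern, each list item becomes a tuple.
pairs = re.findall(r"([A-Za-z]+)([0-9]*)\*\*\*", html_contents)

print(first.group(1))
print(all_ids)
print(pairs)
```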

In [2]:

# ID pattern, e.g. the regex "([A-Za-z0-9]+\*\*\*)"

import re
import urllib.request

url = "https://bit.ly/3rxQFS4"
html = urllib.request.urlopen(url)
html_contents = str(html.read())
id_results = re.findall(r"([A-Za-z0-9]+\*\*\*)", html_contents)

In [3]:

for result in id_results:
    print(result)

codo***
outb7***
dubba4***
multicuspi***
crownm***
triformo***
spania***
magazin***
presby***
trophody***
nontr***
enranck***
canc***
uncanker***
wrymo***
non***
luminat***
oblig***
anna***
hyperth***
toplabl***
dolce0***
rudals2***
jjw980***
elvlz***
skmid***
qkep***
kisslov***
maskman***
sungt***

XML¶

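A language that expresses the structure and meaning of data using tags (markup)
The data has a tree structure
It can be parsed with regular expressions, but BeautifulSoup is commonly used

You need to know how to read XML before you can work with it comfortably, so it is worth spending a little time studying it.

ex) When the structure is nested two levels deep, the code has to step through both levels to get the value out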

In [2]:

from bs4 import BeautifulSoup

soup = BeautifulSoup(books_xml, "lxml") # create the object: (document, parser to use)

soup.find_all("author") # find_all looks up tags; find every "author" tag

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
C:\Users\Public\Documents\ESTsoft\CreatorTemp/ipykernel_16248/8450493.py in <module>
      1 from bs4 import BeautifulSoup
      2 
----> 3 soup = BeautifulSoup(books_xml, "lxml") # create the object: (document, parser to use)
      4 
      5 soup.find_all("author") # find_all looks up tags
      
NameError: name 'books_xml' is not defined

The NameError is raised because books_xml was never assigned; it has to hold the XML document as a string before BeautifulSoup is called.

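Using the BeautifulSoup module¶

find_all : like a regex findall, returns every element that matches
find('invention-title') : returns only the first matching tag
get_text() : returns the value of the matched element (the text between the opening and closing tags)

A nested XML structure has to be unpacked level by level.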
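The level-by-level idea can be sketched without the patent file. The example below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages; the XML fragment and its values are made up, with tag names mirroring the ones used in this post, and `read_document_id` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; only the tag names follow the patent file in this post.
patent_xml = """
<us-patent-grant>
  <publication-reference>
    <document-id>
      <country>US</country>
      <doc-number>00000001</doc-number>
      <kind>B2</kind>
      <date>20140107</date>
    </document-id>
  </publication-reference>
  <application-reference>
    <document-id>
      <country>US</country>
      <doc-number>00000002</doc-number>
      <date>20120101</date>
    </document-id>
  </application-reference>
</us-patent-grant>
"""

root = ET.fromstring(patent_xml)

def read_document_id(reference_tag):
    """Step into <document-id> under the given reference tag and collect its children."""
    document_id = root.find(reference_tag).find("document-id")
    return {child.tag: child.text for child in document_id}

publication = read_document_id("publication-reference")
application = read_document_id("application-reference")

print(publication)
print(application)
```

Both references contain a `<document-id>` with the same child tags, so the outer tag has to be selected first; searching for `doc-number` directly from the root would only ever return the first one.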

In [ ]:

# patent_ex_2

from bs4 import BeautifulSoup

with open("US08621662-20140107.XML", "r", encoding="utf8") as patent_xml:
    xml = patent_xml.read()  # read the whole file into a string

soup = BeautifulSoup(xml, "xml")  # use the XML parser (requires lxml)

invention_title_tag = soup.find("invention-title")
print(invention_title_tag.get_text())

publication_reference_tag = soup.find("publication-reference")
# Nested structure: step into <document-id> first, then read its children.
p_document_id_tag = publication_reference_tag.find("document-id")
p_country = p_document_id_tag.find("country").get_text()
p_doc_number = p_document_id_tag.find("doc-number").get_text()
p_kind = p_document_id_tag.find("kind").get_text()
p_date = p_document_id_tag.find("date").get_text()

print(p_doc_number, p_kind, p_date)

application_reference_tag = soup.find("application-reference")
# Search inside the application reference, not the publication reference.
a_document_id_tag = application_reference_tag.find("document-id")
a_country = a_document_id_tag.find("country").get_text()
a_doc_number = a_document_id_tag.find("doc-number").get_text()
a_date = a_document_id_tag.find("date").get_text()

print(a_country, a_doc_number, a_date)

JSON(JavaScript Object Notation)¶

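Concise, which makes it easy to understand

The json module makes parsing and saving easy
Data is written and read in a form that converts directly to and from the dict type
Most web APIs use JSON to exchange information

JSON Read¶

Check the structure of the JSON file -> load it -> handle it like a dict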

In [ ]:

import json

with open("json_example.json", "r", encoding="utf8") as f:
    contents = f.read()
    json_data = json.loads(contents) # parse the text and convert it to a dict
    print(type(json_data))
    
for employee in json_data["employees"]:
    print(employee["lastName"]) # pull lastName out of each entry in employees
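The same flow can be tried without a file on disk. In this sketch the JSON text is made up; only the keys `employees` and `lastName` come from the cell above.

```python
import json

# Made-up JSON text with the same shape the cell above expects.
contents = """
{"employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"}
]}
"""

json_data = json.loads(contents)  # JSON object -> dict, JSON array -> list
last_names = [employee["lastName"] for employee in json_data["employees"]]

print(type(json_data).__name__, last_names)
```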

JSON Write¶

Store the data as a dict -> write it out with the json module

In [10]:

import json

dict_data = {'Name' :'Zara', 'Age':7, 'Class': 'First'}

with open("data.json", "w") as f:
    json.dump(dict_data, f) # write the dict to the file with dump
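To check what gets written without touching the file system, `json.dumps` produces the same serialization as a string. The sketch below also shows a round trip back to a dict and, as an assumption about what a reader of Korean data may want, the `ensure_ascii=False` option that keeps non-ASCII text readable.

```python
import json

dict_data = {"Name": "Zara", "Age": 7, "Class": "First"}

text = json.dumps(dict_data)   # same output as json.dump, but returned as a str
restored = json.loads(text)    # reading it back gives an equal dict

# By default non-ASCII characters are escaped; ensure_ascii=False keeps them as-is.
escaped = json.dumps({"city": "성남시"})
readable = json.dumps({"city": "성남시"}, ensure_ascii=False)

print(text)
print(restored == dict_data)
print(escaped)
print(readable)
```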
