7. 웹 API를 이용한 크롤링

파이썬을 이용한 머신러닝, 딥러닝 실전개발 입문

파이썬을 이용한 머신러닝, 딥러닝 실전개발 입문](http://wikibook.co.kr/python-machine-learning/) 책과 강의를 바탕으로한 코드 리뷰 및 정리입니다. 자세한 내용은 책과 강의를 참고해주세요.

link: github

2. 고급 스크레이핑

2-3. 웹 API로 데이터 추출하기

웹 API

어떤 사이트가 가지고 있는 기능을 외부에서도 쉽게 사용할 수 있게 공개한 것 (= 웹에 공개된 데이터 : OpenAPI)
XML , JSON 형식의 데이터를 쉽게 처리할 수 있음

웹 API를 사용해보기 - OpenWeatherMap의 날씨 정보

http://openweathermap.org

[sign up] 을 눌러 가입을 한 후 [API keys] 탭에서 API키를 확인할 수 있다.

import requests
import json
  
#API키를 지정한다. 
apikey="46301d0111991e197561e9a50eb87d7b"
  
#날씨를 확인할 도시 지정
cities=["Seoul,KR","Tokyo,JP", "New York,US"]
  
#API지정
api="http://api.openweathermap.org/data2.5/weather?q={city}&APPID={key}"
  
#K->C 함수
k2c=lambda k:k-273.15
  
#각 도시정보 추출
for name in cities:
    #API의 url 구성
    url=api.format(city=name,key=apikey)
    #API에 요청을 보내 데이터 추출
    r=requests.get(url)
    print(r.text)
    #결과를 json형식으로 변환
    data=json.loads(r.text)
      
    #결과출력
    print("+도시=",data["name"])
    print("|날씨=",data["weather"][0]["description"])
    print("|최저기온=",k2c(data["main"]["temp_min"]))
    print("|최고기온=",k2c(data["main"]["temp_max"]))
    print("|습도=",data["main"]["humidity"])
    print("|기압=",data["main"]["pressure"])
    print("|풍향=",data["wind"]["deg"])
    print("|풍속=",data["wind"]["speed"])
    print("")
      
      
      

아래는 api로 불러온 데이터 양식이다. 아래에서 추출하고 싶은 데이터만 추출해 오면 된다.

{
	"coord":{"lon":126.98,"lat":37.57},
	"weather":[{"id":801,"main":"Clouds","description":"few clouds","icon":"02d"}],
	"base":"stations",
	"main":{"temp":296.59,"pressure":1018,"humidity":31,"temp_min":295.15,"temp_max":297.15},
	"visibility":10000,
	"wind":{"speed":2.1,"deg":10},
	"clouds":{"all":20},"dt":1538026200,
	"sys":{"type":1,"id":7668,"message":0.0067,"country":"KR","sunrise":1537997032,"sunset":1538040108},
	"id":1835848,
	"name":"Seoul",
	"cod":200
}

국내에서 사용할 수 있는 웹 API

http://www.apistore.co.kr/api/apiList.do
http://mashup.or.kr/business/main/main.do

네이버, 카카오 개발자 센터

https://developers.naver.com/main
https://developers.kakao.com/

쇼핑정보

http://api.danawa.com/main/index.html
http://developer.auction.co.kr/

주소전환

http://www.juso.go.kr/openIndexPage.do
http://biz.epost.go.kr/customCenter/custom/custom_9.jsp?subGugun=sub_3&subGubun_1=cum_17&gubun=m07

HTML Parsing 시 주의점

어떠한 주소 / robots.txt : 이 사이트에서 얘기하는 이런거는 가져가 주시지 말라는 걸 적어놓음.

ex) http://www.naver.com/robots.txt

User-agent: *
Disallow: /
Allow : /$ 

ex)http://www.yes24.com/robots.txt

User-agent: *
Disallow: /templates/
Disallow: /member/
Allow: /

User-agent: Googlebot
Disallow: /*.pdf$

User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: Ezooms
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: ltx71
Disallow: /

데이터를 수집하는 것에 대한 법적인 문제가 없는지 검토 필요.