Crawling

[udemy] Web Scraping with Python: BeautifulSoup, Requests & Selenium: Study Notes

bluebamus 2025. 2. 23.

1. Python Refresher: Data Structures

1) List Comprehensions

- Building a 2-D list with nested for loops

'''
0   1  2  3  4
5   6  7  8  9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24

'''
task_list = []

for row in range(0,25,5):
    inner_list = []
    for column in range(row,row+5):
        inner_list.append(column)
    task_list.append(inner_list)

for row in task_list:
    #print(row)
    pass


# Use list comprehension to make the same 2-d List

new_list = [[column for column in range(row,row+5)]for row in range(0,25,5)]           # [ value which is to be added -- loop ]

for row in new_list:
    print(row)

2) if else and List Comprehensions

- Expressions that combine for and if

a = 20
if a == 20:
    print('a is 20')
else:
    print('a is not 20')

# Using in-line if else statements

#print('a is 20' if a == 20 else 'a is not 20')          # do something if condition else do something

b = True if a == 20 else False
print(b)

# Using in-line if else statements in list comprehensions

num = [value for value in range(-5,5)]
print(num)


positive_num = [value for value in num if value > 0]    # keep only values greater than 0
print(positive_num)
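A trailing `if` in a comprehension filters items, while an inline if/else in the output expression transforms every item. The sketch below contrasts the two; the names `positives` and `labels` are chosen here only for illustration.

```python
num = list(range(-5, 5))

# Filter: the trailing `if` drops values that fail the test.
positives = [value for value in num if value > 0]

# Transform: an inline if/else in the output expression keeps every value
# and chooses what to emit for each one.
labels = ['positive' if value > 0 else 'non-positive' for value in num]

print(positives)                 # [1, 2, 3, 4]
print(labels.count('positive'))  # 4
```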

3) Working with Excel - Writing

- To install XlsxWriter for writing to Excel files:

pip install XlsxWriter

from xlsxwriter import Workbook     # necessary import

# make workbook

workbook = Workbook('first_file.xlsx')

# add work sheet

worksheet = workbook.add_worksheet()

# write function - parameters - ( row,column, value )

for row in range(200):
    worksheet.write(row,0,'Row Number')
    worksheet.write(row,1,row)

# close workbook

workbook.close()

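4) Working with Excel - Reading

- To install xlrd for reading Excel files (note: xlrd 2.0 and later read only the legacy .xls format, so for .xlsx files a library such as openpyxl is the usual replacement):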

pip install xlrd

import xlrd     # import

# open workbook

workbook = xlrd.open_workbook('first_file.xlsx')     # same file name that was written above

# get sheet - method - sheet_by_index(index parameter)

worksheet = workbook.sheet_by_index(0)

# find total no of rows - .nrows

rows = worksheet.nrows

# read rows - row_values(row number)

for row in range(rows):
    first_col,second_col = worksheet.row_values(row)
    print(first_col,'    ',second_col)

2. Environment Setup

1) Installing beautifulsoup4 and requests

pip install beautifulsoup4

pip install requests

3. Introduction to Beautiful Soup Python Library

1) The BeautifulSoup features parameter

Parser	Description	Installation required	Characteristics
"lxml"	Fastest and most powerful parser	Requires pip install lxml	Very fast; this name selects lxml's HTML parser (use "xml" for XML)
"html.parser"	Python's built-in HTML parser	Included with Python	Slower than lxml, but needs no extra installation
"html5lib"	Parses HTML the way a web browser does	Requires pip install html5lib	Most accurate HTML parsing; can clean up badly structured HTML; slow
"xml"	XML parsing using lxml	Requires pip install lxml	Parses XML strictly
"lxml-xml"	lxml-based XML parser	Requires pip install lxml	Same as "xml"

2) Getting data from the document

- The meta tag in the source HTML

<head>
    <meta charset="UTF-8">
    <title>Tags</title>
</head>

- Reading the meta tag

meta = soup.meta  # soup: a BeautifulSoup object built from the HTML above

print(meta)
# = <meta charset="UTF-8"/>

print(meta.get('charset'))
# = UTF-8

print(meta['charset'])
# = UTF-8

- Updating an attribute

body = soup.body

body['style'] = 'some style'

print(body['style'])

# = some style
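The attribute reads and the update above can be tried end to end. This is a minimal sketch that assumes the built-in 'html.parser' and a small inline HTML string made up for illustration, not the course's file.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head>
    <meta charset="UTF-8">
    <title>Tags</title>
</head>
<body><p>hello</p></body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

meta = soup.meta
charset_get = meta.get('charset')   # .get() returns None when the attribute is missing
charset_idx = meta['charset']       # indexing raises KeyError when it is missing
missing = meta.get('name')          # this tag has no "name" attribute

body = soup.body
body['style'] = 'some style'        # creates the attribute because it did not exist yet

print(charset_get, charset_idx, missing)   # UTF-8 UTF-8 None
print(body.attrs)                          # {'style': 'some style'}
```

Prefer `.get()` when the attribute may be absent, since it avoids the KeyError.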

- Updating the string inside a specific tag

from bs4 import BeautifulSoup

def read_file():
    # the with statement closes the file even if reading fails
    with open('intro_to_soup_html.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# Navigable strings

# string inside a tag   - .string

title = soup.title

#print(title)

#print(title.string)

# .replace_with("") function            -- navigable string

print(title)

title.string.replace_with("title has been changed")

print(title)

4. Navigating with Beautiful Soup - Going Down

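- soup.contents also includes text nodes such as "\n", whereas soup.head and soup.body return only the corresponding tag object.
- .descendants: iterates over every nested element below a tag (children, grandchildren, and so on), returned as a generator.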
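The difference between .contents and .descendants is easier to see on a tiny document. A minimal sketch, assuming the built-in 'html.parser' and an inline HTML string made up for illustration:

```python
from bs4 import BeautifulSoup, Tag

html = "<html><head><title>Tags</title></head>\n<body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

html_tag = soup.html
# .contents lists direct children only, including the "\n" text node
# that sits between </head> and <body>.
print(html_tag.contents)

# .descendants walks every nested node, tags and strings alike.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]
descendant_strings = [str(node) for node in soup.body.descendants if not isinstance(node, Tag)]

print(descendant_tags)     # ['p', 'b']
print(descendant_strings)  # ['Hello ', 'world']
```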

5. Regular Expressions with Python

1) re.compile(pattern, flags=0):

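- re.compile() compiles a regular expression pattern and returns a regular expression (pattern) object.
- Compiling in advance lets you reuse the pattern. When the same pattern is used many times, the compiled object can offer a performance benefit.

- For example:

import re

pattern = re.compile(r'\b\w+\b')  # compile a pattern that matches one word delimited by word boundaries
result = pattern.findall('Hello, world!')  # run the match with the compiled pattern -> ['Hello', 'world']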

2) re.match(pattern, string, flags=0):

- re.match() checks whether the regular expression pattern matches at the beginning of the given string.

- The pattern must match starting from the first character; if it does, a match object is returned. Otherwise None is returned.

- For example:

import re

pattern = r'\b\w+\b'  # pattern that matches one word delimited by word boundaries
string = 'Hello, world!'
match_obj = re.match(pattern, string)

if match_obj:
    print(f'Matched: {match_obj.group()}')  # prints the first word matched by the pattern
else:
    print('No match')
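Because re.match() only looks at the start of the string, it helps to compare it with re.search(), which scans the whole string. A short sketch using only the standard library; the variable names are chosen here for illustration:

```python
import re

text = 'Hello, world!'

at_start = re.match(r'\w+', text)        # 'Hello' sits at index 0, so this matches
not_at_start = re.match(r'world', text)  # 'world' is not at the start, so this is None
anywhere = re.search(r'world', text)     # search scans the whole string

print(at_start.group())   # Hello
print(not_at_start)       # None
print(anywhere.span())    # (7, 12)
```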

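3) The ^ symbol: start anchor vs. negation

- Inside [], ^ means negation.

- re.compile('[^a-zA-Z]') is a pattern that matches any single character other than a lowercase or uppercase letter.

- Outside [], ^ means the start of the string.

- ^[a-zA-Z] matches a letter at the very start of the string.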
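The two meanings of ^ can be checked side by side. A minimal sketch with the standard library; the sample strings are made up for illustration:

```python
import re

non_letters = re.compile(r'[^a-zA-Z]')         # ^ inside []: negation
starts_with_letter = re.compile(r'^[a-zA-Z]')  # ^ outside []: start of the string

print(non_letters.findall('ab1 c2!'))           # ['1', ' ', '2', '!']
print(bool(starts_with_letter.search('abc')))   # True
print(bool(starts_with_letter.search('1abc')))  # False
```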

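4) Special sequences

- Patterns are written as raw strings (r'...') so that Python does not treat the backslash as a string escape.
- For str patterns in Python 3 these classes are Unicode-aware by default; the bracketed ASCII equivalents below are exact only with the re.ASCII flag.

- Digit (\d): matches any decimal digit, [0-9]
- regex = re.compile(r'\d')

- Non-digit (\D): matches any non-digit character, [^0-9]
- regex = re.compile(r'\D')

- Whitespace character (\s)
- regex = re.compile(r'\s')

- Non-whitespace character (\S)
- regex = re.compile(r'\S')

- Word character (\w): letters, digits, and underscore, [a-zA-Z0-9_]
- regex = re.compile(r'\w')

- Non-word character (\W): [^a-zA-Z0-9_]
- regex = re.compile(r'\W')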
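The special sequences can be exercised on one sample string. A minimal sketch with the standard library; the sample text is made up for illustration, and the last two lines show how the re.ASCII flag narrows \w:

```python
import re

sample = 'Room 42, floor_3!'

digits = re.findall(r'\d', sample)      # ['4', '2', '3']
spaces = re.findall(r'\s', sample)      # [' ', ' ']
non_words = re.findall(r'\W', sample)   # [' ', ',', ' ', '!']
words = re.findall(r'\w+', sample)      # ['Room', '42', 'floor_3']

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r'\w+', 'caf\u00e9')
ascii_words = re.findall(r'\w+', 'caf\u00e9', re.ASCII)

print(digits, spaces, non_words, words)
print(unicode_words, ascii_words)
```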

5) Using *, +, ?, {n}, {n,}, {n,m}

1. *: the preceding pattern repeats zero or more times.

- For example, a* matches any string in which 'a' repeats zero or more times.
- Pattern: a* -> matches: "", "a", "aa", "aaa", ...

2. +: the preceding pattern repeats one or more times.

- For example, a+ matches any string in which 'a' repeats one or more times.
- Pattern: a+ -> matches: "a", "aa", "aaa", ...

3. ?: the preceding pattern appears zero times or once.

- For example, a? matches a string in which 'a' appears zero times or once.
- Pattern: a? -> matches: "", "a"

4. {n}: the preceding pattern repeats exactly n times.
- For example, a{3} matches a string in which 'a' repeats exactly 3 times.
- Pattern: a{3} -> matches: "aaa"

5. {n,}: the preceding pattern repeats n or more times.
- For example, a{2,} matches a string in which 'a' repeats 2 or more times.
- Pattern: a{2,} -> matches: "aa", "aaa", ...

6. {n,m}: the preceding pattern repeats at least n and at most m times.
- For example, a{2,4} matches a string in which 'a' repeats between 2 and 4 times.
- Pattern: a{2,4} -> matches: "aa", "aaa", "aaaa"
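The quantifier table above can be verified with re.fullmatch(), which succeeds only when the whole string matches. A minimal sketch; the helper name `matches` and the candidate list are hypothetical, introduced only for this illustration:

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ['', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa']

print(matches(r'a*', candidates))      # all six candidates
print(matches(r'a+', candidates))      # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']
print(matches(r'a?', candidates))      # ['', 'a']
print(matches(r'a{3}', candidates))    # ['aaa']
print(matches(r'a{2,}', candidates))   # ['aa', 'aaa', 'aaaa', 'aaaaa']
print(matches(r'a{2,4}', candidates))  # ['aa', 'aaa', 'aaaa']
```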

6) Summary of ^, |, $

1. ^: the pattern must match at the start of the string.
- For example, ^word requires "word" to appear at the very start of the string.

2. $: the pattern must match at the end of the string.

- For example, end$ requires "end" to appear at the very end of the string.

3. |: the "or" operator; one of several alternative patterns must match.
- cat|dog: matches "cat" or "dog".
- apple|banana|orange: matches "apple", "banana", or "orange".
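A few quick checks of the anchors and the alternation operator, using only the standard library. The sample strings are made up for illustration; the last two lines show that | binds more loosely than the anchors, so a group is needed to anchor both alternatives.

```python
import re

starts = bool(re.search(r'^word', 'word up'))     # True
not_start = bool(re.search(r'^word', 'a word'))   # False
ends = bool(re.search(r'end$', 'the end'))        # True
not_end = bool(re.search(r'end$', 'endless'))     # False
animals = re.findall(r'cat|dog', 'cat, dog, bird')  # ['cat', 'dog']

# ^cat|dog$ means (^cat) or (dog$), so 'catalog' matches via ^cat.
loose = bool(re.search(r'^cat|dog$', 'catalog'))        # True
# Grouping anchors both alternatives to the whole string.
strict = bool(re.search(r'^(?:cat|dog)$', 'catalog'))   # False

print(starts, not_start, ends, not_end, animals, loose, strict)
```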

6. Searching the Parse Tree Using Beautiful Soup

1) find():

- find() returns the first element in the HTML document that matches; it returns None when nothing matches.
- Usage:

- soup.find('div', class_='content') finds the first <div> element whose class is "content".

# Example HTML
# <html>
#   <body>
#     <div class="content">Content 1</div>
#     <div class="content">Content 2</div>
#   </body>
# </html>

from bs4 import BeautifulSoup

html_doc = """
<html>
  <body>
    <div class="content">Content 1</div>
    <div class="content">Content 2</div>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
result = soup.find('div', class_='content')
print(result.text)  # Output: Content 1
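When nothing matches, find() returns None, so reading .text directly would raise an AttributeError. A minimal sketch of a guard, assuming the built-in 'html.parser' and a one-line HTML string made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="content">Content 1</div>', 'html.parser')

missing = soup.find('span', class_='content')  # there is no <span> in this document
print(missing)                                 # None

# Guard before touching .text, otherwise None.text raises AttributeError.
text = missing.text if missing is not None else ''
print(repr(text))                              # ''

# find_all() signals "no match" with an empty list instead of None.
empty = soup.find_all('span')
print(empty)                                   # []
```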

2) find_all():

- find_all() returns every matching element in the HTML document as a list (a ResultSet).

- Usage:

- soup.find_all('div', class_='content') returns all <div> elements whose class is "content" as a list. Each element can be accessed with a for loop.

from bs4 import BeautifulSoup

html_doc = """
<html>
  <body>
    <div class="content">Content 1</div>
    <div class="content">Content 2</div>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
results = soup.find_all('div', class_='content')
for result in results:
    print(result.text)
# Output:
# Content 1
# Content 2

3) find_all() with a dictionary parameter (attrs)

attr = {'class':'story'}
story_tags = soup.find_all(attrs=attr)    # every tag whose class is "story"

4) find_all() with the limit parameter

a_tags = soup.find_all('a',limit=2)
print(a_tags)

5) find_all() with the class_ parameter

tags = soup.find_all(class_='story')
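The attrs, limit, and class_ variants can be compared on one small document. A minimal sketch, assuming the built-in 'html.parser'; the HTML snippet and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time
  <a class="sister" href="/elsie">Elsie</a>
  <a class="sister" href="/lacie">Lacie</a>
  <a class="sister" href="/tillie">Tillie</a>
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all(attrs={'class': 'story'})  # dictionary form
by_class = soup.find_all(class_='story')            # keyword form, same result
first_two = soup.find_all('a', limit=2)             # stop after two matches

print(len(by_attrs), len(by_class))      # 2 2
print([a['href'] for a in first_two])    # ['/elsie', '/lacie']
```

The trailing underscore in class_ is needed because class is a reserved word in Python.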

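6) find_all() with the recursive parameter

title = soup.find_all('title',recursive=False)

- recursive=True: the default. Searches all descendants of the given element recursively.
- recursive=False: searches only the direct children of the given element, i.e. only the elements one level below it. Called on soup itself, whose only direct child is typically <html>, the line above returns an empty list.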
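The effect of recursive=False depends on which element the search starts from. A minimal sketch, assuming the built-in 'html.parser' and an inline HTML string made up for illustration:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Tags</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# The only direct child of the soup object is <html>, so nothing is found.
from_root = soup.find_all('title', recursive=False)

# <title> is a direct child of <head>, so the shallow search succeeds here.
from_head = soup.head.find_all('title', recursive=False)

# The default (recursive=True) searches every level.
deep = soup.find_all('title')

print(from_root)   # []
print(from_head)   # [<title>Tags</title>]
print(deep)        # [<title>Tags</title>]
```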

7. Using Selenium to Handle AJAX & JavaScript Driven Web Pages

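- The course was last updated in December 2018, so the original code gets caught by Chrome's crawler (automation) detection. The code below works around that problem.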

from selenium import webdriver
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# driver = webdriver.Chrome('/Users/waqarjoyia/Downloads/chromedriver')
# Chrome option settings
options = Options()
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
)
options.add_argument("disable-blink-features=AutomationControlled")  # avoid automation detection
options.add_experimental_option(
    "excludeSwitches", ["enable-automation"]
)  # remove the automation indicator
options.add_experimental_option(
    "useAutomationExtension", False
)  # do not use the automation extension

# install and configure the web driver automatically
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=options
)

driver.get("https://www.google.com")

# search tag using id
search_bar = driver.find_element("id", "APjFqb")

# input data
search_bar.send_keys("I want to learn web scraping")

# submit the form
search_bar.submit()

sleep(10)

driver.quit()  # end the WebDriver session and close the browser

8. Web Scraping Your Instagram Account

- This code logs in to Instagram and crawls it, but since it is also old code, it has problems.

- I only fixed the code up to the login attempt. The rest is worth reading as reference.

# Import the required libraries
from bs4 import BeautifulSoup  # library for parsing HTML
from selenium import webdriver  # Selenium WebDriver for browser automation
from time import sleep  # module for pausing (sleep) during execution
from xlsxwriter import Workbook  # library for writing Excel files
import os  # module for OS features (file and folder management, etc.)
import requests  # library for sending HTTP requests
import shutil  # module for copying and moving files
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


# App class for scraping Instagram images and captions
class App:
    def __init__(
        self,
        username="dataminer2060",
        password="WebScraper",
        target_username="dataminer2060",
        path="/Users/Lazar/Desktop/instaPhotos",
    ):
        """
        Class initializer
        :param username: user name used to log in to Instagram
        :param password: password used to log in to Instagram
        :param target_username: user name of the Instagram account to scrape
        :param path: local directory where downloaded images and captions are saved
        """
        self.username = username
        self.password = password
        self.target_username = target_username
        self.path = path
        # Chrome option settings
        self.options = Options()
        self.options.add_argument(
            "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
        )
        self.options.add_argument(
            "disable-blink-features=AutomationControlled"
        )  # avoid automation detection
        self.options.add_experimental_option(
            "excludeSwitches", ["enable-automation"]
        )  # remove the automation indicator
        self.options.add_experimental_option(
            "useAutomationExtension", False
        )  # do not use the automation extension

        # install and configure the web driver automatically
        self.driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()), options=self.options
        )
        # Original approach: create the WebDriver from an explicit ChromeDriver path
        # self.driver = webdriver.Chrome(
        #     "/Users/Lazar/Downloads/chromedriver"
        # )  # the ChromeDriver path must be changed to match your environment
        self.error = False
        self.main_url = "https://www.instagram.com"
        self.all_images = []  # list that collects the scraped image tags

        # open the Instagram main page
        self.driver.get(self.main_url)
        sleep(3)  # wait briefly for the page to load

        # attempt to log in
        self.log_in()

        # continue only while no step has reported an error
        if self.error is False:
            self.close_dialog_box()  # close the popup shown after login
            self.open_target_profile()  # go to the target profile

        if self.error is False:
            self.scroll_down()  # scroll the page so that all images load

        if self.error is False:
            # create the destination folder if it does not exist
            if not os.path.exists(path):
                os.mkdir(path)
            self.downloading_images()  # download images and captions

        sleep(3)  # wait briefly after the downloads finish
        self.driver.quit()  # end the WebDriver session and close the browser

    def write_captions_to_excel_file(self, images, caption_path):
        """
        Save the captions of the scraped images to an Excel file
        :param images: scraped image data (list of HTML tag objects)
        :param caption_path: folder in which the caption file is saved
        """
        print("writing to excel")
        # create the Excel file (path and file name)
        workbook = Workbook(os.path.join(caption_path, "captions.xlsx"))
        worksheet = workbook.add_worksheet()

        row = 0
        # write the header in the first row
        worksheet.write(row, 0, "Image name")  # image file name
        worksheet.write(row, 1, "Caption")  # image caption
        row += 1

        # store the file name and caption for each image
        for index, image in enumerate(images):
            filename = "image_" + str(index) + ".jpg"
            try:
                caption = image["alt"]  # take the caption from the img tag's alt attribute
            except KeyError:
                caption = "No caption exists"  # fallback when there is no caption
            worksheet.write(row, 0, filename)
            worksheet.write(row, 1, caption)
            row += 1

        workbook.close()  # save and close the Excel file

    def download_captions(self, images):
        """
        Download (save) the captions
        :param images: scraped image data (list of HTML tag objects)
        """
        # create the subfolder for caption files if it does not exist
        captions_folder_path = os.path.join(self.path, "captions")
        if not os.path.exists(captions_folder_path):
            os.mkdir(captions_folder_path)
        # save the captions to an Excel file
        self.write_captions_to_excel_file(images, captions_folder_path)

        # The disabled code below is an alternative that saves each caption to its own text file.
        """
        for index, image in enumerate(images):
            try:
                caption = image['alt']
            except KeyError:
                caption = 'No caption exists for this image'
            file_name = 'caption_' + str(index) + '.txt'
            file_path = os.path.join(captions_folder_path, file_name)
            link = image['src']
            with open(file_path, 'wb') as file:
                file.write(str('link:' + str(link) + '\n' + 'caption:' + caption).encode())
        """

    def downloading_images(self):
        """
        Download the scraped images to the local disk
        """
        # remove duplicate images with set(), then convert back to a list
        self.all_images = list(set(self.all_images))
        # save the captions
        self.download_captions(self.all_images)
        print("Length of all images", len(self.all_images))
        # download each image
        for index, image in enumerate(self.all_images):
            filename = "image_" + str(index) + ".jpg"
            image_path = os.path.join(self.path, filename)
            link = image["src"]  # image URL
            print("Downloading image", index)
            response = requests.get(
                link, stream=True
            )  # request the image data as a stream
            try:
                # write the image to a file
                with open(image_path, "wb") as file:
                    shutil.copyfileobj(
                        response.raw, file
                    )  # copy the response data into the file (source -> destination)
            except Exception as e:
                print(e)
                print("Could not download image number ", index)
                print("Image link -->", link)

    def scroll_down(self):
        """
        Scroll down the Instagram page so that more images load
        """
        try:
            # read the post count shown at the top of the page
            # (find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH)
            no_of_posts = self.driver.find_element(
                By.XPATH, '//span[text()=" posts"]'
            ).text
            no_of_posts = no_of_posts.replace(" posts", "")
            no_of_posts = str(no_of_posts).replace(",", "")  # e.g. "15,483" -> "15483"
            self.no_of_posts = int(no_of_posts)
            # with more than 12 posts, compute the scroll count (assuming 12 posts load per scroll)
            if self.no_of_posts > 12:
                no_of_scrolls = int(self.no_of_posts / 12) + 3  # +3 for extra scrolls
                try:
                    for value in range(no_of_scrolls):
                        # parse the current page source with BeautifulSoup
                        soup = BeautifulSoup(self.driver.page_source, "lxml")
                        # collect every img tag
                        for image in soup.find_all("img"):
                            self.all_images.append(image)
                        # scroll to the bottom of the page with JavaScript
                        self.driver.execute_script(
                            "window.scrollTo(0, document.body.scrollHeight);"
                        )
                        sleep(2)  # wait for loading after the scroll
                except Exception as e:
                    self.error = True
                    print(e)
                    print("Some error occurred while trying to scroll down")
            sleep(10)  # extra wait after all scrolls (time for images to load)
        except Exception:
            print("Could not find no of posts while trying to scroll down")
            self.error = True

    def open_target_profile(self):
        """
        Go to the target user's profile page
        """
        try:
            # find the Instagram search box
            # (find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH)
            search_bar = self.driver.find_element(
                By.XPATH, '//input[@placeholder="Search"]'
            )
            search_bar.send_keys(self.target_username)  # type the target user name
            target_profile_url = (
                self.main_url + "/" + self.target_username + "/"
            )  # build the target profile URL
            self.driver.get(target_profile_url)  # open the target profile page
            sleep(3)  # wait for the page to load
        except Exception:
            self.error = True
            print("Could not find search bar")

    def close_dialog_box(self):
        """
        Close the dialog box (popup) shown after login
        """
        # reload the current page to reduce the chance of a popup
        sleep(2)
        self.driver.get(self.driver.current_url)
        sleep(3)
        try:
            sleep(3)
            # find the "Not Now" button and click it to close the popup
            # (find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH)
            not_now_btn = self.driver.find_element(By.XPATH, '//*[text()="Not Now"]')
            sleep(3)
            not_now_btn.click()
            sleep(1)
        except Exception:
            # ignore the exception when the popup is not present
            pass

    def close_settings_window_if_there(self):
        """
        Close the settings window (another browser tab or popup) if one is open
        """
        try:
            # if a second window (tab) is open, close it and switch back to the first one
            self.driver.switch_to.window(self.driver.window_handles[1])
            self.driver.close()
            self.driver.switch_to.window(self.driver.window_handles[0])
        except Exception:
            # ignore the exception (no settings window was open)
            pass

    def log_in(self):
        """
        Perform the Instagram login
        """
        print("log_in start")
        try:
            # Original step: find the "Log in" link and click it to reach the login page
            # log_in_button = self.driver.find_element_by_link_text("Log in")
            # log_in_button.click()
            sleep(3)
        except Exception:
            self.error = True
            print("Unable to find login button")
        else:
            try:
                # Find the user name field (phone number, user name, or email input)
                # user_name_input = self.driver.find_element(
                #     "xpath", '//*[@id="loginForm"]/div[1]/div[1]/div/label/input'
                # )
                # user_name_input.send_keys(self.username)  # type the user name
                # sleep(1)
                # look up the web element
                elements = self.driver.find_elements(
                    By.XPATH, '//*[@id="loginForm"]/div[1]/div[1]/div/label/input'
                )

                # report whether the element exists
                if elements:
                    print("XPath exists on the page.")
                else:
                    print("XPath does not exist on the page.")

                user_name_input = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//*[@id="loginForm"]/div[1]/div[1]/div/label/input')
                    )
                )
                user_name_input.send_keys(self.username)

                # Find the password field
                # password_input = self.driver.find_element(
                #     "xpath", '//*[@id="loginForm"]/div[1]/div[2]/div/label/input'
                # )
                # password_input.send_keys(self.password)  # type the password
                # sleep(1)
                password_input = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//*[@id="loginForm"]/div[1]/div[2]/div/label/input')
                    )
                )
                password_input.send_keys(self.password)

                # Original step: submit from the user name field to send the login request
                # user_name_input.submit()
                # sleep(1)
                bt_click = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//*[@id="loginForm"]/div[1]/div[2]/div/label/input')
                    )
                )
                login_button = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable(
                        (By.XPATH, '//*[@id="loginForm"]/div[1]/div[3]/button')
                    )
                )
                login_button.click()

                # close any other window that opens after login
                self.close_settings_window_if_there()
            except Exception:
                print(
                    "Some exception occurred while trying to find username or password field"
                )
                self.error = True


# Main entry point: when this script is run directly, instantiate App to start the whole process
if __name__ == "__main__":
    app = App()

# -------------------------------------------------------------------------------
# Additional notes:
# 1. This code is based on Instagram's web interface.
#    Instagram is updated frequently, so XPaths and the page structure can change,
#    in which case the code may stop working.
#
# 2. Take care not to violate Instagram's Terms of Service.
#    In particular, large-scale scraping can lead to sanctions such as account suspension,
#    so for real use consider the Instagram API or another legitimate method.
#
# 3. When using Selenium WebDriver, the ChromeDriver version must match the Chrome browser version.
#    Check and set the path and version accurately.
#
# 4. sleep() is used to give pages and images enough time to load.
#    These durations may need adjusting depending on network conditions.
#
# 5. The code does simple exception handling when errors occur,
#    but a real project should implement more careful exception handling and logging.
# -------------------------------------------------------------------------------
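The scroll-count arithmetic inside scroll_down can be tried in isolation, without a browser. A minimal sketch, assuming the label text looks like "15,483 posts"; the function name `scrolls_needed` and its `per_load` and `extra` parameters are hypothetical, introduced only for this illustration.

```python
def scrolls_needed(posts_text, per_load=12, extra=3):
    """Turn a label such as '15,483 posts' into a scroll count."""
    # strip the suffix and the thousands separator, e.g. "15,483 posts" -> 15483
    count = int(posts_text.replace(' posts', '').replace(',', ''))
    if count <= per_load:
        return 0  # everything fits on the first screen, no scrolling needed
    # assume per_load posts appear per scroll, plus a few extra scrolls as margin
    return count // per_load + extra

print(scrolls_needed('15,483 posts'))  # 1293
print(scrolls_needed('9 posts'))       # 0
```

Keeping the parsing in a small pure function makes it testable without logging in to anything.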
