[Look-alike Actor Classifier] Loading the Data

Choosing a method

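When I wondered how I could possibly collect around 1,000 photos by hand, web scraping was the first simple approach that came to mind.

I had learned web scraping from Professor 이강표 of the University of Iowa through a program run by my school, so I considered giving it a try, but I soon found out there was an even simpler way.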

There is a remarkably handy tool called bing_image_downloader!

With this library you can download the images that appear in the search results on Bing.com.

Installing bing_image_downloader

The library can be installed as follows.

$ pip install bing-image-downloader

or

$ git clone https://github.com/gurugaurav/bing_image_downloader
$ cd bing_image_downloader
$ pip install .

Using bing_image_downloader

It can be used like this:

from bing_image_downloader import downloader
downloader.download(query_string, limit=100, output_dir='dataset', adult_filter_off=True, force_replace=False, timeout=60, verbose=True)

The arguments passed in are as follows.

query_string: String to be searched.
limit: (optional, default is 100) Number of images to download.
output_dir: (optional, default is 'dataset') Name of output dir.
adult_filter_off: (optional, default is True) Enable or disable adult filtering.
force_replace: (optional, default is False) Delete folder if present and start a fresh download.
timeout: (optional, default is 60) timeout for connection in seconds.
verbose: (optional, default is True) Enable downloaded message.

I actually downloaded the data with the code below.

import os
from bing_image_downloader import downloader

# Each actor's name is used as the Bing search query
query_list = ['박보검', '송중기', '차은우', '현빈', '김수현', '류준열', '박서준', '공유', '이병헌', '정우성']
for query in query_list:
    downloader.download(query, limit=1000, output_dir=os.path.join('dataset', 'boy', 'full'),
                        adult_filter_off=True, force_replace=False, timeout=60)
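Since limit is only an upper bound, it may be worth checking how many images actually arrived for each actor. Below is a minimal sketch using only the standard library. It assumes the downloader saves each query's images in a subfolder named after the query inside output_dir; the helper name count_images and the list of file extensions are my own choices for illustration, not part of the library.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the download produced.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root):
    """Return {subfolder name: number of image files} for each subfolder of root."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        # Nothing has been downloaded yet (or the path is wrong).
        return counts
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # Same output_dir as in the download loop above.
    for name, n in count_images(Path("dataset") / "boy" / "full").items():
        print(f"{name}: {n}")
```

An actor with far fewer images than the others probably needs a different search term or an additional source.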

However, because these images were simply downloaded from a search engine's results, quite a few of them were unrelated to the actor. Those images have to be removed by hand.
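Before the manual pass, one part of the cleanup can likely be automated: search results tend to contain the exact same file more than once. The sketch below only lists byte-for-byte duplicates by comparing SHA-256 hashes; it cannot tell whether an image is unrelated to the actor, so that still has to be checked by eye. The function name find_duplicates and the example folder path are hypothetical.

```python
import hashlib
from pathlib import Path


def find_duplicates(folder):
    """Return paths whose contents are identical to an earlier file in folder."""
    seen = {}        # content hash -> first path seen with that content
    duplicates = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates


if __name__ == "__main__":
    # Hypothetical folder for one actor; assumes it already exists.
    folder = Path("dataset") / "boy" / "full" / "박보검"
    if folder.is_dir():
        for dup in find_duplicates(folder):
            print("duplicate:", dup)  # review first, then delete with dup.unlink()
```

The sketch prints instead of deleting so the list can be reviewed first. Resized or re-encoded copies of the same photo have different bytes and will not be caught this way.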

(Now that the post is written, I'll go sacrifice my fingers to get a properly cleaned dataset....)
