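Last time we covered how to build a Docker image from a Dockerfile. This time, folks, we will create containers from that image, map the code into the containers, and end up with a multi-task app scraping system. Source: https://github.com/limingios/dockerpython.git (源码/「docker实战篇」python的docker-docker系统管理-基础概念(27))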
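Task requirements in detail
We need to scrape three apps: Douyin, Kuaishou, and Jinri Toutiao. Specifically:
1. Douyin: author data for the video currently playing
2. Kuaishou: author data for the video currently playing
3. Jinri Toutiao: news from the recommended feed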
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/3/18 20:31
# @Author : Aries
# @Site :
# @File : handle_appium_docker.py
# @Software: PyCharm
import multiprocessing
import time
from appium import webdriver
from selenium.webdriver.support.wait import WebDriverWait


def get_size(driver):
    """Return the device screen size as (width, height)."""
    size = driver.get_window_size()
    return (size['width'], size['height'])


def handle_appium(info):
    # Desired capabilities for one app on one device
    cap = {
        "platformName": "Android",
        "platformVersion": "4.4.2",
        "deviceName": info['device'],
        "udid": info['device'],
        "appPackage": info['appPackage'],
        "appActivity": info['appActivity'],
        "noReset": True,
        "unicodeKeyboard": True,
        "resetKeyboard": True
    }
    # Each app connects to its own Appium server port on the same host
    driver = webdriver.Remote("http://192.168.199.140:" + str(info["port"]) + "/wd/hub", cap)
    l = get_size(driver)
    x1 = int(l[0] * 0.5)
    y1 = int(l[1] * 0.15)
    y2 = int(l[1] * 0.9)
    # Douyin
    if info["appPackage"] == "com.ss.android.ugc.aweme":
        # "//android" is a placeholder; replace it with an XPath that matches the real UI
        if WebDriverWait(driver, 60).until(lambda x: x.find_element_by_xpath("//android")):
            while True:
                # swipe(start_x, start_y, end_x, end_y): the touch starts at 15% of the screen height and ends at 90%
                driver.swipe(x1, y1, x1, y2)
                time.sleep(3)
    # Kuaishou
    if info["appPackage"] == "com.smile.gifmaker":
        # "//android" is a placeholder; replace it with an XPath that matches the real UI
        if WebDriverWait(driver, 60).until(lambda x: x.find_element_by_xpath("//android")):
            while True:
                # swipe(start_x, start_y, end_x, end_y)
                driver.swipe(x1, y1, x1, y2)
                time.sleep(3)
    # Jinri Toutiao
    if info["appPackage"] == "com.ss.android.article.news":
        # "//android" is a placeholder; replace it with an XPath that matches the real UI
        if WebDriverWait(driver, 60).until(lambda x: x.find_element_by_xpath("//android")):
            while True:
                # swipe(start_x, start_y, end_x, end_y)
                driver.swipe(x1, y1, x1, y2)
                time.sleep(3)


if __name__ == '__main__':
    m_list = []
    devices_list = [
        {
            "device": "192.168.199.133:5555",
            "appPackage": "com.ss.android.ugc.aweme",
            "appActivity": "com.ss.android.ugc.aweme.main.MainActivity",
            "port": 4723,
            "key": '抖音'
        },
        {
            "device": "192.168.199.133:5555",
            "appPackage": "com.smile.gifmaker",
            "appActivity": "com.yxcorp.gifshow.HomeActivity",
            "port": 4725,
            "key": '快手'
        },
        {
            "device": "192.168.199.133:5555",
            "appPackage": "com.ss.android.article.news",
            "appActivity": "com.ss.android.article.news.activity.SplashBadgeActivity",
            "port": 4727,
            "key": '今日头条'
        }
    ]
    # One process per app so the three sessions run in parallel
    for device in devices_list:
        m_list.append(multiprocessing.Process(target=handle_appium, args=(device,)))
    for m1 in m_list:
        m1.start()
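The three branches in handle_appium differ only in the package name, so the per-device setup can be pulled into small pure functions that are easy to check without a phone attached. The sketch below is one possible arrangement, not code from the repository: build_caps, server_url, and swipe_points are hypothetical names, and the default host simply reuses the address from the script above.

```python
def build_caps(info):
    """Build the Appium desired capabilities for one device/app entry."""
    return {
        "platformName": "Android",
        "platformVersion": "4.4.2",
        "deviceName": info["device"],
        "udid": info["device"],
        "appPackage": info["appPackage"],
        "appActivity": info["appActivity"],
        "noReset": True,
        "unicodeKeyboard": True,
        "resetKeyboard": True,
    }


def server_url(port, host="192.168.199.140"):
    """Return the Appium server endpoint for the given port."""
    return "http://{}:{}/wd/hub".format(host, port)


def swipe_points(width, height):
    """Return (x, y_start, y_end) using the same 50% / 15% / 90% ratios."""
    return int(width * 0.5), int(height * 0.15), int(height * 0.9)
```

With these in place, handle_appium could call webdriver.Remote(server_url(info["port"]), build_caps(info)) and run a single swipe loop for all three apps.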
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/3/18 19:57
# @Author : Aries
# @Site :
# @File : decode_data.py
# @Software: PyCharm
import json
from handle_mongo import mongo_info


def response(flow):
    # Hook in the mitmproxy scripting style: called once per captured HTTP response
    # Douyin
    if 'aweme.snssdk.com/aweme/v1/feed' in flow.request.url:
        douyin_data_dict = json.loads(flow.response.text)
        for douyin_item in douyin_data_dict['aweme_list']:
            mongo_info.insert_item(douyin_item)
    # Kuaishou
    elif 'api.gifshow.com/rest/n/feed/hot' in flow.request.url:
        kuaishou_data_dict = json.loads(flow.response.text)
        for kuaishou_item in kuaishou_data_dict['feeds']:
            mongo_info.insert_item(kuaishou_item)
    # Jinri Toutiao
    elif 'is.snssdk.com/api/news/feed' in flow.request.url:
        jrtt_data_dict = json.loads(flow.response.text)
        # NOTE: the 'feeds' key mirrors the Kuaishou branch; confirm it against a captured Toutiao response
        for jrtt_item in jrtt_data_dict['feeds']:
            mongo_info.insert_item(jrtt_item)
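The capture script repeats the same load-and-iterate pattern for each app. As a sketch, the URL-to-key mapping can live in one table, which lets the parsing logic be exercised on saved responses without running the proxy. FEED_RULES and extract_items are hypothetical names; the URL fragments and JSON keys are copied from decode_data.py, so the Toutiao key is just as unverified here as it is there.

```python
import json

# (URL fragment, JSON key holding the item list); values copied from decode_data.py
FEED_RULES = [
    ("aweme.snssdk.com/aweme/v1/feed", "aweme_list"),  # Douyin
    ("api.gifshow.com/rest/n/feed/hot", "feeds"),      # Kuaishou
    ("is.snssdk.com/api/news/feed", "feeds"),          # Jinri Toutiao (key unverified)
]


def extract_items(url, body_text):
    """Return the list of items for a known feed URL, or [] otherwise."""
    for fragment, key in FEED_RULES:
        if fragment in url:
            try:
                data = json.loads(body_text)
            except ValueError:
                return []  # body is not JSON, e.g. an error page
            items = data.get(key, []) if isinstance(data, dict) else []
            return items if isinstance(items, list) else []
    return []
```

Inside response(flow) this would be called as extract_items(flow.request.url, flow.response.text), with each returned item passed to mongo_info.insert_item. Returning an empty list instead of raising keeps one malformed response from stopping the capture.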
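Deployment
Getting data with a Python scraper is not really the hard part; the hardest part is setting up the deployment environment.
The virtual machine is created directly with Vagrant
The source includes the Vagrant file. If you want to learn how to use it, see my intermediate-level articles, which explain it in detail.
Downloading the images
Download three images: MongoDB, Appium, and zhugeaming/python3-appium.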
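1. The MongoDB image
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
Note that -v takes host_path:container_path, and /path/to/mongodb-persistence is a placeholder for the host directory that should hold the data.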
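Once the container is up, it helps to confirm that port 27017 is actually reachable before pointing the scraper at it. The following is a minimal sketch using only the standard library; port_open is a hypothetical helper, and the host to check is an assumption (use the IP of the machine running Docker).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts
        return False
```

For example, port_open("127.0.0.1", 27017) run on the Docker host should return True once MongoDB is listening. This only proves the port mapping works; it does not check authentication or that the database accepts writes.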
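2. The Appium image
docker search appium
# The image is fairly large, over 1 GB. A registry mirror was configured earlier, so download time depends on your own network speed.
docker pull appium/appium
3. The zhugeaming/python3-appium image
docker pull zhugeaming/python3-appium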
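If the environment has to be rebuilt often, the pulls can be scripted. This is a rough sketch rather than part of the original project: pull_command and pull_images are hypothetical helpers, the image names are the ones used above, and the docker CLI is assumed to be on the PATH.

```python
import subprocess

# Image names taken from the commands above
IMAGES = ["bitnami/mongodb:latest", "appium/appium", "zhugeaming/python3-appium"]


def pull_command(image):
    """Return the argv list for pulling one image."""
    return ["docker", "pull", image]


def pull_images(images, runner=subprocess.run):
    """Pull each image in turn; return the names whose pull failed."""
    failed = []
    for image in images:
        result = runner(pull_command(image))
        if result.returncode != 0:
            failed.append(image)
    return failed
```

Calling pull_images(IMAGES) runs the three pulls one after another and returns an empty list when all of them succeed. The runner parameter exists only so the function can be tried without Docker installed.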
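1. Virtual machines created by Vagrant all run on VirtualBox
2. Set up the shared folder
This step is done on the Windows host
3. Mount the shared folder inside the virtual machine
Remember that the shared folder is named handle_docker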
mkdir docker
cd docker
sudo yum update && sudo yum -y install kernel-headers kernel-devel
sudo mount -t vboxsf handle_docker /root/docker/
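After mounting, a quick check from inside the virtual machine confirms that the shared folder is really attached and that the scripts are visible. The snippet below is a small sketch; describe_mount is a hypothetical helper, and /root/docker is the mount point used in the command above.

```python
import os


def describe_mount(path):
    """Report whether path exists, is a mount point, and what it contains."""
    exists = os.path.isdir(path)
    return {
        "exists": exists,
        "is_mount": os.path.ismount(path) if exists else False,
        "entries": sorted(os.listdir(path)) if exists else [],
    }
```

Running describe_mount("/root/docker") inside the VM should report is_mount as True and list the project files; an empty entries list usually means the mount did not take effect.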
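PS: The basic files are all mounted now. It is getting late, so next time we will pick up from here and get the environment running.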