PHP Laravel Web Crawler: A Complete Tutorial

Published: 2023-01-14 15:10:18 Category: PHP Tutorials

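　　phpspider is a framework for building web crawlers. With it you do not need to understand how a crawler works internally, or work out for yourself how to cope with being blocked by a site or with sites that require a login or CAPTCHA recognition before they can be crawled. A few lines of PHP are enough to create your own crawler, and the framework's bundled multi-process Worker library keeps the code concise and the crawl fast.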

　　1. Installation

　　composer require owner888/phpspider

　　2. Code example

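　　<?php
　　// Load the Composer autoloader; adjust the path to match your project layout
　　require __DIR__ . '/vendor/autoload.php';
　　use phpspider\core\phpspider;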
　　$configs = array(
　　    'name' => '糗事百科',
　　    'domains' => array(
　　        'qiushibaike.com',
　　        'www.qiushibaike.com'
　　    ),
　　    'scan_urls' => array(
　　        'http://www.qiushibaike.com/'
　　    ),
　　    'content_url_regexes' => array(
　　        'http://www.qiushibaike.com/article/\d+'
　　    ),
　　    'list_url_regexes' => array(
　　        'http://www.qiushibaike.com/8hr/page/\d+\?s=\d+'
　　    ),
　　    'fields' => array(
　　        array(
　　            // Extract the article body from the content page
　　            'name' => "article_content",
　　            'selector' => "//*[@id='single-next-link']",
　　            'required' => true
　　        ),
　　        array(
　　            // Extract the article author from the content page
　　            'name' => "article_author",
　　            'selector' => "//div[contains(@class,'author')]//h2",
　　            'required' => true
　　        ),
　　    ),
　　);
　　$spider = new phpspider($configs);
　　$spider->start();
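　　That is the overall structure of a crawler: first define a $configs array describing the site to crawl, then call $spider = new phpspider($configs); and $spider->start(); to configure and start the crawler.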
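
　　To get a feel for what the two URL pattern lists in the configuration select, the sketch below tests sample URLs against them with plain preg_match. It is only an illustration: classify_url is a hypothetical helper and the sample URLs are made up; how phpspider applies these patterns internally is not shown here, and wrapping them in # delimiters is an assumption of this sketch.

```php
<?php
// Hypothetical helper (not part of phpspider): decide which pattern list a URL belongs to.
function classify_url(string $url, array $list_regexes, array $content_regexes): string
{
    foreach ($content_regexes as $regex) {
        // '#' is used as the delimiter because the patterns themselves contain '/'
        if (preg_match('#' . $regex . '#i', $url)) {
            return 'content';
        }
    }
    foreach ($list_regexes as $regex) {
        if (preg_match('#' . $regex . '#i', $url)) {
            return 'list';
        }
    }
    return 'other';
}

// Same patterns as in the configuration above
$content_regexes = array('http://www.qiushibaike.com/article/\d+');
$list_regexes    = array('http://www.qiushibaike.com/8hr/page/\d+\?s=\d+');

// Made-up sample URLs for illustration
$samples = array(
    'http://www.qiushibaike.com/article/118914171',
    'http://www.qiushibaike.com/8hr/page/2?s=4867046',
    'http://www.qiushibaike.com/about',
);

foreach ($samples as $url) {
    echo classify_url($url, $list_regexes, $content_regexes), "  ", $url, "\n";
}
```

　　Checking patterns this way before a long crawl helps catch a regex that is too strict (no pages match) or too loose (unrelated pages match).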

　　3. Simulated login

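　　use phpspider\core\requests;
　　// Login request URL
　　$login_url = "http://www.waduanzi.com/login?url=http%3A%2F%2Fwww.waduanzi.com%2F";
　　// Form parameters to submit
　　$params = array(
　　    "LoginForm[returnUrl]" => "http%3A%2F%2Fwww.waduanzi.com%2F",
　　    "LoginForm[username]" => "YOUR_USERNAME", // placeholder: use your own account
　　    "LoginForm[password]" => "YOUR_PASSWORD", // placeholder: use your own password
　　    "yt0" => "登录", // value of the submit button ("Log in"), sent unchanged
　　);
　　// Send the login request
　　requests::post($login_url, $params);
　　// After a successful login the framework stores the cookies under the www.waduanzi.com domain; check that they were collected
　　$cookies = requests::get_cookies("www.waduanzi.com");
　　print_r($cookies); // prints the cookie array
　　// requests collects cookies automatically and sends them with every request to URLs under this domain
　　// Next, request a page that is only visible after logging in
　　$url = "http://www.waduanzi.com/member";
　　$html = requests::get($url);
　　echo $html;     // the page as seen by a logged-in user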
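
　　The cookie handling described in the comments above can be pictured with a small standalone sketch: read the Set-Cookie lines of a response and turn them into the Cookie header for the next request. This is not phpspider's implementation; collect_cookies and cookie_header are hypothetical helpers, and the response headers are invented sample data.

```php
<?php
// Hypothetical helper: pull name => value pairs out of raw response header lines.
function collect_cookies(array $header_lines): array
{
    $cookies = array();
    foreach ($header_lines as $line) {
        // Header names are case-insensitive, so compare with stripos
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading "name=value"; attributes such as path follow after ';'
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        list($name, $value) = explode('=', $pair, 2);
        $cookies[trim($name)] = $value;
    }
    return $cookies;
}

// Hypothetical helper: build the Cookie request header from the collected pairs.
function cookie_header(array $cookies): string
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Invented sample response headers, as a login endpoint might return them
$response_headers = array(
    'HTTP/1.1 302 Found',
    'Set-Cookie: PHPSESSID=abc123; path=/; HttpOnly',
    'Set-Cookie: token=xyz; path=/',
    'Location: /member',
);

$cookies = collect_cookies($response_headers);
print_r($cookies);
echo cookie_header($cookies), "\n";
```

　　A real client also has to respect the domain, path and expiry attributes of each cookie, which this sketch ignores.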
　　4. Proxy IPs

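　　// Plain request, no proxy
　　$url = "http://www.epooll.com/archives/806/";
　　$contents = file_get_contents($url);
　　// The tags in the original pattern were lost; <h1> is an assumption used here for illustration
　　preg_match_all("/<h1>(.*?)<\/h1>/is", $contents, $matches);
　　print_r($matches[0]);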
　　// Request through a proxy
　　$context = array(
　　    'http' => array(
　　        'proxy' => 'tcp://192.168.0.2:3128', // IP address and port of the proxy you want to use
　　        'request_fulluri' => true,
　　    ),
　　);
　　$context = stream_context_create($context);
　　$html = file_get_contents("http://www.epooll.com/archives/806/", false, $context);
　　echo $html;
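　　// Request through a proxy that requires authentication
　　$auth = base64_encode('USER:PASS');   // LOGIN:PASSWORD, the account name and password for the proxy server
　　$context = array(
　　    'http' => array(
　　        'proxy' => 'tcp://192.168.0.2:3128', // IP address and port of the proxy you want to use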
　　        'request_fulluri' => true,
　　        'header' => "Proxy-Authorization: Basic $auth",
　　    ),
　　);
　　$context = stream_context_create($context);
　　$html = file_get_contents("http://www.epooll.com/archives/806/", false, $context);
　　echo $html;
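
　　The two proxy snippets above differ only in the Proxy-Authorization header, so the options array can be built by a single function. The following is a minimal sketch under that assumption; proxy_context_options is a hypothetical helper name (it belongs neither to PHP nor to phpspider), and the proxy address and credentials are the same placeholders used above.

```php
<?php
// Hypothetical helper: build HTTP stream context options for a proxy,
// adding Basic proxy authentication only when credentials are supplied.
function proxy_context_options(string $proxy, ?string $user = null, ?string $pass = null): array
{
    $http = array(
        'proxy'           => $proxy, // e.g. tcp://host:port
        'request_fulluri' => true,   // send the full URI in the request line, as in the snippets above
    );
    if ($user !== null && $pass !== null) {
        $http['header'] = 'Proxy-Authorization: Basic ' . base64_encode($user . ':' . $pass);
    }
    return array('http' => $http);
}

// Without credentials
$plain = proxy_context_options('tcp://192.168.0.2:3128');

// With credentials (placeholders)
$authed  = proxy_context_options('tcp://192.168.0.2:3128', 'USER', 'PASS');
$context = stream_context_create($authed);

// Inspect what the context will send; no network request is made here
print_r(stream_context_get_options($context));
```

　　The resulting $context can then be passed as the third argument of file_get_contents, exactly as in the snippets above.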

(Editor: 我爱制作网_池州站长网)
