How to Turn Crawler Traffic into Useful Traffic


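Since yesterday the site has been going down every little while. I checked the nginx log and, wow, crawlers were scraping data all over the place. The nginx log looks like this: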

{"@timestamp":"2018-07-06T11:09:43+08:00","host":"172.18.41.187","clientip":"95.216.0.38","size":16129,"responsetime":0.803,"upstreamtime":"0.803","upstreamhost":"127.0.0.1:9000","http_host":"it.baiked.com","uri":"/index.php","query_string":"wpzmaction=add&postid=2071","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}
{"@timestamp":"2018-07-06T11:09:46+08:00","host":"172.18.41.187","clientip":"95.216.0.38","size":16124,"responsetime":0.813,"upstreamtime":"0.813","upstreamhost":"127.0.0.1:9000","http_host":"it.baiked.com","uri":"/index.php","query_string":"wpzmaction=add&postid=2096","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}
{"@timestamp":"2018-07-06T11:09:49+08:00","host":"172.18.41.187","clientip":"95.216.0.38","size":16133,"responsetime":0.819,"upstreamtime":"0.819","upstreamhost":"127.0.0.1:9000","http_host":"it.baiked.com","uri":"/index.php","query_string":"wpzmaction=add&postid=2111","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}
{"@timestamp":"2018-07-06T11:09:52+08:00","host":"172.18.41.187","clientip":"95.216.0.38","size":16124,"responsetime":0.796,"upstreamtime":"0.796","upstreamhost":"127.0.0.1:9000","http_host":"it.baiked.com","uri":"/index.php","query_string":"wpzmaction=add&postid=2125","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}
{"@timestamp":"2018-07-06T11:09:54+08:00","host":"172.18.41.187","clientip":"95.216.0.38","size":10193,"responsetime":0.503,"upstreamtime":"0.503","upstreamhost":"127.0.0.1:9000","http_host":"it.baiked.com","uri":"/index.php","query_string":"wpzmaction=add&postid=1272","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}

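The site has not been live for long, and the crawlers made it extremely unstable, which badly hurts the user experience. At the same time, I did not want the traffic the crawlers bring to go to waste. Here is how I handled it: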

1: First, work out which crawlers we do not want. Baidu and Google must of course stay allowed, otherwise the site cannot be promoted.
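One way to see which crawlers dominate a JSON-lines log like the one above is to count records per User-Agent. The following is only a minimal sketch: the function name `count_by` and the `SAMPLE` list are made up for illustration, and the sample records are taken from the log above but trimmed to the fields used here.

```python
import json
from collections import Counter


def count_by(lines, field="agent"):
    """Count log records by one field, skipping lines that are not valid JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a record that was truncated or wrapped across lines
        if isinstance(record, dict):
            counts[record.get(field, "-")] += 1
    return counts


# Records from the log above, trimmed to the fields used here (illustration only).
SAMPLE = [
    '{"clientip":"95.216.0.38","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}',
    '{"clientip":"95.216.0.38","agent":"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)","status":"200"}',
    '{"clientip":"54.36.149.47","agent":"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)","status":"302"}',
    '{"clientip":"95.216.0.38","agent":"Mozilla/5.0 (compatible; MJ12bot',  # broken line, skipped
]

agent_counts = count_by(SAMPLE)
ip_counts = count_by(SAMPLE, field="clientip")
```

On a real log you would pass an open file object instead of `SAMPLE` and look at `agent_counts.most_common(10)`; the same function with `field="status"` shows how many requests were answered with 200 versus 302.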

This is how I configured nginx:

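# Redirect requests whose User-Agent matches any keyword below (case-insensitive); "^$" matches an empty User-Agent.
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
    # A replacement that starts with http:// makes nginx answer with a redirect. The original query string is appended to the target; end the replacement with "?" to drop it.
    rewrite ^/(.*)$ http://www.baidu.com/link?url=SdN-Qrmgjx9HtLQ20-es5WiWrmvIfI-Csfrgg54sYfK;
}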

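Put the nginx snippet above inside the server block. Whenever the User-Agent contains one of these keywords, the request is redirected straight to the Baidu search result for this blog, which helps with promotion, heh.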
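Before deploying a keyword list like this, it can help to check offline which User-Agents it would catch, so that the search engines you want to keep are not redirected by accident. The sketch below reuses the pattern from the config and assumes that Python's `re` with `IGNORECASE` behaves like nginx's `~*` for a plain alternation such as this one. The Googlebot and Baiduspider strings are commonly published examples, not taken from this site's log, and `is_redirected` is a name made up for illustration.

```python
import re

# The keyword list from the nginx config above, unchanged.
BLOCKED = (
    "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|"
    "CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|"
    "Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|"
    "lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|"
    "EasouSpider|Ezooms|^$"
)
PATTERN = re.compile(BLOCKED, re.IGNORECASE)


def is_redirected(user_agent):
    """Return True if the User-Agent would match the blocklist pattern."""
    return PATTERN.search(user_agent) is not None


AGENTS = {
    "MJ12bot": "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
    "AhrefsBot": "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)",
    "empty": "",
    # Assumed example strings for the crawlers that should stay allowed.
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Baiduspider": "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
}

results = {name: is_redirected(ua) for name, ua in AGENTS.items()}
```

Note that the short keyword `oBot` also matches any User-Agent that merely contains the word "robot", for example in a URL, so a list like this is worth re-checking whenever a keyword is added.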

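2: The access frequency of a single IP should be limited as well; there is no point in letting the same IP hit the site nonstop. The nginx configuration is as follows: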

Add the following to the http block:

limit_conn_log_level error;
limit_conn_status 503;
limit_conn_zone $binary_remote_addr zone=one:10m;
limit_conn_zone $server_name zone=perserver:10m;
limit_req_zone $remote_addr zone=allips:10m rate=20r/s;

Put the following in the server block:

limit_conn one 50; 
limit_conn perserver 1000; 
limit_req zone=allips burst=5 nodelay;

The parameters are explained below:

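zone=one, zone=perserver and zone=allips name shared-memory zones of 10 MB each that store per-key state (not sessions). "one" and "perserver" are limit_conn_zone zones; "allips" is a limit_req_zone zone. The nginx documentation gives a figure of about 16 thousand 64-byte states per megabyte.
rate=20r/s limits each key (here $remote_addr, the client IP) to an average of 20 requests per second. The rate must be an integer; to allow one request every two seconds, set it to 30r/m.
limit_conn one 50 limits each IP to at most 50 concurrent connections. It counts connections, not requests per second.
burst=5 is the leaky-bucket allowance of limit_req. nginx tracks the rate at millisecond granularity, so 20r/s means one request every 50 ms, and up to 5 requests arriving ahead of that pace are tolerated. Requests beyond the burst are rejected with a 503 error.
nodelay: without this option the requests inside the burst allowance are delayed so that they are processed at the configured rate, which can leave a large number of connections waiting. With nodelay they are served immediately, while requests beyond the burst are still rejected.
limit_conn perserver 1000 means that the total number of concurrent connections to this virtual server may not exceed 1000; connections above that are rejected with the status set by limit_conn_status (503 here).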
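The effect of rate=20r/s with burst=5 and nodelay is easier to see with a small simulation. The class below is a simplified model of the leaky-bucket accounting as I understand it, not nginx's actual code: it keeps one counter for a single client, works in whole milliseconds, and only decides accept or reject. The names `LimitReqModel` and `run` are made up for illustration.

```python
class LimitReqModel:
    """Simplified leaky-bucket model for one client key (accept/reject only)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # allowed requests per second
        self.burst_milli = burst * 1000   # burst, in thousandths of a request
        self.excess_milli = 0             # how far the client is ahead of the rate
        self.last_ms = None               # time of the last accepted request

    def allow(self, now_ms):
        if self.last_ms is None:          # first request from this client
            self.last_ms = now_ms
            return True
        elapsed = max(now_ms - self.last_ms, 0)
        # The bucket drains by rate * elapsed and fills by one request.
        excess = max(self.excess_milli - self.rate * elapsed + 1000, 0)
        if excess > self.burst_milli:
            return False                  # over the burst: rejected, state unchanged
        self.excess_milli = excess
        self.last_ms = now_ms
        return True


def run(times_ms, rate=20, burst=5):
    """Feed request timestamps (in ms) to a fresh model; return accept flags."""
    model = LimitReqModel(rate, burst)
    return [model.allow(t) for t in times_ms]


all_at_once = run([0] * 25)                   # 25 requests in the same millisecond
steady = run(range(0, 1000, 50))              # exactly 20 requests per second
slightly_fast = run(range(0, 1080, 40))       # 25 requests per second, 27 requests
```

In this model, 25 simultaneous requests yield 6 accepted (the first plus a burst of 5) and 19 rejected, a steady 20 requests per second is never rejected, and a client running at 25 requests per second uses up the burst gradually and gets its first rejection on the 27th request.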

 

Then let's look at the nginx log again:

{"@timestamp":"2018-07-06T12:11:11+08:00","host":"172.18.41.187","clientip":"151.80.39.177","size":161,"responsetime":0.000,"upstreamtime":"-","upstreamhost":"-","http_host":"it.baiked.com","uri":"/wp-content/themes/begin/inc/go.php","query_string":"url=http://%0D%20www.raducobra.com/0b1ad934/young-thug-ooou.html","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; AhrefsBt/5.2; +http://ahrefs.com/robot/)","status":"302"}
{"@timestamp":"2018-07-06T12:11:13+08:00","host":"172.18.41.187","clientip":"54.36.149.47","size":161,"responsetime":0.000,"upstreamtime":"-","upstreamhost":"-","http_host":"it.baiked.com","uri":"/author/admin/","query_string":"wpzmaction=add&postid=1456","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)","status":"302"}
{"@timestamp":"2018-07-06T12:11:39+08:00","host":"172.18.41.187","clientip":"54.36.148.120","size":161,"responsetime":0.000,"upstreamtime":"-","upstreamhost":"-","http_host":"it.baiked.com","uri":"/jdkapi1.6/javax/imageio/event/package-tree.html","query_string":"-","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)","status":"302"}
{"@timestamp":"2018-07-06T12:15:31+08:00","host":"172.18.41.187","clientip":"54.36.149.54","size":161,"responsetime":0.000,"upstreamtime":"-","upstreamhost":"-","http_host":"it.baiked.com","uri":"/jdkapi1.6/java/io/DataOutput.html","query_string":"-","request_method":"GET","xff":"-","referer":"-","agent":"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)","status":"302"}

Every one of them was redirected to the Baidu search page for my site. Let even more crawlers come.

 

 

 
