How do I extract links from a page?

roddik · 17 Dec 2007

Now this is elegant; the preg functions suck!

Today must be guest post day here at Blue Hat. This one is written by Justin. I figured quite a few of you reading up on the SEO Empire might be able to use the information. Sorry it took me so long to get it up, buddy.

——————————————-
In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn

How to get the content from a website (URL).
How to parse the HTML so you can extract links.
How to extract links from specific parts of a page.
How to store the links in a MySQL database.
The legal issues associated with scraping content.

What You Will Need

Basic knowledge of PHP and MySQL.
A web server running PHP 5.
The cURL extension for PHP.
MySQL - if you want to store the links.

Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
// curl_exec() returns false on failure; an empty page is not an error
if ($html === false) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}

If the request is successful, $html will be filled with the content of $target_url. If the call fails, we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL, $target_url);

This line determines what URL will be requested. For example, the completed script below sets $target_url = "http://www.merchantos.com/". I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT - see below); they are described in the PHP manual entry for curl_setopt().

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc.) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:
$userAgent = 'Googlebot/2.1 (...)'; // remainder of the string is not shown in the post
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s.

The DOM functions allow you to parse HTML (or XML) into an object structure (the DOM, or Document Object Model). Let’s see how we do it:
$dom = new DOMDocument();
@$dom->loadHTML($html);

Wow, is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this in a post by Russell Beattie; thanks, Russell!

Tip: You may have noticed I put @ in front of loadHTML(). This suppresses some annoying warnings that the HTML parser throws on many pages that have non-standards-compliant code.
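
As an alternative to the @ operator, libxml can be told to collect parser messages internally, so they can be inspected instead of being silently discarded. This is a minimal sketch, assuming the markup is already in a string; the helper name parseHtmlQuietly() and the sample markup are hypothetical and not part of the original tutorial:

```php
<?php
// Hypothetical helper: parse HTML without the @ operator.
// Returns array($dom, $errors), where $errors is a list of LibXMLError objects.
function parseHtmlQuietly($html)
{
    // Collect libxml messages internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Keep the collected messages for inspection, then empty the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore whatever error mode was active before the call.
    libxml_use_internal_errors($previous);

    return array($dom, $errors);
}

// Sample markup with an unknown element (illustrative only).
list($dom, $errors) = parseHtmlQuietly('<html><body><p>Hi <a href="/x">link</a><foo></body></html>');
echo $dom->getElementsByTagName('a')->length, " link(s), ", count($errors), " parser message(s)\n";
```

Restoring the previous mode at the end keeps the helper from changing error handling for the rest of the script.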

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to DOMXPath::evaluate(). I’m not going to go into all the ways you can use XPath because I’m just learning myself. Here’s a code snippet that will just get every link on the page using XPath:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

Iterate And Store Your Links

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
}

$hrefs is an object of type DOMNodeList, and item() is a function that returns a DOMNode object for the specified index. The index can be between 0 and $hrefs->length - 1. So we’ve got a loop that retrieves each link as a DOMNode object.

$url = $href->getAttribute('href');

Each node returned by this query is a DOMElement (a subclass of DOMNode), and getAttribute() is a DOMElement method. It returns the value of the named attribute of the node (in this case the href attribute of an <a> tag). Now we’ve got our URL and we can store it in the database.
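
The list-only query mentioned in the XPath section can be tried offline together with this loop. The sketch below assumes a small hand-written HTML string; the helper name collectHrefs() is hypothetical. It uses foreach, which works because DOMNodeList is traversable:

```php
<?php
// Illustrative markup: two links inside a list, one outside it.
$html = '<html><body>'
      . '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>'
      . '<p><a href="/three">Three</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: run an XPath query and return the href values it finds.
function collectHrefs(DOMXPath $xpath, $query)
{
    $urls = array();
    // DOMNodeList can be walked with foreach as well as with item($i).
    foreach ($xpath->evaluate($query) as $node) {
        $url = $node->getAttribute('href');
        if ($url !== '') {          // skip <a> tags without an href
            $urls[] = $url;
        }
    }
    return $urls;
}

$listLinks = collectHrefs($xpath, '/html/body//ul//li//a'); // links in lists only
$allLinks  = collectHrefs($xpath, '/html/body//a');         // every link in the body
print_r($listLinks);
print_r($allLinks);
```

Returning a plain array keeps the extraction step separate from whatever is done with the links afterwards (printing, storing, filtering).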
We’ll want a database table that looks something like this:
CREATE TABLE `links` (
    `url` TEXT NOT NULL,
    `gathered_from` TEXT NOT NULL,
    `time_stamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

We’ll need a storeLink() function to put the links in the database. I’ll assume you know the basics of how to connect to a database.
function storeLink($url, $gathered_from) {
    // escape both values so that quotes in a URL cannot break the query
    // (the mysql_* functions exist in PHP 5 only; PHP 7+ needs mysqli or PDO)
    $query = sprintf("INSERT INTO links (url, gathered_from) VALUES ('%s', '%s')",
        mysql_real_escape_string($url), mysql_real_escape_string($gathered_from));
    mysql_query($query) or die('Error, insert query failed');
}

Your Completed Link Scraper

function storeLink($url, $gathered_from) {
    // escape both values so that quotes in a URL cannot break the query
    $query = sprintf("INSERT INTO links (url, gathered_from) VALUES ('%s', '%s')",
        mysql_real_escape_string($url), mysql_real_escape_string($gathered_from));
    mysql_query($query) or die('Error, insert query failed');
}

$target_url = "http://www.merchantos.com/";
$userAgent = 'Googlebot/2.1 (...)'; // remainder of the string is not shown in the post

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false) {
    echo "cURL error number: " . curl_errno($ch);
    echo "cURL error: " . curl_error($ch);
    exit;
}
curl_close($ch);

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); // or alternatively: $xpath->evaluate("//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "Link stored: $url";
}
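
The storeLink() function above builds its SQL by putting the values straight into the query string. One possible alternative is a PDO prepared statement, sketched below. It assumes the pdo_sqlite extension and uses an in-memory SQLite database purely so the example runs without a server; with MySQL, the DSN and credentials passed to PDO would differ. The name storeLinkPdo() is hypothetical:

```php
<?php
// In-memory SQLite database, used here only to keep the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (
    url TEXT NOT NULL,
    gathered_from TEXT NOT NULL,
    time_stamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
)');

// Hypothetical variant of storeLink() that takes the connection explicitly.
function storeLinkPdo(PDO $pdo, $url, $gatheredFrom)
{
    // Placeholders let the driver handle quoting of both values.
    $stmt = $pdo->prepare('INSERT INTO links (url, gathered_from) VALUES (?, ?)');
    $stmt->execute(array($url, $gatheredFrom));
}

// A URL containing a quote, which would break a naively interpolated query.
storeLinkPdo($pdo, "http://example.com/?q=it's", 'http://example.com/');
$count = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
$storedUrl = $pdo->query('SELECT url FROM links')->fetchColumn();
echo "Rows stored: $count\n";
```

Passing the connection as an argument avoids relying on an implicit global connection, which is how the mysql_* functions behave.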

What Else Could I Do With This Thing?

The possibilities are limitless. For starters you might want to store a list of sites that you want scraped in a database and then set up the script so it runs on a regular basis to scrape those sites. You could then compare the link structure over time or maybe republish the links in some sort of directory. Leave a comment below and say what you’re using this script for. Here are a few other things people have done with scrapers in the past:

Build a search engine from the content you gather.
Analyze a site to determine how well it is SEO optimized for keywords.
Republish free content dynamically on your website.
Create an RSS feed from a website.

Is Scraping Content Legal?

There is no easy answer to this question. Many organizations scrape content from all over the web - Google, Yahoo, Microsoft, and many others. These companies get away with it under fair use and because site owners want to be included in the search results. However, there have been lawsuits against these companies.
The real answer is that it depends who you scrape and what you do with the content. Basic copyright law gives authors an automatic copyright on everything they create. But the same law permits fair use of copyrighted material. Fair use includes: criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research. But even these uses could be considered copyright infringement in some circumstances. So be careful before you claim “fair use” as your defense!
Some sites explicitly grant you the right to use their content. They do require you to attribute the content to the author or the URL you scraped it from.

Karlasan · 18 Dec 2007

cURL is not available on every server. In particular, phpinfo() on Masterhost showed that it is not installed there (to my surprise).
Also, the XPath links in the post are broken: they all lead to php.net. I would like to see what kind of beast this is.

roddik · 18 Dec 2007

Karlasan wrote:
cURL is not available on every server. In particular, phpinfo() on Masterhost showed that it is not installed there (to my surprise).
Also, the XPath links in the post are broken: they all lead to php.net. I would like to see what kind of beast this is.

Well, cURL or no cURL, that is not really what this thread is about.
I am not the one who wrote the post. Anyway, for reference, here is the secret site:

[Hidden content: log in or register to view]

General Fizz · 2 Feb 2008

A regex for extracting URLs:

Code:

(ftp|http|https|gopher|telnet|nntp)://([_a-z\d\-]+(\.[_a-z\d\-]+)+)(([_a-z\d\-\\\./]+[_a-z\d\-\\/])+)*

And for good measure, one for e-mail addresses:

Code:

\b([_a-z0-9-]+(\.[_a-z0-9-]+)*)@([_a-z0-9-]+(\.[_a-z0-9-]+)*)\.([a-z]{2,3})\b

Lifted from Text Pipe Pro
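
The two patterns are given without delimiters or flags, so they need a little wrapping before PHP's preg functions accept them. This is one possible way to do it, assuming "~" as the delimiter and the case-insensitive flag (both are choices made for this sketch, not part of the post); the helper findAll() and the sample text are hypothetical:

```php
<?php
// The two patterns from the post, wrapped in "~" delimiters with the "i" flag.
// Nowdoc strings keep every backslash exactly as typed.
$urlPattern = <<<'RE'
~(ftp|http|https|gopher|telnet|nntp)://([_a-z\d\-]+(\.[_a-z\d\-]+)+)(([_a-z\d\-\\\./]+[_a-z\d\-\\/])+)*~i
RE;
$mailPattern = <<<'RE'
~\b([_a-z0-9-]+(\.[_a-z0-9-]+)*)@([_a-z0-9-]+(\.[_a-z0-9-]+)*)\.([a-z]{2,3})\b~i
RE;

// Hypothetical helper: return every full match of $pattern in $text.
function findAll($pattern, $text)
{
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

// Illustrative input only.
$text = 'Docs: http://example.com/path/page.html, files: ftp://files.example.org/pub/ '
      . '(contact admin@example.com).';
$urls  = findAll($urlPattern, $text);
$mails = findAll($mailPattern, $text);
print_r($urls);
print_r($mails);
```

Note that the e-mail pattern only accepts top-level domains of two or three letters, so an address ending in .info, for instance, is not matched.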

nami144 · 6 Feb 2008

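$dir = scandir('source_files');
foreach ($dir as $i) {
    // skip ".", ".." and hidden files
    if ($i[0] == '.')
        continue;
    // read the saved page under a lock (fread() needs a length above zero)
    $handle = fopen("source_files/$i", "r");
    flock($handle, LOCK_EX);
    $size = filesize("source_files/$i");
    $content = $size > 0 ? fread($handle, $size) : '';
    fclose($handle);
    // collect every absolute http:// URL found in the text
    preg_match_all("/(http:\/\/[^\s><#\'\"]+)/i", $content, $matches);
    // write the URLs, one per line, to a file of the same name
    $handle = fopen("source_files_done/$i", "w");
    flock($handle, LOCK_EX);
    foreach ($matches[1] as $j)
        fwrite($handle, $j . "\n");
    fclose($handle);
}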

I use this script, but for some reason it does not pick up every link.

The algorithm is:
1. save the pages into the source_files folder
2. run the script
3. the source_files_done folder then holds the same files, with nothing left in them but the links
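
One likely reason some links are missed: the pattern only matches absolute URLs that begin with http://, so https:// links and relative links such as href="/page.html" never match. Below is a sketch of a slightly wider pattern, assuming the page text is already in a string; extractHttpUrls() is a hypothetical name and the sample markup is made up:

```php
<?php
// Hypothetical helper: collect absolute http:// and https:// URLs from text.
function extractHttpUrls($text)
{
    // "https?" accepts both schemes; the character class stops at whitespace,
    // angle brackets, "#" and quotes, as in the script above.
    preg_match_all('~https?://[^\s><#\'"]+~i', $text, $matches);
    return $matches[0];
}

$page = '<a href="http://example.com/a">A</a> '
      . '<a href="https://example.com/b?x=1">B</a> '
      . '<a href="/relative/c.html">C</a>';
$urls = extractHttpUrls($page);
print_r($urls);
```

The relative link is still skipped here; catching those needs a pattern anchored on href or an HTML parser such as DOMDocument.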

Yola · 21 Mar 2008

This is the regex I use: preg_match_all("/<\s*a\s.*href\s*=[\"\']([^>]+)[\"\'][^>]*>(.+)<\/a>/Uis",$text,$links);
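
To see what this pattern captures, it can be run on a small sample. In the sketch below the input string is made up for illustration, and PREG_SET_ORDER is used so that each match arrives as one row (that flag is a choice made here, not part of the post):

```php
<?php
// Yola's pattern: group 1 is the href value, group 2 the link text.
// Modifiers: U = quantifiers are lazy by default, i = case-insensitive,
// s = "." also matches newlines.
$pattern = "/<\s*a\s.*href\s*=[\"\']([^>]+)[\"\'][^>]*>(.+)<\/a>/Uis";

// Illustrative input: mixed case, single quotes, and link text spanning two lines.
$text = '<p><a href="one.html">First</a> and '
      . "<A class=\"x\" HREF='two.html'>Second\nlink</A></p>";

preg_match_all($pattern, $text, $links, PREG_SET_ORDER);
foreach ($links as $link) {
    echo $link[1], ' => ', $link[2], "\n";
}
```

The pattern does not check that the opening and closing quotes around the href value are the same character, so unusual markup may still need a parser.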

ortega3000 · 25 Mar 2008

A working code example

Code like this works in all cases.

PHP:

<?php
$a = '<a href="test1.ru">text1</a>
text0
<a target="_blank" href="test2.ru">text2</a>
text0-1
<a target="_blank" href="test4.ru" class="class1">text3</a>';

$re = '/<a.*href\s*=\s*["\']([^"\']+)["\'][^>]*>/U';

if (preg_match_all($re, $a, $m)){
	print_r($m);
}

?>

As you can see, I gave three ways of writing the <a> tag. In all three cases the tag was recognized and the needed information (the URL) was pulled out of it. The output of this script looks like this:

Code:

Array
(
    [0] => Array
        (
            [0] => <a href="test1.ru">
            [1] => <a target="_blank" href="test2.ru">
            [2] => <a target="_blank" href="test4.ru" class="class1">
        )

    [1] => Array
        (
            [0] => test1.ru
            [1] => test2.ru
            [2] => test4.ru
        )

)

In conclusion, I want to add that you need to account for every way a tag can be written, to avoid surprises when applying these expressions.

celerons · 26 Mar 2008

How about tackling this task with Java?

Links on a page that has already been GENERATED can be accessed like this:

[Hidden content: log in or register to view]

dumber · 29 Mar 2008

kaspruk wrote:
1. I extract links with this code, but it glues several consecutive links together!
2. I want the expression to match everything except a particular phrase, but I do not know how. I only found a way to do it for a single character: [^h]* matches text that contains no letter h.
3. And in general, where can I find a good book on regular expressions, preferably with examples?

There are no books; what helped me was http://www.phpfaq.ru/regexp
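
For the second question in the quote, a common PCRE technique is a negative lookahead, (?!...), which lets a pattern advance only at positions where a given phrase does not begin. Here is a minimal sketch; the phrase "logout" and the sample links are made up for illustration:

```php
<?php
// Match href values that do not contain the phrase "logout" anywhere.
// (?:(?!logout)[^"])* consumes one non-quote character at a time, but only
// at positions where "logout" does not start.
$pattern = '~href="((?:(?!logout)[^"])*)"~i';

$html = '<a href="/profile">P</a> <a href="/user/logout?x=1">L</a> <a href="/news">N</a>';

preg_match_all($pattern, $html, $m);
$kept = $m[1];   // href values without the excluded phrase
print_r($kept);
```

The same construct generalizes the [^h]* idea from a single character to a whole phrase.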

General Fizz · 29 Mar 2008

dumber wrote:
There are no books...

There are books; one of the best is available right here on the forum.
