User:OldCaliber


Original Galaxian.

A snippet of Perl to transform a list of HTML anchors into a wikitext list:

    while (<>) {
      # Split on '<', '>' and '"' so that for a line like
      # <a href="page.html">Link text</a>, the href lands in
      # $elts[2] and the link text in $elts[4].
      my @elts = split(/[<>"]/, $_);
      print "*[http://some-site/$elts[2] $elts[4]]\n";
    }

Run it like so:

    perl the-file-with-the-above < some-list-of-HTML
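For example (the path and link text are made up for illustration), an input line such as

    <a href="articles/foo.html">Some article</a>

comes out as

    *[http://some-site/articles/foo.html Some article]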

The script works a line at a time, so you might have to get the HTML into the appropriate form first (one anchor per line) using Vim or whatever; one possible one-liner for that is sketched below.
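A minimal sketch of such a clean-up step, assuming the anchors are simple and none of them spans a line break (page.html is a stand-in file name):

    perl -ne 'print "$1\n" while m{(<a [^>]*>[^<]*</a>)}gi' page.html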

Web scraping with Perl

The following small bit of Perl shows how to scrape article lists etc. from web sites (it is set up for Taki's mag, but it needs generalizing):

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use URI;
use lib "lib";
use Web::Scraper;
use YAML; # for dumping the scraped structure while debugging

my $uri = shift @ARGV or die "URI needed\n";

# CSS selectors for the pieces we want from each page
my $scraper = scraper {
    process "h3.title > a",    'list[]'   => { link => '@href', text => 'TEXT' };
    process "p.byline",        'bylist[]' => { byline => 'TEXT' };
    process "p.date",          'dalist[]' => { date => 'TEXT' };
    process ".pagination > a", 'pag[]'    => { link => '@href', text => 'TEXT' };
};

my $links;

while ($uri) {
  $links = $scraper->scrape(URI->new($uri));
  warn Dump $links; # dump the whole scraped structure for debugging
  print "List element count: " . scalar @{$links->{list}} . "\n";
  print "Byline count: " . scalar @{$links->{bylist}} . "\n";
  print "Date count: " . scalar @{$links->{dalist}} . "\n";

  # Skip the first title as it is not an article; the byline and date
  # lists have no such extra entry, hence the $i - 1 offsets below.
  for my $i (1 .. $#{$links->{list}}) {
    if ($links->{bylist}[$i - 1]->{byline} =~ m/Steve Sailer/) {
      print "|[" . $links->{list}[$i]->{link} . " " . $links->{list}[$i]->{text} . ", " . $links->{dalist}[$i - 1]->{date} . "]\n";
    }
  }

  # Look for a "next page" link (labelled ">") in the pagination bar
  $uri = undef;
  for my $i (0 .. $#{$links->{pag}}) {
    if ($links->{pag}[$i]->{text} =~ m/>/) {
      $uri = $links->{pag}[$i]->{link};
      sleep 5; # Be polite: pause before fetching the next page
    }
  }
}
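To run it, give the script the first listing page on the command line (the script name and URL here are made up for illustration):

    perl scrape-takimag.pl 'http://takimag.com/contributor/whoever'

Each matching article comes out as a wikitext table cell like

    |[http://takimag.com/article/some_article Some Article Title, January 01, 2010]

ready to be pasted into a wiki table.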