While testing whether my scripts run under Vista, I found a little script, an indexer for web pages. After some years, I tried to make it ready for Unicode.
After trying some modules, I did not find a good solution. The problem is: to convert a website to UTF-8, Perl has to know the encoding of the original HTML code.
After hours of trying, I found this way to convert any website (?) to UTF-8.
Here is the script:
#!/usr/bin/perl
print <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
# count words longer than 1 character
$woerter{$_}++ for grep { length($_) > 1 } split(/\n/, $string);
# and show them ("mal" is German for "times")
foreach (sort(keys %woerter)){print "$_: $woerter{$_} mal<br>\n";}
open (out,">test.txt") || die ("fehler");   # "fehler" is German for "error"
#binmode(out, ":utf8");
foreach (sort(keys %woerter)){print out "$_: $woerter{$_} mal<br>\n";}
close out;
#######################
# loads a page from the Internet
####################################
sub lade_seite{
use LWP::UserAgent;
use HTTP::Request;
use Encode;
my ($url) = @_;
print $url;
my $content;
my $encoding;
my $request = HTTP::Request->new(GET => $url);
my $ua = LWP::UserAgent->new;
# agent() takes only the header value, so the "User-Agent: " prefix is dropped
$ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4');
################################################################
# removes duplicate entries from a list
################################################################
sub del_double{
my %all=();
@all{@_}=1;
return (keys %all);
}
To test the script, click here. The script loads the start page of this blog and shows its results.
I tested this script (of course) with German, Greek, Turkish, Hebrew, Icelandic and Japanese characters... and it seems to work (by the way: I can NOT read Hebrew, Icelandic or Japanese letters... but I think they are right).
So, what does this script do, and how does it work?
The script fetches a website from the web and converts it to UTF-8. After that, it converts all the words (without the HTML) to lowercase and shows how many times each word appears. So far this is very easy... without UTF-8.
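The counting step can be sketched roughly like this. It is only a minimal sketch: the helper name count_words, the naive tag-stripping regex and the sample string are my assumptions for illustration, and the input is assumed to be an already decoded character string.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings when printing

# Hypothetical helper: count the words of an already decoded HTML string.
sub count_words {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;      # naive tag removal, good enough for a sketch
    my %count;
    for my $word (split /\W+/, lc $text) {    # lowercase first, then split on non-word characters
        $count{$word}++ if length($word) > 1; # ignore empty fields and single characters
    }
    return %count;
}

# Sample input: the same Greek word in upper and lower case, plus "ok" and "a".
my %woerter = count_words("<p>\x{39a}\x{391}\x{39b}\x{397} \x{3ba}\x{3b1}\x{3bb}\x{3b7} ok a</p>");
print "$_: $woerter{$_}\n" for sort keys %woerter;
```

Both spellings of the Greek word end up under the same lowercase key, while the single letter "a" is skipped.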
With UTF-8, everything is different. Calling lc($word) on text that has not been decoded correctly can destroy letters, because lc then works on single bytes instead of whole characters.
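Here is a small sketch of what can go wrong, using the two UTF-8 bytes of a capital A with diaeresis (U+00C4) as an assumed sample. Depending on how Perl stores the undecoded bytes, lc either leaves the letter untouched or changes one byte of the sequence. Only the decoded string is lowercased correctly.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\x84";                      # UTF-8 bytes of U+00C4, not decoded

# Case 1: plain byte string. lc only touches ASCII, so the letter stays uppercase.
my $lc_bytes = lc $bytes;

# Case 2: the same bytes, but stored internally as characters.
# lc now reads 0xC3 as U+00C3 and turns it into 0xE3: the UTF-8 sequence is broken.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $lc_broken = lc $upgraded;

# Case 3: decode first, then lc. The single character becomes U+00E4.
my $lc_chars = lc(decode('UTF-8', $bytes));

# Print the code points of each result as dot-separated hex values.
printf "%vX | %vX | %vX\n", $lc_bytes, $lc_broken, $lc_chars;
```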
The solution is Encode::decode($charset, $content), where $charset is the charset declared in the web page. If no charset is given, or the charset is UTF-8, the script treats the page as already being UTF-8. If not, decode converts the content from that charset into Perl's internal Unicode string, which can then be written out as UTF-8.
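The decoding step could look roughly like the following sketch. The helper names guess_charset and to_characters, the simplified charset regex, and the sample page are assumptions for illustration. Unlike the description above, this sketch also decodes UTF-8 pages, so that lc always sees characters.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: look for "charset=..." in the HTML (simplified pattern).
sub guess_charset {
    my ($html) = @_;
    return $html =~ /charset\s*=\s*["']?([\w.:-]+)/i ? $1 : undef;
}

# Hypothetical helper: turn the raw bytes of a page into a character string.
sub to_characters {
    my ($bytes, $charset) = @_;
    # Assumption: fall back to UTF-8 if no charset is given or Encode does not know it.
    $charset = 'UTF-8' unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# Sample page in ISO-8859-7: the bytes 0xC1 0xC2 are the Greek capitals Alpha and Beta.
my $page  = qq{<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-7"><p>\xC1\xC2</p>};
my $text  = to_characters($page, guess_charset($page));
my $lower = lc $text;    # now works on characters: Alpha and Beta become alpha and beta
```

When the page is fetched with LWP, the decoded_content method of the HTTP::Response object is another option, since it decodes the body based on the response headers.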