August 8, 2010 by yorksranter 0 comments

scraping the barrel

I’ve finally got around to answering my own question here. The scraper is a work in progress at the moment; pdftohtml renders the original PDF into a tiresomely semi-structured (i.e. worse than no structure) tagpile. I was trying to tackle this through recursion, but I might instead try either Python’s continue keyword or pre-tokenising the document based on the number of blank lines between blocks, and then dealing with the blocks.
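
For what it's worth, the pre-tokenising idea might look something like the sketch below. It assumes the tags have already been stripped and the document is a plain string; the function name split_blocks and the threshold of two blank lines are made up for illustration, not taken from the actual scraper. It also happens to use continue to skip the blank lines rather than recursing.

```python
def split_blocks(text, min_blank_lines=2):
    """Group non-blank lines into blocks, starting a new block whenever
    at least `min_blank_lines` consecutive blank lines are seen.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    blocks = []      # finished blocks, each a list of stripped lines
    current = []     # the block being built
    blank_run = 0    # consecutive blank lines seen since the last text line

    for line in text.splitlines():
        if not line.strip():
            blank_run += 1
            continue  # blank lines only measure the gap; they are never stored
        if blank_run >= min_blank_lines and current:
            # The gap was big enough: close the current block and start another.
            blocks.append(current)
            current = []
        blank_run = 0
        current.append(line.strip())

    if current:
        blocks.append(current)
    return blocks
```

Each block comes back as a list of lines, so whatever deals with the blocks afterwards can treat the first line as a heading or not, as the data demands. The right threshold would have to be found by looking at what pdftohtml actually emits for this document.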

This all depends on the thing actually having any underlying structure, of course – it may be assembled by copy-and-paste, so anything I do will blow up every month. The things I do for England…

