Wednesday, November 09, 2005

Building an Online Library, One Volume at a Time

WSJ.com - By DAVID KESMODEL and VAUHINI VARA : "On a recent morning at a library at the University of Toronto, Liz Ridolfo plucked a century-old book from a shelf and placed it onto an oddly shaped metal contraption. The apparatus included a desk, two sheets of glass, two digital cameras, a foot pedal and a computer.

Ms. Ridolfo, 25 years old, used the custom-made machine to convert a 462-page book -- "English Literature: An Illustrated Record, from the Age of Johnson to the Age of Tennyson" -- into a digital file that will be placed on the Internet so anyone can read it and search its contents. The task took Ms. Ridolfo two hours, longer than most books do, because of a special complication: The book included several copies of handwritten letters by authors that folded out from the pages and were difficult to photograph.

"This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and didn't reply. Then she walked outside for a cigarette break, pausing along the way to rub her neck.

Ms. Ridolfo is part of a massive undertaking to digitize the world's books. She is one of about a dozen scanners employed by the Internet Archive, a San Francisco nonprofit group that is spearheading the Open Content Alliance, a consortium of business and educational groups that includes Microsoft Corp., Yahoo Inc., Hewlett-Packard Co., Adobe Systems Inc. and several university libraries.

The group wants to build an online library of millions of old books and hopes to make a big batch accessible through Web searches as early as next year. For all its technical sophistication, the group needs the manual work of people like Ms. Ridolfo to make digitization a reality.

Google Inc.'s unrelated book-scanning efforts have come under fire because some publishers and authors say the search giant is violating copyright law. But the Open Content Alliance has sidestepped legal troubles by focusing on books published before 1923 -- and therefore out of copyright in the U.S. -- as well as some newer books publishers have allowed it to scan.

The human book scanners are in a fledgling line of work that requires meticulous attention to detail and tolerance for repetitive tasks. "I find that some people get into the Zen of it, and then some people are terrible at this type of task," said Brewster Kahle, an entrepreneur who made millions of dollars selling companies to Amazon.com Inc. and Time Warner Inc.'s America Online, and in 1996 founded the Internet Archive.

A Five-Hour Shift

Ms. Ridolfo, a petite, soft-spoken woman with glasses, scans books four times a week in five-hour shifts, and is paid 12 Canadian dollars (US $10.15) an hour. She's one of four scanners who work in a windowless room on the second floor of the four-story John M. Kelly Library, part of the University of St. Michael's College, a division of the University of Toronto offering courses in religion, book studies and philosophy. Only two scanners can work at any one time because the office has just two scanning machines. The library loaned the room to the Internet Archive and is having many of its own books digitized there.

During each shift, Ms. Ridolfo sits in a soft office chair in front of an apparatus designed by engineers from the Internet Archive. The machine is about six feet tall and five feet wide, and is largely covered with a black tarp to keep out light. Ms. Ridolfo places each book on a V-shaped tray beneath two sheets of glass, also in a V-shape. Two digital cameras hang above her, mounted on brackets linked to the rest of the machine. The camera over her right shoulder is angled to snap photos of the left page; the camera at her left shoots the right page.

Ms. Ridolfo uses a foot pedal to raise the glass up and down so she can quickly turn the pages for each new photo. The pages show up on a computer screen in front of her, and she uses a mouse to crop pages and make other adjustments when necessary. She can click the mouse to snap a photo, or she can set the computer to have photos automatically taken every 10 seconds or so.

Many of the books Ms. Ridolfo scans are rare texts that are at least 100 years old, so she must handle them delicately. In some cases, the binding has fallen apart and the books are tied together with a ribbon.

Ms. Ridolfo began scanning books for the Internet Archive in September after responding to an ad on Craigslist, a popular Web portal. She has a bachelor's degree in English from York University in Toronto and tutors Korean high school students in English as a part-time job. She's long been fascinated by books -- and not just for reading. "I like to go into used bookstores and smell the paper and look at the colors," she said. Her colleagues sometimes chuckle when they catch her sniffing an old text.

Each Toronto scanner works part-time. Mr. Juszel, head of the book-scanning operation for the Internet Archive's Toronto office, assigns books for the scanners before their shifts begin. After a book is scanned, Mr. Juszel transfers all the relevant computer files to the Internet Archive so they can be uploaded to its Web site, www.archive.org, and shared with its partners. He also makes sure the books are returned to the entity that donated the book for scanning -- typically a university.

Early Stages

The Internet Archive's effort to get books online is still in its early stages. In the little more than a year since the group started scanning books, it has digitized just 2,800 books, at a cost of about $108,250. Funding has come largely from libraries that have paid to have their texts digitized. Work will likely speed up now that Microsoft and Yahoo are on board; both companies joined the effort in October. Microsoft has pledged to pay for the scanning of about 150,000 books from collections at the U.K.'s British Library and elsewhere, and Yahoo will fund the scanning of 18,000 American classics at the University of California.

Mr. Kahle estimates it costs about 10 cents a page to get a book online, taking into account equipment, labor and the cost of hosting the pages on the Internet Archive's Web servers.

The funding from Microsoft and Yahoo will be used to expand the scanning operation. So far, books are being scanned at just two locations, in San Francisco and Toronto. Participating libraries ship their books to those scanning centers, where a total of eight scanning machines are in use. The group hopes to use new funding to buy more machines, which cost $20,000 to $40,000 each (the more expensive machines can work faster, and can accommodate larger books).

An Unexpected Challenge

On a recent morning, Ms. Ridolfo and a fellow scanner, LaJolla Young, worked quietly. Both used headphones to help pass the time -- a Canadian news broadcast for Ms. Ridolfo, '80s music for Mr. Young. The only noise in the room was a persistent hum from two big computer hard drives under the desk of Mr. Juszel, the supervisor.

Ms. Ridolfo scanned her first book -- an early-20th-century copy of works by William Shakespeare -- in about 40 minutes. Then she encountered her toughest assignment of the shift: the book about English authors, which weighed 10 pounds. The most vexing part came between pages 364 and 365, where Ms. Ridolfo found a copy of a lengthy, two-sided letter by Robert Louis Stevenson, written in cursive. Mr. Young, a more experienced scanner, helped Ms. Ridolfo figure out how to position the book on her scanning machine to capture a clear image of the entire letter.

Her neck became sore during the endeavor. "It only really happens with the books that are really challenging," she said. During her cigarette break afterwards, she added, "That book was cool, though. It was a lot more visually interesting than a lot of books."

Many of the books Ms. Ridolfo scans are part of specific collections that the Internet Archive is digitizing on behalf of its partners. She has recently scanned parts of the university's rare collection of works by or about John Henry Cardinal Newman, a 19th-century English theologian, and its collection of books by English author G.K. Chesterton.

But her first two books on this day were randomly selected by Mr. Juszel from the stacks at the library while he waited for library workers to bring books from the Chesterton collection to the scanning room. "If we're ever in a lull, I just have to walk out into the stacks and say, 'OK, what's out of copyright?'" said Mr. Juszel. "We're in a library. There's no lack of material."

The Internet Archive closely tracks each book that has been scanned, and a computer alerts employees if they try to scan a book that has already been digitized.

It takes Ms. Ridolfo and the other scanners about one hour to scan a 500-page book. If a book is in poor condition -- if pages are hanging by a thread, for instance -- it can take several hours. The Toronto group, which is responsible for the bulk of the books scanned so far by the project, has digitized books from the Library and Archives of Canada, the University of Ottawa and St. Mary's College in California. The oldest book scanned was a 1475 title, "The City of God," by St. Augustine. The books can be searched on a page for books from Canadian libraries.

Ms. Ridolfo finished her shift by scanning two books by Mr. Chesterton and most of a third -- bringing her total to almost five books in five hours. She has scanned about 125 books for the project.

She said the job is one of the best she has ever had. She has worked other repetitive jobs, including stocking shelves at a grocery store and working in a beer-nut factory. Before she started book scanning, she took a temp job stapling sample sticks of chewing gum to fliers handed out at Toronto nightclubs. "It was akin to hell," she said.

As a book scanner, she gets to peruse old writings and illustrations and find gems between the pages -- such as a 1915 postcard from a son to his father that she recently discovered. "You get into a rhythm," she said. "If you have a really good book, you look up and a half an hour is passed. It's kind of like meditation."
