Peter Jacso Takes on Google Scholar: Finding Ghost Authors, Lost Authors, and Other Problems
Access the Full Text of the Entire Article
With all of the talk about Google Book Search lately, little has been written about Google Scholar. Now, in a lengthy and well-documented analysis (with numerous screenshots) published in Library Journal, Dr. Peter Jacso of the University of Hawaii at Manoa, a monthly columnist for Gale/Cengage and a friend of ResourceShelf, documents some of the problems (two of them named in the article's title) that he has found while using Google Scholar [GS] over the past several months. In fact, some of the problems go back years.
Here are just a few passages from Dr. Jacso’s article that we found to be of greatest interest:
They [the Google Scholar developers] decided—very unwisely—not to use the good metadata generously offered to them by scholarly publishers and indexing/abstracting services, but instead chose to try and figure them out through ostensibly smart crawler and parser programs.
Millions of records have erroneous metadata, as well as inflated publication and citation counts.
A free tool, Google Scholar has become the most convenient resource to find a few good scholarly papers—often in free full-text format—on even the most esoteric topics. [Our emphasis] For topical keyword searches, GS is most valuable. But it cannot be used to analyze the publishing performance and impact of researchers.
Very often, the real authors are relegated to ghost authors deprived of their authorship along with publication and citation counts. [Our emphasis] In the scholarly world, this is critical, as the mantra “publish or perish” is changing to “publish, get cited or perish.”
[Our emphasis] While GS developers have fixed some of the most egregious problems that I reported in several reviews, columns and conference/workshop presentations since 2004—such as the 910,000 papers attributed to an author named “Password”—other large-scale nonsense remains and new absurdities are produced every day.
The numbers in GS are inflated for two main reasons. First, GS lumps together the number of master records (created from actual publications) and the number of citation records (distinguished by the prefix: [citation]) when reporting the total hits for an author name search.
…fee-based Web of Science and Scopus have lower article and citation counts and scientometric indicators, as they have a far more selectively defined source base with fewer journals from which to gather publication and citations data. In addition, they count only the master records for the authors’ publication count (as they should), and keep the stray and orphan citations in a separate file.
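The counting difference Jacso describes can be made concrete with a small sketch. This is purely illustrative (the record sample and function names are hypothetical, not Google Scholar's or Jacso's code): it contrasts counting every record, masters and stray [citation] entries alike, with counting only master records, as Web of Science and Scopus do.

```python
# Illustrative sketch: how lumping master records together with stray
# [citation] records inflates an author's apparent publication count.
# The sample records below are hypothetical.

records = [
    {"title": "Metadata quality in digital libraries", "type": "master"},
    {"title": "Open access and indexing", "type": "master"},
    {"title": "[citation] Metadata quality in digital libraries", "type": "citation"},
    {"title": "[citation] An uncrawled 1987 report", "type": "citation"},
    {"title": "[citation] Another stray reference", "type": "citation"},
]

def inflated_count(records):
    """Count every record -- the lumping behavior Jacso attributes to GS."""
    return len(records)

def publication_count(records):
    """Count only master records, keeping citation records separate,
    as the article says Web of Science and Scopus do."""
    return sum(1 for r in records if r["type"] == "master")

print(inflated_count(records))     # 5 -- masters plus stray citation records
print(publication_count(records))  # 2 -- actual publications only
```

The same two-record author thus appears more than twice as productive under the lumped count, which is the inflation the article warns against when GS numbers are used for performance evaluation.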
Unfortunately, the bad metadata has a long reach. These numbers are taken at face value by the free utilities such as the Google Scholar Citation Count gadget by Jan Feyereisl and the sophisticated and pretty Publish or Perish (PoP) software (produced by Tarma Software).
As about 10.2 million records from GBS [Google Book Search] are incorporated now in GS, the metadata disaster likely will continue unabated. It is bad enough to have so many records with erroneous publication years, titles, authors, and journal names.
In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or—in a whopping number of records—Limited (234,000) and Ltd (452,000). The numbers kept growing by several hundred thousand hits for the cumulative total of the above “authors” during the few days this paper was being written. More screenshots are available here.
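A toy sketch shows how such "ghost authors" can arise. This is a hypothetical heuristic, not Google Scholar's actual parser: a crawler that grabs the first short, capitalized line near the top of a page as the author line will happily pick up a section heading like "Methods" instead of the real byline further down.

```python
# Illustrative sketch (hypothetical heuristic, not GS's parser): naive
# author extraction that mistakes a section heading for an author name.

def guess_author(lines):
    """Return the first short, capitalized line -- a fragile heuristic."""
    for line in lines:
        words = line.split()
        if 0 < len(words) <= 3 and line[0].isupper():
            return line
    return None

# Hypothetical text lines scraped from the top of a paper's first page.
page = [
    "quarterly performance review",  # lowercase running header, skipped
    "Methods",                       # section heading, looks author-like
    "P. Jacso",                      # the real author, further down
]

print(guess_author(page))  # "Methods" -- a section title becomes a "ghost author"
```

Under this kind of heuristic the real author never surfaces at all, which is exactly the ghost-author/lost-author pattern the article documents.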
Lost Authors
These errors could be considered relatively harmless if they did not affect the contributions of genuine, real scholars. But the biggest problem is when the mess replaces real scholars with ghost authors, leaving the former as lost authors.
[Our emphasis] Certainly the entire database isn’t rotten, just a few million records. That may be a relatively small percentage—Google won’t reveal the total number of records, and these are just my few forensic search test queries—but there’s ample cause for worry.
In the case of GBS [Google Book Search], Google relied on its collective Pavlovian reflex to blame the publishers and libraries (meaning the librarians, catalogers, indexers) for the wrong metadata.
In the case of Google Scholar, these same Googlish arguments will not fly, because practically all the scholarly publishers gave Google—hats in hand—their digital archive with metadata. The idea was to have Google index it and drive traffic to the publishers’ sites.
Yes, GS has fixed fairly quickly some of the major errors that I earlier used to demonstrate its illiteracy and innumeracy, but has so far left millions of others untouched.
GS designers have sent very under-trained, ignorant crawlers/parsers to recognize and fetch the metadata elements on their own. Not all of the indexing/abstracting services are perfect and consistent, but their errors are dwarfed by the types and volume of those in GS. This is the perfect example of the lethal mix of ignorance and arrogance GS developers applied to metadata and relevance ranking issues.
The parsers have not improved much in the past five years despite much criticism. GS developers corrected some errors that got negative publicity, but these were Band-Aids, where brain surgery and extensive parser training is required. Without these, GS will keep producing similar errors on a mega-scale.
Source: http://www.resourceshelf.com/2009/09/24/google-scholar%E2%80%99s-ghost-authors-lost-authors-and-other-problems/