Monday, December 31, 2018

Disneyland Thesaurus 2018 Year in Review

Although I have been quiet on the blogging front, this year has been the most productive on the thesaurus itself since it came into existence on August 19, 2007. In raw numbers, in 2018 I added 32,471 terms, or about 32% of the current total of 101,868 terms. The number of relationships has more than doubled, from 216,414 to 434,653. But adding terms and relationships is only one aspect of the overall project. The following summary stretches back to the last quarter of 2017, when I resumed work on the thesaurus in earnest.

Identifying and Acquiring Sources
The thesaurus is built around information from reputable sources, so it's always been driven by a need to first build a bibliography of Disneyland sources and then acquire them (mostly as digital surrogates). In years past this has involved a lot of scanning (of Disneyland Lines in particular). This year has been much more focused on acquiring digital objects—tens of thousands of them, in fact.

I maintain two Excel tracking spreadsheets to document my work with sources, one titled Publication Tracking Spreadsheet and the other Web Material Tracking Spreadsheet. Web Material had been stagnant, with only a few of the vintage Disneyland blogs listed and only a little work done years ago to locally save the files. So, one project was to not only expand the index (by copying and pasting blog links and titles and then filling in the dates), but also to save the content for when it eventually goes offline. I am pleased to report that I saved all the Disney-related content from the "big three" vintage Disneyland blogs: Stuff from the Park, Gorillas Don't Blog, and Davelandblog. This involved saving each individual relevant post as an MHT file (which combines the HTML and graphic elements), but also each of the photos individually. I did this for a number of other blogs as well.

I have saved 10% of the Disney Parks Blog posts I've identified as relevant to the Disneyland Resort (4,914). This endeavor is a bit more complicated because of the frequent inclusion of videos, which have to be saved through a separate process. I had initially relied on the sketchy sites where you input the video URL and, after being presented with some ads and viruses, get a download link that often doesn't work. I researched and found a program called YouTube-DLG, which has made it much easier to download videos in bulk. (It is still a chore, however, to extract the video URLs and associate the downloaded files with my saved blog posts.) This blog-saving activity has yielded 54,895 files, totaling 35GB. The volume will rise rapidly as I save additional videos.
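The URL-extraction chore mentioned above could be partly scripted. The following is only a rough sketch, not the process I actually use: it assumes the HTML text of each saved post has already been pulled out of its MHT wrapper, and the host list and the names video_urls and build_manifest are made up for illustration.

```python
from html.parser import HTMLParser


class VideoLinkParser(HTMLParser):
    """Collect src/href values that point at an assumed set of video hosts."""

    VIDEO_HOSTS = ("youtube.com", "youtu.be")  # assumption: adjust to the hosts actually used

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Only look at tags that commonly carry an embedded or linked video.
        if tag not in ("iframe", "a", "embed"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                if any(host in value for host in self.VIDEO_HOSTS):
                    if value not in self.urls:  # skip repeats within one post
                        self.urls.append(value)


def video_urls(html_text):
    """Return the video URLs found in one post's HTML, in document order."""
    parser = VideoLinkParser()
    parser.feed(html_text)
    return parser.urls


def build_manifest(posts):
    """Map each saved post to its videos.

    posts is a dict of saved-post filename -> HTML text.
    Returns a list of (filename, url) pairs, one per video.
    """
    return [(name, url) for name, html in posts.items() for url in video_urls(html)]
```

The resulting list of (filename, url) pairs could be pasted into a tracking spreadsheet, so each downloaded video can be matched back to the post it came from.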

Another aspect of digital collection has been newspaper articles. For the first time since 2010 I made an effort to augment this source type and bring it up to date. In 2018 I saved an additional 1,639 newspaper articles. I now have 18,007 newspaper articles and advertisements, the bulk of them from the Los Angeles Times and Orange County Register. I still need to save the Times from 2001-2018.

Finally, I had long neglected researching in the Anaheim Heritage Room's collection of Disneyland Lines, which they have as a Disney depository of the Walt Disney Archives. Since the Archives wasn't founded until June 1970, and the Lines I was missing were mainly in 1969 and 1970, I didn't think the collection would contain what I needed. I was pleasantly surprised to find how wrong I was. I added images from 21 Disneyland Lines from 1969-1971 and input information from them in the thesaurus.

Thesaurus Work
The bulk of the new terms has come about from two activities. The first is that I decided it would be valuable to include information on the broader Disney universe, both to contextualize existing terms and because you never know when synergy will bring what I call "general Disney" terms into direct Disneyland relevance. (Did Evinrude ever appear at Disneyland prior to Kevin and Jody integrating him into Mickey's Soundsational Parade?) I began this work with the 2016 edition of Disney A-Z and then started fresh with the encyclopedia found on (Although you would expect the online version to be the most complete and up-to-date, I have found in some cases that the 2016 print edition was more thorough. I've been keeping a running tally for a future blog post.)

As of today, the thesaurus has 9,993 General Disney terms. I've tagged them so they can be stripped from any thesaurus outputs, as needed. I still have 3,399 entries to input (which will yield many more terms than that).

The other activity (which contributed to the service anniversary milestone detailed here) has been adding Cast Member names from lists in Disneyland Resort Lines of recent years. Well, two types of lists. One is the service anniversary list published every month. After five years, the service anniversaries for 15 years and above should be duplicative, but the 10-year names are almost always new to the thesaurus. As of today, the thesaurus has 29,249 Cast Member names. I'm going to go out on a limb to suggest that it is the largest such list outside of a Disney HR system.

I was unfortunate enough to discover that the July 17, 2005, edition of the Disneyland Resort Line contains the names of 18,700 Disneyland Resort Cast Members as of June 11, 2005. In the print edition, you could barely make out the names. But if you had access to an electronic version, you could copy the names and run some find-and-replace actions to end up with an Excel spreadsheet sorted alphabetically by last name. I tried various ways of automating the input of this information, but because I already had so many names in the thesaurus, the tedious prep work led to too many errors for my comfort. I've made use of a program to expand the Windows clipboard (Ditto) and at least make the entry as painless as possible. I've input information on 9,775 people I have linked to a "50th Anniversary Cast Members" term. Many of these people have additional citations from service anniversary lists or telephone directories, or mentions in the Disneyland Line.
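For anyone curious what the copy-and-sort step looks like outside of Excel, here is a minimal sketch. It assumes one "First Last" name per line, which may not match how the names are actually laid out in the electronic edition, and it naively treats the final word as the last name (so a multi-word surname would be split incorrectly). The function names are hypothetical.

```python
import csv
import io


def split_name(full_name):
    """Return (last, first), naively treating the final word as the last name."""
    parts = full_name.split()
    if len(parts) == 1:
        return (parts[0], "")
    return (parts[-1], " ".join(parts[:-1]))


def sort_by_last_name(raw_text):
    """Turn pasted text (one name per line) into rows sorted by last, then first name."""
    names = [line.strip() for line in raw_text.splitlines() if line.strip()]
    rows = [split_name(name) for name in names]
    return sorted(rows, key=lambda row: (row[0].casefold(), row[1].casefold()))


def to_csv(rows):
    """Render the sorted rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["Last", "First"])
    writer.writerows(rows)
    return buffer.getvalue()
```

This only covers the sorting. Matching the sorted names against names already in the thesaurus is the part that caused me trouble, and the sketch does not attempt it.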

I tried a new way of putting the thesaurus to use. One goal is that it could be used to index digital objects, be they photographs or documents or blog posts or what have you. I created a new field (IDX) to try applying the thesaurus in this manner. I indexed photos from 116 posts from Gorillas Don't Blog and 101 posts from Main Gate Admission. It was an interesting exercise, but I want to put more thought into it before continuing further. It did reveal the need for additional terms, such as construction (activity), photographs from Frontierland rivercraft, and photographs from Skyway.

I also took a step toward improving the possibility that I could "complete" the thesaurus. That step was defining for myself what "complete" means: going through a defined set of materials to draw out terms and relationships. In rough terms, that includes going through the entire run of some internal publications (chiefly the Disneyland (Resort) Line), all newspaper articles and advertisements, a sampling of Park guidemaps through the years (preferably at least one per year), Disney A-Z, the Disney Parks Blog, and some other publications and web material. There would certainly be missing information, but I'd be comfortable that I had made a legitimate effort of comprehensively examining a substantial collection of reliable sources. (And, of course, if I reach that goal, I would be more than happy to add from other sources.)

When I first started the thesaurus, I did not source where particular information came from. I quickly realized the problem with this approach and reoriented my work to include this information—both for the term itself and the substantive information about each term. This acted sort of like a citation file a lexicographer might use, but with more encyclopedic-type information. The obvious downside is that this takes a very long time. I have come to a compromise that preserves two essential components of the thesaurus: firstly that it contain terms and relationships covering all aspects of Disneyland from the earliest days to the present, and secondly that the source for terms is documented, as a nomenclature reference.

Now as I go through a source I will add its title to the SRC field for any terms which are mentioned. This project is by and large about proper nouns, so I wouldn't necessarily SRC the word "food" every time I saw it. But I would for Ron Dominguez (or R. Dominguez, or Ronald Dominguez, or Ronald K. Dominguez, as I have also found). Where there is substantive information about a term, I will also include that information in a newly created field titled (for now) TT ("to thesaurus"). As of right now, all the sources I haven't used are essentially closed. For the 1,400 Disneyland Lines I haven't yet gone through, I wouldn't be able to tell you which might have an article on the Main Street Magic Shop or on Fantasmic! This new approach promises to make it possible for me to dramatically decrease the amount of time I spend with each source, but still make the information findable.
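To make the SRC and TT idea concrete, here is a small sketch of how such records might be modeled. This is not how my thesaurus software stores anything. The class names, method names, and source titles are invented for illustration; only the SRC and TT field names and the name variants come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    preferred: str
    variants: set = field(default_factory=set)
    src: list = field(default_factory=list)  # SRC: titles of sources that mention the term
    tt: list = field(default_factory=list)   # TT: (source, note) pairs to work in later


class Thesaurus:
    def __init__(self):
        self._terms = {}   # preferred name -> Term
        self._lookup = {}  # casefolded preferred name or variant -> preferred name

    def add_term(self, preferred, variants=()):
        term = Term(preferred, set(variants))
        self._terms[preferred] = term
        for name in (preferred, *variants):
            self._lookup[name.casefold()] = preferred
        return term

    def cite(self, name, source, note=None):
        """Record that `source` mentions `name`, under whichever variant it used."""
        term = self._terms[self._lookup[name.casefold()]]
        if source not in term.src:
            term.src.append(source)
        if note:
            term.tt.append((source, note))
        return term

    def sources_for(self, name):
        """Answer 'which sources mention this term?' without rereading them."""
        return list(self._terms[self._lookup[name.casefold()]].src)
```

A quick usage example, with made-up source titles:

```python
thesaurus = Thesaurus()
thesaurus.add_term("Ron Dominguez", ["R. Dominguez", "Ronald Dominguez", "Ronald K. Dominguez"])
thesaurus.cite("R. Dominguez", "Hypothetical Source A")
thesaurus.cite("Ronald K. Dominguez", "Hypothetical Source B", note="substantive detail to add later")
print(thesaurus.sources_for("Ron Dominguez"))
```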

When we published Jason's Disneyland Almanac in 2011, we were missing Park hours from 773 dates during that time period. I am pleased to report that the number of days missing Park hours is now down to 34, between January 5, 1997, and June 30, 1999. Having access to the Los Angeles Times through allowed me to see the Calendar section from the 1980s and 1990s, which frequently contained the Park hours and is not available in the paper's text database. For more recent years I went through all the Twitter posts of @DisneylandToday. In addition to being a reliable source publishing hours day-of (always preferred to far in advance), the account also posted updates on the rare occasion of a change. Additionally, I brought the weather information in the thesaurus up to date through 2017.

Finally, I made some progress on cleaning up the hierarchy. This is a never-ending battle as new terms and (especially) concepts are introduced. I know some parts of the thesaurus desperately need work. It is very difficult to see the hierarchy from within the thesaurus construction program itself; I was lucky to have a friend create a way for me to expand and contract the hierarchy at various levels to help me as I make decisions in this vein.
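I don't know the details of how my friend's tool works, but the general idea of expanding and contracting a hierarchy to a chosen level can be sketched briefly. This assumes the hierarchy is held as nested dictionaries (term -> narrower terms); the sample tree in the usage example is purely illustrative and is not the thesaurus's actual structure.

```python
def render(tree, max_depth, depth=0):
    """Return indented lines for `tree`, showing at most `max_depth` levels.

    A term whose narrower terms are hidden is marked "+", an expanded
    term is marked "-", and a term with no narrower terms gets no marker.
    """
    lines = []
    for term in sorted(tree):
        children = tree[term]
        collapsed = bool(children) and depth + 1 >= max_depth
        marker = "+ " if collapsed else "- " if children else "  "
        lines.append("  " * depth + marker + term)
        if children and not collapsed:
            lines.extend(render(children, max_depth, depth + 1))
    return lines
```

For example, with an illustrative tree:

```python
sample = {
    "Disneyland": {
        "Frontierland": {"Frontierland rivercraft": {}},
        "Main Street, U.S.A.": {"Main Street Magic Shop": {}},
    }
}
print("\n".join(render(sample, max_depth=2)))
```

Raising max_depth expands one more level at a time, which makes it easier to scan a branch before deciding where a term belongs.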

What's next for 2019? Well, tying up some of the loose ends I mentioned (in particular the 50th Anniversary Cast Members and Disney A-Z) will be a big help. I intend to continue improving the hierarchy and going through more sources in the new way detailed above. I would like to finish saving the Disney Parks Blog and bring my newspaper corpus up to date. There's still a lot of work to do.

Peter Mark Roget (of Roget's Thesaurus) and I share a birthday, so I have kind of looked to him as a model. Roget began work on his thesaurus in his mid-20s, but didn't publish it until he was in his 70s. (Never mind that I have the benefit of a computer, of course.) That would put me on track to complete it by Disneyland's 100th birthday in 2055. Maybe a preview edition could be ready for the 75th in 2030.

Happy New Year!



That’s a lot of numbers!! Do you have a tentative completion date/goal?? So much work! Congrats!

Jason Schultz said...

Mike - See the last paragraph! Still a lot of work to do on the input side, but there's also a lot to consider for outputs (whether print or electronic). A print edition containing all the terms and relationships would easily run thousands of pages and be very costly to produce. Start saving up!
