In planning a taxonomy, I have often said that it is important to define the taxonomy’s scope, specifically the subject area scope of its terms, but without going into more detail. Recently a client asked me how to define a taxonomy’s scope. This is a good question. The taxonomy should suit the subject area scope of the content that will be tagged with it and the scope of the users’ expectations. Terms or topics only marginal to the subject scope could still occur in the content, however, and whether to include them in the taxonomy is an open question. Ultimately, that should depend on whether user expectations justify it, since the needs of users are also a factor in creating a taxonomy. A taxonomy should suit both its content and its users.
Sources for Taxonomy Terms
For content as a source of taxonomy terms, a combination of manual and
automated approaches is recommended. By manually reviewing sample
individual documents or content items, you can discern the main ideas
and main topics, which should form the start and basic structure of the
taxonomy and also help define its scope. Automated methods of extracting
terms, through text analytics technologies, can bring in many
additional terms from a much larger corpus of documents more quickly,
picking up terms that a limited manual review would miss. Even though
automated text analytics extracts terms based on relevancy and frequency
of occurrence, such terms could be out of scope of the subject domain.
That’s why it’s important to start first with a manual review of content
to define the subject scope. Then, when you enrich the taxonomy with
automated extraction, you can approve terms that appear to be in scope
or at least closely relevant and reject others. But should you reject
all that are out of scope, even if they appear with sufficient frequency
and relevancy? My advice is to try to assume the role of the user. Ask
yourself: Might a user want to search for content on this term in this
content collection?
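
As a rough illustration of the review step described above, here is a minimal Python sketch. It assumes you already have a list of candidate terms from a text analytics tool; the function name review_queue, the min_docs threshold, and the sample data are all hypothetical. It counts how many documents mention each candidate and lists those not yet in the taxonomy for a human to approve or reject. Matching is a naive substring check, which a real tool would do far better.

```python
from collections import Counter

def review_queue(documents, candidate_terms, taxonomy_terms, min_docs=2):
    """Rank candidate terms not yet in the taxonomy by document frequency."""
    existing = {t.lower() for t in taxonomy_terms}
    doc_freq = Counter()
    for text in documents:
        lowered = text.lower()
        for term in candidate_terms:
            # Naive substring match; an assumption made for brevity.
            if term.lower() in lowered:
                doc_freq[term] += 1
    # Keep only frequent candidates that still need a human decision.
    return [
        (term, n)
        for term, n in doc_freq.most_common()
        if n >= min_docs and term.lower() not in existing
    ]

# Hypothetical sample content for a recipe collection.
documents = [
    "Roast chicken with garlic. Nutrition facts per serving are listed below.",
    "Braised beef with garlic and onions. Nutrition facts included.",
    "Grilled salmon with lemon.",
]
candidates = ["garlic", "nutrition facts", "lemon"]
taxonomy = ["Garlic", "Lemon"]

queue = review_queue(documents, candidates, taxonomy)
print(queue)
```

The output is only a queue for review. Whether a frequent term is in scope remains a judgment made by the taxonomist with the user in mind.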
For user needs and expectations as a contributing source of taxonomy
terms, you can gather input very directly, such as with a user
questionnaire (at least for your internal users) that asks what the
topics of importance are, how they would define the scope, and what
“marginal” topics would be acceptable to include. You could also request
samples of challenging queries that the users would make, as opposed to
expected, basic, or typical ones. Another
good way to obtain input from the user side is to look at search query
logs that list search strings that users have entered over a period of
time, ranked by frequency. If a search phrase that is slightly out of
scope of the subject occurs frequently, then the term should still be
considered for inclusion in the taxonomy.
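
The search log review can be sketched in a few lines of Python. This is only an illustration under assumptions: the log is taken to be a plain list of query strings, and the function name frequent_unmatched_queries and the min_count threshold are hypothetical. It normalizes case and whitespace, ranks queries by frequency, and keeps the frequent ones that do not match an existing taxonomy term.

```python
from collections import Counter

def frequent_unmatched_queries(query_log, taxonomy_terms, min_count=3):
    """Return frequent search strings that match no existing taxonomy term."""
    known = {t.casefold() for t in taxonomy_terms}
    # Normalize case and surrounding whitespace before counting.
    counts = Counter(q.strip().casefold() for q in query_log if q.strip())
    return [
        (query, n)
        for query, n in counts.most_common()
        if n >= min_count and query not in known
    ]

# Hypothetical search log for a recipe site.
query_log = [
    "chicken soup", "Nutrition Facts", "nutrition facts", "braising",
    "nutrition facts ", "chicken soup", "chicken soup",
]
taxonomy = ["Chicken soup", "Braising"]

result = frequent_unmatched_queries(query_log, taxonomy)
print(result)
```

Exact matching against term labels is a simplification. In practice you would also compare queries against synonyms or alternative labels before treating them as new candidates.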
In either case, the
scope of the subject gets better defined as the taxonomy is created. For
example, a taxonomy for recipes may initially be scoped to comprise
terms for the names of dishes, ingredients, and cooking methods. But then
a different term shows up with significant frequency, “Nutrition Facts.” If
it occurs in both the content and the user research, then it likely
should be included. If it shows up in the content only, but is not
validated in user research, then it is more questionable.
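
The reasoning above can be written down as a simple decision rule. The following Python sketch is one possible reading, not a fixed method: the function name triage and the sample term sets are hypothetical, and the outcomes for a term found only in user research ("consider") or in neither source ("exclude") are my assumptions, extrapolated from the advice on search logs.

```python
def triage(term, content_terms, user_terms):
    """Suggest a status for a candidate term based on where it was observed."""
    in_content = term in content_terms
    in_user_research = term in user_terms
    if in_content and in_user_research:
        return "include"        # validated by both sources
    if in_content:
        return "questionable"   # in content, not validated by users
    if in_user_research:
        return "consider"       # assumption: users ask for it, content is thin
    return "exclude"            # assumption: no evidence from either source

# Hypothetical observations for a recipe taxonomy.
content_terms = {"Nutrition Facts", "Braising", "Sous vide"}
user_terms = {"Nutrition Facts", "Braising", "Meal prep"}

decisions = {
    term: triage(term, content_terms, user_terms)
    for term in ["Nutrition Facts", "Sous vide", "Meal prep", "Gardening"]
}
print(decisions)
```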
Taxonomy Structure
The initial taxonomy structure itself tends to impose limits on scope.
Taxonomies tend to be hierarchical with a limited number of top terms.
If a candidate term appears in the content that does not seem to belong
anywhere in the current taxonomic hierarchy, you might be inclined to
exclude it. Factors of user needs (they might want to look up this term
in this content), however, should take precedence. For example, the term
“COVID-19” might be marginal but still of enough interest to include in
many taxonomies on varied subjects, yet those taxonomies would have no
broader term for diseases. Then adjustments need to be made, such as
renaming or adding broader terms, or perhaps, more likely, the proposed
term should be modified to fit the context of the taxonomy, such as
becoming “COVID-19 impacts.”
Another thing to consider is
adopting more of a thesaurus structure than a taxonomy structure, at least
for the facet or concept scheme of the taxonomy that is for
miscellaneous “topics.” One characteristic of thesauri is to not rely so
heavily on extensive hierarchical trees. What this means is that you
could decide that it is acceptable that not all terms have broader terms
and thus it’s OK to have a very large number of top terms, with the
more specific terms linked to other terms only by related-term
relationships, another feature of thesauri, if not by
broader/narrower-term relationships. Abandoning the full hierarchical
tree structure should only be considered if the hierarchy is not
displayed as navigation to the end users.
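
To make the thesaurus-style option concrete, here is a small Python sketch of terms that may or may not have a broader term but can carry related-term links. The Term class, the helper names, and the sample terms (including “Food safety”) are hypothetical. The second helper flags top terms that have neither narrower terms nor related terms, since those are connected to nothing else in the scheme.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Term:
    label: str
    broader: Optional[str] = None               # None means a top term
    related: set = field(default_factory=set)   # related-term links

def top_terms(terms):
    """Labels of terms that have no broader term."""
    return [t.label for t in terms if t.broader is None]

def unlinked_terms(terms):
    """Top terms with no narrower terms and no related-term links."""
    has_narrower = {t.broader for t in terms if t.broader is not None}
    return [
        t.label
        for t in terms
        if t.broader is None and t.label not in has_narrower and not t.related
    ]

# Hypothetical recipe scheme mixing a small hierarchy with related-term links.
terms = [
    Term("Dishes"),
    Term("Soups", broader="Dishes"),
    Term("Ingredients", related={"Nutrition Facts"}),
    Term("Nutrition Facts", related={"Ingredients"}),
    Term("Food safety"),
]

tops = top_terms(terms)
loose = unlinked_terms(terms)
print(tops)
print(loose)
```

A check like this can help keep a large set of top terms manageable: a term without a broader term is acceptable in this structure, but one with no relationships at all deserves a second look.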
Documenting Policy
In any case, you need to define policies regarding what kinds of terms can
be added and what kinds should not. These policies will evolve out of the
activity of building the taxonomy, especially from evaluating which
extracted terms and which search log terms to approve. Whoever is doing
this task (hopefully more than one person) should document each
instance of uncertainty. While many term approvals and rejections will
be obvious, there will be a gray area. These uncertain cases should be
collected and discussed together, and then a policy can emerge.