
"Better Together": a New Approach to Document Search

October 3, 2023

With 200 million academic works available, a number that doubles every nine years, finding papers that correspond to or supplement a line of research can be challenging. At that staggering scale, indexing these real-world resources has proven equally difficult, and ongoing research has developed progressively better methods of doing so.

Enter “Better Together: Text + Context,” a research project led by Professor of the Practice and Senior Principal Research Scientist Kenneth Church that demonstrates the value of combining existing search methods to retrieve the best possible results. Presented at the Center for Language and Speech Processing (CLSP)’s annual Jelinek Summer Workshop on Speech and Language Technology (JSALT), Church’s team from the Institute for Experiential AI at Northeastern University (EAI) includes members from nine institutions and four countries. Building off work by Semantic Scholar, the project focuses on his hypothesis that “being able to compare and contrast different perspectives creates a deeper understanding, although the modern trend is looking for one optimal solution.”

Ricardo Baeza-Yates, director of research at EAI and professor of the practice, explains: “Current language models focus only on the amount of training data used, but to achieve a real understanding of the semantics of a text, we need hybrid language models that also consider the context, a knowledge base, and reasoning capabilities. Better Together is a first step in this direction, adding context (citations) to the content.”

He compares the technological advancement of Better Together to a parrot learning “good morning.” At first, the parrot can produce the correct sounds, which provides content as the foundation, but it does not understand their meaning. As the parrot learns more, it begins to understand when to use the words; for example, it would not make sense to say “good morning” at night, much like how existing document-clustering technology “understands” citation similarities.

Usama Fayyad, executive director at the Institute for Experiential AI, describes “Better Together” as a critical first step in breaking out of using only “deep network parameters,” instead utilizing knowledge graphs and networks in general to produce a reliable, effective model that leverages Large Language Models (LLMs). Until now, LLM work has focused primarily on deep networks trained on large document collections, an approach that relies on discovering and rediscovering patterns and converting them into a more reliable form. However, “Better Together,” in his words, is “a refreshing and major change into a system where prior knowledge and known patterns are leveraged by a LLM framework to achieve much better results.”

Background

For the last 30 years, the CLSP has hosted international teams on speech and language engineering each summer through JSALT. This prestigious workshop is organized by Johns Hopkins Whiting School of Engineering, and its 2023 iteration was held in Le Mans, France by Le Mans Université.

JSALT focuses on broadening its reach and application to adjacent research communities. The “Better Together” project does this by combining two alternative ways of embedding documents into vector spaces: Church posits that SPECTER is better for capturing text (the authors’ perspective), while ProNE is designed to capture context (estimates of audience appreciation based on citations).

SPECTER utilizes abstracts (text) to find similar works via deep nets, while ProNE utilizes linear algebra to find spectral clusters via citations (context). However, “dirty data is a reality.”

During his lecture, Church demonstrated a ProNE (context) search whose results contained similar, highly cited papers; the same search using SPECTER (text) returned a best result with zero citations, despite being a perfect match. How? It was an identical paper! “The same paper can appear in the collection several times, often with nearly identical abstracts and authors, but usually, the citation graph is very different for the two papers. Therefore, multiple representations are helpful when there are missing values and/or dirty data.” As Church summarizes, simply finding the overlapping words does not help: “The abstracts match wonderfully, but it’s a terrible example.” The reverse applies, with far better results, when the corresponding papers have similar citations, demonstrating the improvement gained by using both systems together.

In his words, “there are lots of opportunities for improving academic search such as missing values and duplicates. The Venn diagram below shows that we have abstracts for some papers and citations for some. Many methods focus on the intersection, but we would like to address the union, which is much larger than the intersection.”

At its essence, the team was striving to build a graphical interface that makes navigating the 200 million available papers easier. The two methods in current use are fundamentally different, and searches based on abstracts alone or citations alone are not enough. Popular benchmarks utilize the intersecting information, where text and context are both available, whereas Church emphasizes the opportunity in the union. After all, with imperfect data, some papers have links but not abstracts, or vice versa. An overview of results based on text and context finds that 31% fall out of scope entirely. For the remaining 69%, Church reminds researchers to “use what you’ve got, and when you have neither, find something else.”
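
Church’s “use what you’ve got” fallback can be sketched in a few lines of Python. This is an illustrative reconstruction, not the project’s code: the embedding inputs and the simple averaging of the two similarity signals are assumptions made for the sketch.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_similarity(q_text, q_ctx, d_text, d_ctx):
    """Score a document against a query using text (abstract) and
    context (citation) embeddings, falling back gracefully when
    either representation is missing -- "dirty data is a reality"."""
    scores = []
    if q_text is not None and d_text is not None:
        scores.append(cosine(q_text, d_text))  # SPECTER-style text signal
    if q_ctx is not None and d_ctx is not None:
        scores.append(cosine(q_ctx, d_ctx))    # ProNE-style context signal
    if not scores:
        return None  # neither representation available: "find something else"
    return sum(scores) / len(scores)           # average the available signals
```

The point of the sketch is the control flow, not the scoring: a document missing its citation graph (or its abstract) still gets ranked by whichever signal survives, which is how the union ends up larger than the intersection.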

Try It!

The results of the project can be demonstrated via searches by Paper or Author here.

The search results display text information that can be sorted by three methodologies (including ProNE and SPECTER), but the most important aspect is the use of dotplots (a topic Church wrote about three decades ago), which use color to compare the similarity between search results in a way that is visually appealing and easy to comprehend. Diagonal “blocks” correspond to matches between the search results: the brighter the color, the greater the similarity. The first block at the upper left is always bright yellow, as each paper corresponds entirely with itself, and this pattern continues one block over and down throughout the chart. As you proceed away from the diagonal, the color darkens to black as the comparison becomes more and more distant.
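
At heart, a dotplot of this kind is a pairwise similarity matrix over the ranked results. A minimal sketch, assuming each result already has a fixed-length embedding vector (the row normalization and cosine scoring are illustrative choices, not details from the project):

```python
import numpy as np

def dotplot(embeddings):
    """Pairwise similarity matrix for a list of search results.
    `embeddings` is an (n, d) array with one row per result; cell
    [i, j] is the similarity of result i to result j. Rendered as
    colors, the bright diagonal shows each paper matching itself
    perfectly, and off-diagonal blocks reveal clusters."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows
    return E @ E.T  # cosine similarity of every pair of results
```

Mapping the matrix values to a color scale (bright yellow near 1.0, black near 0.0) reproduces the visual behavior described above: a bright diagonal, with brightness fading as pairs grow more distant.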

In instances where different authors share the same name, the subject matter of their papers can be used to determine whether or not they are indeed the same person. A review of three potentially different authors named “David Madigan” with three papers each shows correlation through the “off-diagonals” of the dotplot.

In this case, it is clear that the first and second Madigan are different, but the first and third are the same based on subject matter; in fact, this David Madigan is the Provost and Senior Vice President for Academic Affairs at Northeastern University. This “reasoning” improves upon previous clustering work, as it provides a clear and intelligent way to distinguish which David Madigan is which. Within these results, Church emphasized the importance of credible, well-cited works using the “h-index,” a figure indicating that an author has h papers with at least h citations each.
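
The h-index itself has a standard computation, which a short sketch makes concrete:

```python
def h_index(citations):
    """Largest h such that the author has h papers with
    at least h citations each."""
    counts = sorted(citations, reverse=True)  # most-cited papers first
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:      # the i-th best paper still has >= i citations
            h = i
        else:
            break
    return h
```

For example, an author with papers cited 10, 8, 5, 4, and 3 times has an h-index of 4: four papers with at least four citations each, but not five with at least five.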

Forecasting

For academic purposes, Church’s team split the 200 million works into 100 “bins” of two million papers each, sorted by the age of the paper. Indexes were built cumulatively, starting with bin 1, then bin 1 + bin 2, all the way up to bin 1 + … + bin 100. In doing so, the citation graph can be reviewed at points in the past, and researchers can see how similarities evolved historically.
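
The cumulative binning scheme can be sketched as follows; the sort key and helper name are assumptions for illustration, not the team’s actual pipeline:

```python
def cumulative_bins(papers, n_bins=100):
    """Split papers into n_bins equal-sized bins by age, then yield
    cumulative slices: bin 1, bins 1-2, ..., bins 1-n. An index built
    on each slice shows the corpus (and its citation graph) as it
    looked at successive points in the past."""
    ordered = sorted(papers, key=lambda p: p["year"])  # oldest first
    size = max(1, len(ordered) // n_bins)
    for k in range(1, n_bins + 1):
        yield ordered[: k * size]  # everything available through bin k
```

Indexing each slice in turn is what makes the forecasting experiments possible: an index built from only the older slices plays the role of “knowledge from a decade ago.”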

Although “this was a lot of work, it was well worth it” as the team could experiment with predictions – with knowledge from a decade ago, how well could they predict what would happen today? And could they predict the outcome in another ten years? As Yogi Berra said, “It's tough to make predictions, especially about the future.” Additionally, this forecasting allows estimation of vectors for papers with missing data.

Crossover Point

While “Better Together” emphasizes utilizing the full data available, Church was also able to establish the “crossover point”: the point at which text or context becomes more useful than the other.

Metcalfe's Law states that costs scale with the number of nodes in a graph (N), and benefits scale with the number of edges (N²). A typical example is a telephone network, where it costs ~N dollars to buy N phones, but the benefits scale with ~N² because each of those phones can call any other phone. More recently, this law has been applied to web search and social media; Church and his team posit that it also applies to academic search.
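
The asymmetry is easy to check with concrete numbers. A toy calculation for a complete graph, where every phone can call every other (the function name is, of course, just for illustration):

```python
def metcalfe(n):
    """Cost scales with nodes (~n); benefit scales with edges (~n^2).
    A complete graph on n nodes has n*(n-1)/2 edges."""
    cost = n
    benefit = n * (n - 1) // 2
    return cost, benefit
```

Ten phones cost ~10 units but create 45 possible connections; a hundred phones cost ~100 units but create 4,950, which is why the edge-driven benefit eventually dominates.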

The academic rule of thumb is that the total literature will double every nine years – “publish or perish.” Church wistfully noted that a graduate student today will enter the workforce with as many published papers as a tenured professor would have in the past.

Applying Metcalfe’s Law to his research findings, larger graphs yield greater benefits, so on a smaller scale, abstract comparisons are more valuable. He believes the “crossover point” came about 18 years ago, when around 50 million papers were available and context became more useful than text. As a real-world application, web searches are easier than those within a university or enterprise, as the world’s “graph” is much larger.

He notes that “the representation of a document should combine time-invariant contributions from the authors with constantly evolving responses from the audience, like social media.” Their forecasting is a key evolution from the traditional standpoint of reviewing a single point in time, seeing which is “better,” and generalizing the results based on that stance.

Moving Forward

Just as the parrot learns to produce the sounds for “good morning” and to know that the phrase is only relevant at the start of the day, the next step is for it to understand the concept of “good morning.” Eventually, it gains the knowledge not to use the phrase in an empty room. This reasoning adds a level of difficulty, and understanding would be the ultimate level of intelligence.

In the case of Better Together, this understanding applies to the text and context. In the previous example, the dotplot can distinguish even between scientific fields and produce a finer analysis of the classification of knowledge: a chemistry researcher could use words similar to those in another scientific discipline, but the clustering becomes more accurate.

After a year of working on this project and being in the field since the 1970s, Church stated that he “felt that there has been a lot of effort recently in things like end-to-end systems and one model to rule them all. I believe in diversity and modularity, so I believe fundamentally that text and context are different and go in different directions, and a lot of this project was trying to work through this tension.”

While Church was the leader of the project, he emphasized the spirit of collaboration at Johns Hopkins’ JSALT conference, the Le Mans Université organizers, and the international Better Together team. In his words, “You put a dozen people together for six weeks, and things happen.”

Church adds, “There is no end! There is always more.” As a member of the Advisory Board for Semantic Scholar, sponsored by the Allen Institute for AI (a Seattle-based non-profit founded by Microsoft co-founder Paul Allen), he now strives to have this organization incorporate his research and ideas to garner further reach and impact. The Allen Institute emphasizes transparency and sharing all available information as a non-profit service, indicating an opportunity to engage a large community, have the community innovate and build prototypes, and bring the best results to the mainstream.

“Better Together is a trail-blazing approach to break out of error-prone, knowledge-free clustering which can invent facts to support an auto-complete task,” says EAI’s Usama Fayyad, who “looks forward to future breakthroughs based on its development.”

Learn more about Kenneth Church and his team’s continuing work here.