
2025-12-16

Internet Search: Yesterday, Today & Tomorrow

In the beginning (yesterday) the Internet was an academic network. Then the Free-nets (including the National Capital FreeNet, of which I was an early member and information provider) were created, providing a place for community organizations and bringing the Internet to the people.

In these early days, before the World Wide Web, the Internet was primarily text based, and we used tools such as Archie to search FTP archives and Gopher to browse documents stored online, with Veronica and Jughead to search the Gopher menus. People also used services such as Usenet to access the equivalent of today’s web forums and IRC (Internet Relay Chat) as a group and private real-time messaging service. Most importantly, we all had Email, which IMHO is still the most important thing the Internet gives us as individuals.

Then came the World Wide Web and HTML and everything changed. The Internet was still non-corporate, being primarily educational institutions, non-profit organizations and individuals, but that soon changed, many say for the worse, when corporations were allowed onto the network. Though I would certainly miss online banking and shopping, and streaming services have given us access to non-North American “television” we would not have had otherwise.

The WWW gave individuals an opportunity to have their own place on the Internet through personal websites (also called Home Pages back then). Internet Service Providers would provide customers with web storage they could use to create their own web pages using HTML, and sites like GeoCities made it even easier. Then came Myspace, a sort of Facebook lite. There were other sites serving the same user base that wanted their own place on the Internet, and they all co-existed peacefully. And then came Facebook and everything changed for the worse. Most people criticize Facebook for its tracking of users and monetization of their and their “friends’” personal information, but to me the most evil thing about it is its business model of trying to keep users away from the open Internet and dependent on its proprietary site.

At one time, long before Facebook, there was even a print Internet Yellow Pages that listed all the significant websites on the World Wide Web but it quickly became necessary to have some online tool for people to find what they were interested in without depending on prior knowledge, friends or just luck.

When we started using the Internet for research or to find information we were not looking for specific answers to specific questions but for resources where we could find those answers.

And perhaps the best tool for that was the original Yahoo Directory, which was a hierarchical listing by subject of web resources curated by librarians to ensure the legitimacy of the sources. Other directories also existed, particularly subject specific ones. As the Internet grew exponentially, keeping up a complete directory became an impossible task, or at least economically impossible in competition with the search engines that also existed at that time.

In the beginning we used search engines the same way as the Yahoo Directory, to find resources where we could find the information we were seeking. Perhaps the best of the early search engines, and my personal preference, was Digital Equipment Corporation's AltaVista search engine, which allowed users to do a Boolean search using AND, OR & NOT operators. Soon people started using search engines to find specific answers to specific questions.

AltaVista and almost all other search engines were surpassed by the original Google search engine, whose algorithm impressed everyone so much that it became the dominant search engine. Its advanced search mode also allowed Boolean searches. It became my (and most people’s) search engine of choice for a long time.

Then came the enshittification of both search and the Internet as a whole.

The enshittification of search happened as Google gained an effective monopoly on Internet search, so much so that to search the Internet became “to google” as nearly all searches were done using Google. And then we saw the gradual degrading of Google as it monetized its search engine. We would see promoted links at the top of search results that were paid for. A search for, as an example, Ford F-150 would have the Chevy Silverado as the first listed result because General Motors paid for that. And then we started getting results in the form of answers to questions rather than as links, and people referring to “Google said/told me” rather than referring to the sources Google found.

Somewhere along the line the advanced Boolean search capability disappeared from Google, and then it became contaminated by LLM chatbots spouting spurious answers and information. It may be possible with enough effort to turn off the AI slop in Google, but personally I would not trust that that is so. Google’s once famed reliability is now in the dumpster. And of course Google has become infamous for tracking its users.

People have started to slowly move away from Google to privacy supporting search engines like DuckDuckGo, although it has been criticized for its optional AI features, which are at least a lot easier to disable in settings than Google’s. I personally use the non-AI version of DuckDuckGo (https://noai.duckduckgo.com/) which has the AI features disabled. I only wish it had obvious Boolean search capabilities, although there are apparently ways to do Boolean searches and other advanced search techniques in DDG (that I did not know about until I researched this post).

But the enshittification goes beyond Google search and has infected the whole of the Internet/World Wide Web. Over the last 20 years or so we have seen a proliferation of fake news and disinformation sites, and social media has increased the amount of misinformation and disinformation online by orders of magnitude.

But the user is also to blame. The reason for Facebook’s success is the fact that consumers today put convenience above all else, and when you add the super convenient magic answer machine of LLM based AI chatbots, which base their answers on whatever is repeated most (the GIGO principle), the result is inevitably garbage.

Tomorrow’s search function requires a better way for those of us more interested in accuracy than convenience. Let me suggest a new model that puts a Boolean search engine on top of a directory of trusted sites and builds from there.

We start with an original Yahoo type directory curated by librarians and subject specialists. The directory is hierarchical, starting with broad subjects and narrowing to more specific ones. You can browse or search the directory to find the field of knowledge you are interested in and select relevant websites from there.

The curators will not attempt the impossible task of vetting all the content on the websites/resources, but the resources will be selected according to the trustworthiness of those responsible for them. Different categories of resource will be vetted differently according to their nature.

Information resources on science, the humanities and the social sciences will be judged according to the reliability of the content as ascertained by the trustworthiness of those responsible for them.

There will be a general information category for encyclopedias and similar broad works.

Journalistic sources will be judged according to the journalistic principles of the organizations: ethics, fact checking, distinguishing opinion from news content, etc. Sites that are solely expressing opinions will be identified as such and, where possible, identified according to bias, right leaning, left leaning, etc. Satire sites will be identified as such for those that cannot figure that out.

Political sites will not be vetted according to accuracy but according to whether they are actually who they say they are and not attempts to spoof or misrepresent the opinions of politicians or political organizations. Similarly for corporate and banking sites as a protection against fraud.

Social media sites will be included in the listings for those that seek them out but will not be included automatically in searches.
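To make the directory part more concrete, here is a minimal sketch in Python of what a single curated entry might record. Everything here, the field names, the categories and the example site, is hypothetical illustration, not a specification; the real design would be up to the curators.

    from dataclasses import dataclass

    # A minimal sketch of one entry in the curated directory.
    # All field names and categories are hypothetical, not a specification.
    @dataclass
    class DirectoryEntry:
        url: str              # the website or resource being listed
        subject_path: list    # hierarchical subject, broad to narrow
        category: str         # e.g. "information", "general reference", "journalism",
                              # "opinion", "satire", "political", "corporate", "social media"
        operator: str         # who is responsible for the resource
        vetting_notes: str    # how and why the curators trusted it
        include_in_default_search: bool = True   # social media would default to False

    example = DirectoryEntry(
        url="https://example-ecology-journal.org",   # hypothetical site
        subject_path=["Science", "Biology", "Ecology"],
        category="information",
        operator="Example Society of Ecologists",
        vetting_notes="Peer reviewed journal run by a recognized scholarly society.",
        include_in_default_search=True,
    )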

The next level of search will be the ability to search not just for information resources/websites but also within them, like a normal web search but restricted to sites within the directory, either as a whole, by specific subject matter, or by specific website.

And finally a full Internet search will be available where that is desired. The ability to exclude social media sites (and perhaps certain other categories) will be included. All searches will have full Boolean search capability, and resources on how to understand and use it will be provided.
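Continuing the sketch above (it assumes the DirectoryEntry class and example entry from the previous sketch), here is roughly how the three levels of search could be scoped. The keyword matching is a trivial stand-in; a real system would plug a proper Boolean engine in underneath.

    # Continues the DirectoryEntry sketch above. Trivial keyword matching
    # stands in for the real Boolean engine.
    def search(terms, directory, level="directory", subject=None,
               include_social_media=False):
        entries = directory
        if not include_social_media:
            entries = [e for e in entries if e.category != "social media"]
        if subject is not None:
            entries = [e for e in entries if subject in e.subject_path]

        if level == "directory":
            # Level 1: search the curated listings themselves to find resources.
            return [e for e in entries
                    if all(t.lower() in (e.operator + " " + e.vetting_notes).lower()
                           for t in terms)]

        # Level 2: an ordinary Boolean web search restricted to the selected
        # sites; level 3 is the same query with no site restriction at all.
        # Here we just return the site list that would be handed to the engine.
        return [e.url for e in entries]

    print(search(["ecologists"], [example]))   # finds the example entry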

A final capability, which I am on the fence about whether it should be included, is a natural language question search capability with an algorithm to translate questions into Boolean search terms.

The big question here becomes how this can be funded. Ideally enough users would be willing to pay for accurate search to make it work, but let’s not delude ourselves about the majority of Internet users. So it would probably require some major donors willing to fund it because it is good for society, and hopefully broadly distributed, with small individual donations being at least a significant portion of the funding.

2025-11-27

Postscript AI

When I finished writing my last blog post I started to wonder what it would look like if I had asked an LLM-based AI to write it. I have little doubt it would have been a mishmash of other people’s stolen thoughts, perhaps along with stuff the AI just made up. Even if I had given the AI the subheadings I still cannot imagine any way it would look anything like my thoughts. Perhaps I could have used AI to find the links I used to add documentation to my thoughts, but I would still need to read them all to see if they were appropriate making it no better, likely worse, than just using a search engine to find the links.

Of course the LLM versions of AI are not designed to actually write thoughtful articles, but they may be able to create “content” that looks like someone wrote it, the same way people steal Wikipedia articles (accounting for the multitude of identical articles all over the web), just to get clicks to ads.

The whole premise of LLM-based AI seems to be turning quantity into quality, as if putting more garbage in will get you better garbage coming out, but it’s still just GIGO and certainly has nothing to do with intelligence.

With their high water and energy consumption LLMs seem like a pretty wasteful way to just create words strung together. Just give monkeys some keyboards instead.

2025-09-01

How to Build an Intelligent Online Answer Machine

Ever since Facebook and Amazon, people have become lazier, or perhaps more accurately addicted to convenience over all else, including ethics or accuracy.

When it comes to information, we used to search out reliable sources and read them in detail to find answers to our questions. Now people just ask so-called “chatbots” the question and accept whatever they are given, based on so-called artificial intelligence (AI), which has nothing to do with intelligence or even accurate knowledge, being based on Large Language Models (LLMs) that probe the depths of the Internet to try to guess at what type of answer a real person would give based on all the garbage ever posted on the Internet, ignoring the Garbage In Garbage Out (GIGO) principle.

If people insist on not doing their own research there must be a better way for an “Online Answer Machine” to do it for them.

First you need a decent search engine that can handle the Boolean operators AND, OR and NOT, plus “quotation marks for exact phrase searches”. Google Advanced Search at its prime, before enshittification, would be ideal.
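For the curious, the Boolean part is not exotic. A toy matcher along these lines (a sketch only, checking one document’s text rather than a real index, and assuming AND binds tighter than OR) captures the idea:

    import re

    # Toy Boolean matcher: AND, OR, NOT and "quoted phrases" evaluated against
    # a document's plain text. A real engine would use an inverted index, but
    # the query semantics are the same. NOT negates the term that follows it.
    def matches(query, text):
        text = text.lower()
        tokens = re.findall(r'"[^"]+"|\S+', query)

        # Split the query on OR; within each clause, terms are ANDed together
        # (the word AND is accepted but implicit, so it is simply skipped).
        clauses, clause = [], []
        for tok in tokens:
            if tok.upper() == "OR":
                clauses.append(clause)
                clause = []
            elif tok.upper() != "AND":
                clause.append(tok)
        clauses.append(clause)

        def clause_ok(terms):
            negate_next = False
            for term in terms:
                if term.upper() == "NOT":
                    negate_next = True
                    continue
                present = term.strip('"').lower() in text
                if present == negate_next:   # required term missing, or NOT-term present
                    return False
                negate_next = False
            return True

        return any(clause_ok(c) for c in clauses if c)

    # Example: pages that mention the F-150 and towing but not the Silverado.
    print(matches('"Ford F-150" AND towing NOT Silverado',
                  "The Ford F-150 towing capacity is impressive."))      # True
    print(matches('"Ford F-150" AND towing NOT Silverado',
                  "Compare Ford F-150 and Chevy Silverado towing."))     # False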

You also need a sophisticated algorithm (some people might call this AI) that can translate natural language questions into Boolean search terms and identify the subject of the question.

The next part is the key to the whole process. You need a human curated database of accurate, reliable and authoritative information sources (web sites or other online sources) indexed by subject matter.

When a question is asked the algorithm would translate it into search terms, determine the subject and search the appropriate sources for that subject to extract an answer for the user, along with citations and links to the sources the answer was taken from.
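Put together, the flow might look something like the sketch below. Every helper here is a deliberately naive stand-in for the harder pieces described above (the curated database is a single made-up entry, the question-to-terms step is just stopword removal), so it shows the shape of the system, not an implementation.

    # Sketch of the pipeline: question -> search terms + subject ->
    # search the curated sources for that subject -> answer with a citation.
    STOPWORDS = {"what", "is", "the", "of", "a", "an", "how", "to", "from", "does", "do"}

    # Hypothetical curated database: subject -> list of (url, text) sources.
    CURATED = {
        "astronomy": [
            ("https://example-observatory.org/moon",   # hypothetical source
             "The Moon orbits Earth roughly every 27.3 days. "
             "The Moon's average distance from Earth is about 384,400 km."),
        ],
    }

    def question_to_terms(question):
        # Stand-in for the natural-language-to-Boolean algorithm: keep the
        # content words; a real system would build a proper Boolean query.
        words = question.lower().strip("?").split()
        return [w for w in words if w not in STOPWORDS]

    def classify_subject(terms):
        # Stand-in for subject identification: here we only know one subject.
        return "astronomy"

    def answer(question):
        terms = question_to_terms(question)
        subject = classify_subject(terms)
        for url, text in CURATED.get(subject, []):
            for sentence in text.split(". "):
                if all(t in sentence.lower() for t in terms):
                    return sentence.strip().rstrip(".") + ". (Source: " + url + ")"
        return "No answer found in the curated sources for this subject."

    print(answer("What is the average distance from Earth to the Moon?"))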

This certainly will not be as good as doing your own research and choosing your own sources, but this would not be built for people who want to, or know how to, do their own research.

2024-05-12

The Scourge of the Internet

No, I am not writing about the fear and hate mongering taking over the Internet, although they are the greatest evils of the Internet. And I am not talking about corporate social media with all its evils of turning the customer into the product; at least it can facilitate communication and community and even activism. I am talking about something much subtler and seemingly innocuous.

The Scourge of the Internet is so-called influencers and content creators.

When I think of influencers, The Kardashians are the first people who come to mind, people famous for being famous. Influencers online are about being famous, and being charismatic or outrageous seems to be the way to go. But influencers are not really out to influence anyone; they are just looking for followers that can be monetized.

As for content providers, the word content says it. They are not about providing real information or knowledge; it’s just about creating something to stick in between the advertising. That is why when you go researching online you keep finding multiple websites with exactly the same information, word for word (usually stolen from Wikipedia). Content providers are just sticking content they steal in between the advertising. Again, all for hits and advertising revenue.

These things may seem innocuous but they clutter up the Internet with meaningless pap, making finding real information increasingly difficult, if not close to impossible. And AI is just going to make everything worse as the LLMs behind it feed on this mountain of garbage for the ultimate GIGO effect.

Can we have our old Internet back please – a place for information, communication and community.

2024-01-03

AI Has Nothing To Do With Intelligence

AI has nothing to do with intelligence but people believe the marketing hype, mostly because we have a distorted idea of what intelligence is, largely due to the media.

Take the quiz show “Are You Smarter Than a Fifth Grader”, which says in its name that it’s about whether contestants are as intelligent as a fifth grade student. What the show actually tests is who is more familiar with the grade five curriculum, grade five students or people who have not been in school for twenty years or more. I know who I am betting on.

And take the famously super intelligent Jeopardy champions. Maybe some of these people are highly intelligent but that is not why they are Jeopardy champions because Jeopardy is not about intelligence. It is about knowing stuff, particularly the type of stuff Jeopardy asks questions about. At best it is about knowledge, not intelligence.

The Cambridge Dictionary defines intelligence as: “the ability to learn, understand, and make judgments or have opinions that are based on reason”. (Source)

I would refine that to: “the ability to understand and analyze information in order to make rational decisions based on that information”.

Intelligence is not about information; it is about reasoning.

I remember what some might call the first forerunner to Alexa and other chatbots. It was called ELIZA.

ELIZA's creator, Weizenbaum, intended the program as a method to explore communication between humans and machines. He was surprised and shocked that individuals, including Weizenbaum's secretary, attributed human-like feelings to the computer program.[3] Many academics believed that the program would be able to positively influence the lives of many people, particularly those with psychological issues, and that it could aid doctors working on such patients' treatment.[3][13] While ELIZA was capable of engaging in discourse, it could not converse with true understanding.[14] However, many early users were convinced of ELIZA's intelligence and understanding, despite Weizenbaum's insistence to the contrary.[6] (Source)

This was not artificial intelligence and neither are the latest claimants, the large language models (LLMs).

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs are artificial neural networks following a transformer architecture.[2]

As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.[5]

Notable examples include OpenAI's GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT), Google's PaLM (used in Bard), and Meta's LLaMA, as well as BLOOM, Ernie 3.0 Titan, and Anthropic's Claude 2. (Source)

Using statistics to mimic what a human might say or write is not reasoning and it is certainly not intelligence.
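To illustrate what that means in the barest possible terms, here is a toy version of the “repeatedly predicting the next token” loop from the quote above. The “model” is just word-pair counts over a tiny made-up corpus; a real LLM uses an enormous neural network, but the generation loop has the same shape, and nothing in it ever checks whether the output is true.

    import random
    from collections import defaultdict, Counter

    # Toy next-word predictor: count which word follows which in a tiny corpus,
    # then generate text by repeatedly picking a statistically likely next word.
    corpus = ("the ocean is full of water . the ocean is deep . "
              "the desert is full of sand .").split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def generate(prompt, length=8):
        words = prompt.split()
        for _ in range(length):
            options = counts.get(words[-1])
            if not options:
                break
            # pick the next word in proportion to how often it followed the last one
            words.append(random.choices(list(options),
                                        weights=list(options.values()))[0])
        return " ".join(words)

    print(generate("the ocean"))
    # e.g. "the ocean is full of sand . the desert" -- fluent-looking, not true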

It might not be so bad if these systems did not claim to be intelligent but only claimed to be able to retrieve accurate information and did that well, but they are designed to NOT do that.

I remember the early Internet and search engines with advanced Boolean search capability like AltaVista, and the early versions of Google before they sold their top search results to the highest bidder.

Back then the Internet was mainly academic institutions and community based organizations. The information on the Internet was relatively reliable most of the time. That information is still there if you pay attention to the actual source.

LLMs could use an information base built on actually reliable sources like Encyclopedia Britannica or Wikipedia, or the collections of actual scientific journals and other respected sources.

But instead they have adopted the bigger/more is better approach, feeding as much of the Internet as possible into their models, often without the permission of the sources/creators. This leads to an information base dominated by misinformation and disinformation, producing results like “there is no water in the Atlantic Ocean”. But obvious errors are not the real danger here; the danger is the amplification of misinformation and disinformation in the political sphere.

But it is worse. These disinformation models are proving to be even more wasteful of energy and harmful to the planet than the cryptocurrency scam, and their believers/followers are just as faithful and misguided. And for what? Obviously they hope to make a shitload of money from this scam.

AI is clearly not intelligent, just dangerous.