I want to know what it does for words that can be either a or an like herb.
Well, my example of the word ‘elephant’ has the same property as ‘herb’, where the choice of ‘a’ or ‘an’ can depend on who you ask. I chose that example to anticipate this exact question, and I believe I already gave you an answer.
Let me put it this way: it depends on the data the LLM (ChatGPT, for example) was trained on. If the training set contains only text written by people in the United Kingdom, the data will favor “a herb”, because the ‘h’ is pronounced there. Data from the United States will favor “an herb”, because the ‘h’ is usually silent when spoken out loud.
As a fairly general rule, people use the article “an” before a vowel sound (as when a leading “h” is silent) and “a” before a consonant sound (as with a pronounced, or aspirated, “h”). Training data is usually gathered from multiple English-speaking countries, so both “an herb” and “a herb” will appear in it, and the LLM will tend to pick whichever form appears more often (the data will be biased toward it).
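To make the "whichever appears more often" idea concrete, here is a minimal Python sketch. It is only an illustration: a real LLM learns probabilities over tokens rather than counting phrases, and the function names (`article_counts`, `favored_article`) and the tiny corpora below are made up for this example.

```python
import re
from collections import Counter


def article_counts(corpus, noun):
    """Count occurrences of 'a <noun>' and 'an <noun>' in a list of sentences."""
    pattern = re.compile(rf"\b(an?)\s+{re.escape(noun)}\b", re.IGNORECASE)
    counts = Counter()
    for sentence in corpus:
        for match in pattern.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts


def favored_article(corpus, noun):
    """Return the article seen most often before the noun, or None if never seen."""
    counts = article_counts(corpus, noun)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented toy sentences standing in for UK-style and US-style training text.
uk_corpus = ["Basil is a herb.", "She picked a herb from the garden."]
us_corpus = ["Basil is an herb.", "Cilantro is an herb.", "Mint is an herb too."]
mixed_corpus = uk_corpus + us_corpus

print(favored_article(uk_corpus, "herb"))     # UK-only text
print(favored_article(us_corpus, "herb"))     # US-only text
print(favored_article(mixed_corpus, "herb"))  # mixed text: the majority form wins
```

With these toy corpora, the UK-only text favors “a”, the US-only text favors “an”, and the mixed text favors “an” simply because there are more US-style sentences in it.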
Just for fun, I asked the LLM running on my local machine.
Prompt: “Fill in the blank: ‘It is _ herb.’”
Response: “It is an herb.”