I've also noticed that some messages have NULL in the message_id column in the database.
Email Fetching Duplicate Messages
-
I've added some fixes to the mail importer class that I believe should solve the issue of emails imported by parallel processes causing duplicates.
Could you apply the changed file to your instance manually to check whether it helps?
https://raw.githubusercontent.com/es...l/Importer.php
-
I added this update and it doesn't seem to help the issue. I have 2 group emails and approximately 30 personal emails in the system. The personal emails have inbox and sent items monitored. We're using Gmail for business.
I ran it for approximately 3 hours today and noticed in my MySQL processes (live view) that there were several (7 or so) open sleeping connections with a time of 100+ listed. I was still getting duplicate and triplicate emails in the system and sent from the system. It appeared to be mainly on personal emails. Gmail was also having some issues this afternoon, but when I reverted to the original Importer.php it seemed to improve.
I've been having random duplicate emails for a while, but very few and far between. Today, before the change to Importer.php I had 15 duplicates within 1 hour. I changed the Importer.php and was fine for about an hour then had 17 duplicate/triplicates the following hour.
I was unable to find the spot in data/config.php to change anything about the cron execution. Maybe I'm looking in the wrong spot?
Last edited by joy11; 01-25-2018, 09:51 PM.
-
The emails do have identical message_id fields in the database. I have the max email portion size set to 20 on personal and 20 on group. I turned off most of my workflow jobs to try to improve performance. This morning I changed 'minExecutionTime' to 60 (before I saw your post below). It didn't seem to hurt anything during the day, and I had only 1 duplicate for several hours. This evening I upgraded to 5.0.3 and have seen several duplicates since the upgrade. I will wait for the next release now since my upgrade is clean.
-
Version 5.0.3
Max email portion size for personal account fetching: 4,000 (I used 10 originally, but I bumped it up when I realized how long import would take with ~10K messages).
All dependencies met at install, I also added in php mailparse before bringing messages into the system (not mentioned in the installer dependency section if I remember right).
I was only using one personal email account - I did have my SMTP info entered into every place possible in the CRM though - perhaps that's contributing? I'm monitoring both Inbox and Sent folders. My email is provided by Godaddy - I use their Workspace Client online right now.
I do develop - on the surface it does feel like it might be a bit of cron job failing mixed with inadequate data control in the database / processing script to check for duplicate messages. You could create some kind of hash column in the database and create the hash based on the actual message, then make the column a unique ID. The null message_id thing strikes me as being a quirk of email standards being all over the place. Every email I get from one company triggers an email format specification warning in the error logs.
Hope that helps!
-
You process email in parallel? I'm actually very familiar with PHP in parallel (I wrote an entire parallel processor for an internal program) - how are you processing in parallel? Are you dividing the work into piles? My own processor gives each worker its own pile of tasks to avoid race conditions, but when it comes to inserting into MySQL - you can't do parallel inserts to the same database table - they are sequential no matter how many connections you have - if you are generating multiple connections, you're wasting your time - it has no impact on insert performance. A better method is to process the data in parallel outside the database - if anything can be done outside the database - then recombine the edited data and do big insert statements of 5,000 or so records.
-
I meant multiple cron runs executing in parallel: one script is running, and another starts before the previous one has finished. If you have only one personal account, that should not happen unless a cron run takes more than 2 hours. That could have happened here; in that case the system treats the job as failed and starts it again (up to 3 times before terminating). These params are configurable in data/config.php.
I think you need:
1. Drop max portion size to ~100.
2. Apply the changed file I linked above.
-
Note that you may still have a job running right now, and it can take some time to finish.
Cron config params:
PHP Code:
'cron' => array(
    /** Max number of jobs per one execution. */
    'maxJobNumber' => 15,
    /** Max execution time (in seconds) allocated for a single job. If exceeded, the job is set to Failed. */
    'jobPeriod' => 7800,
    /** Attempts to re-run failed jobs. */
    'attempts' => 2
)
-
We can't afford to apply a unique key for now, for the following reasons:
1. Users can already have nulls and duplicates in their database.
2. The upgrade would take way too long, since the email table is usually big.
You can apply it manually to your database. The exception will be caught and no fatal error should occur.
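To illustrate the idea of letting a unique key reject duplicates and catching the resulting exception, here is a minimal sketch. It is not the EspoCRM importer code: the `email` table layout, the `insertEmailOnce` helper, and the in-memory SQLite database (which needs the pdo_sqlite extension) are assumptions made so the example runs on its own. With MySQL the same SQLSTATE class 23 is reported for a duplicate key.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of EspoCRM): insert a row and report
 * whether it was stored. Returns false when the unique key rejects the
 * row as a duplicate; any other database error is re-thrown.
 */
function insertEmailOnce(PDO $pdo, string $messageId, string $subject): bool
{
    $stmt = $pdo->prepare('INSERT INTO email (message_id, subject) VALUES (?, ?)');
    try {
        $stmt->execute([$messageId, $subject]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint violation.
        if (strpos((string) $e->getCode(), '23') === 0) {
            return false;
        }
        throw $e;
    }
}

// In-memory SQLite stands in for the real database (assumption for the demo).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE email (
        id INTEGER PRIMARY KEY,
        message_id TEXT NOT NULL UNIQUE,
        subject TEXT
    )'
);
```

Calling `insertEmailOnce($pdo, '<1@example.com>', 'Hello')` twice stores the row once; the second call returns false instead of raising a fatal error, even if two processes race on the same message.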
-
Ok. Now we are working on this issue. There are two points:
1. Make it possible for jobs to run for a long period, longer than the current 2-hour limit. We will need to control the PID. At least we can pull it off for Unix-based systems.
2. Prevent email duplicates that can occur in extreme conditions, e.g. when multiple processes are executed simultaneously and the server runs out of resources.
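One common way to keep two cron runs from importing at the same time is an exclusive lock file. The sketch below uses PHP's `flock` rather than the PID tracking mentioned above, so it is a swapped-in technique and not what EspoCRM does; the helper names and the lock file path are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: try to take an exclusive, non-blocking lock.
 * Returns the open file handle on success, or null if the lock file
 * cannot be opened or another run already holds the lock.
 */
function acquireRunLock(string $lockFile)
{
    $handle = fopen($lockFile, 'c'); // create if missing, never truncate
    if ($handle === false) {
        return null;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/** Release the lock and close the handle returned by acquireRunLock(). */
function releaseRunLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

// Assumed lock file location, chosen only for this example.
$lockFile = sys_get_temp_dir() . '/email-import-example.lock';

$lock = acquireRunLock($lockFile);
if ($lock === null) {
    echo "Another import is still running, skipping this run\n";
} else {
    try {
        // ... fetch and import emails here ...
    } finally {
        releaseRunLock($lock);
    }
}
```

The operating system drops the lock when the process exits, so a crashed run does not leave a stale lock behind the way a leftover PID file can.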
-
I was looking at your importer code - are you looking for duplicates based strictly on message ID?
protected function findDuplicate(Entity $email)
{
    // Only messages that carry a message ID can be matched this way.
    if ($email->get('messageId')) {
        $duplicate = $this->getEntityManager()->getRepository('Email')->where(array(
            'messageId' => $email->get('messageId')
        ))->findOne();
        if ($duplicate) {
            return $duplicate;
        }
    }
    return null;
}
Hashing the actual message content into a key of up to 255 characters might work better. I saw a ton of NULL message IDs before I disallowed NULLs in the database message id column - unless there is an issue deeper in your program that is erasing them.
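A rough sketch of that hashing idea, under my own assumptions: the `emailDedupKey` function, the choice of fields, and the use of SHA-256 are illustrative and not taken from the importer. It prefers the Message-Id header and only falls back to a content hash when the header is missing.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of EspoCRM): build a dedup key for a message.
 * Uses the Message-Id when present; otherwise hashes a few normalized fields.
 * The hashed form is always 69 characters, well under a 255-character column.
 */
function emailDedupKey(
    ?string $messageId,
    string $from,
    string $date,
    string $subject,
    string $body
): string {
    $messageId = trim((string) $messageId);
    if ($messageId !== '') {
        return 'id:' . $messageId;
    }

    // Normalize line endings and surrounding whitespace so trivial
    // formatting differences do not change the hash.
    $normalizedBody = trim(str_replace(["\r\n", "\r"], "\n", $body));
    $parts = [strtolower(trim($from)), trim($date), trim($subject), $normalizedBody];

    // Join with a NUL separator so field boundaries stay unambiguous.
    return 'hash:' . hash('sha256', implode("\0", $parts));
}
```

Two copies of the same message without a Message-Id then produce the same key, which could be stored in a unique column. The trade-off is that two genuinely different messages with identical sender, date, subject, and body would also collapse into one.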
-
So I just got off the phone with Godaddy - I'm not sure if this is the issue or not, but my email is very old, it used to be on a POP3 Server but they migrated everyone to an IMAP server. So a lot of my mailbox at one point in time was handled by POP3, I don't know if this affects email formatting but I figure it's worth mentioning. I'm looking at having to just cut ties with the box and move to something more modern and *clean slate*.
-
I just dug in deeper here, looking at message headers in my workspace email client to see if messages with null message-id fields were missing the 'Message-Id' header, and they are.
I also upgraded the Importer.php file and reverted to the old database index / column settings for the message-id column - tons of duplicates are entering the system.